Hugging-Face-Supporter / tftokenizers

Use Huggingface Transformer and Tokenizers as Tensorflow Reusable SavedModels
Apache License 2.0

Tokenizer's `model_max_length` is not consistent #3

Open MarkusSagen opened 2 years ago

MarkusSagen commented 2 years ago

Most tokenizers define their max model length as 510 tokens or more, based on the maximum sequence length of the underlying model (e.g. 512 positions minus the `[CLS]` and `[SEP]` special tokens for BERT-style models).

Most tokenizers follow this convention, but some report a practically unlimited length, with `tokenizer.model_max_length = 1000000000000000019884624838656` (see the example below).
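
For illustration, a quick way to see these values is to load a couple of tokenizers and print `model_max_length`; the checkpoint names here are just examples, and which ones report the sentinel depends on the `transformers` version and each checkpoint's tokenizer config:

```python
from transformers import AutoTokenizer

# Illustrative only: which checkpoints report the "no limit" sentinel
# depends on the transformers version and the hosted tokenizer configs.
for name in ["bert-base-uncased", "gpt2", "xlm-roberta-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.model_max_length)

# bert-base-uncased prints 512; tokenizers with no configured limit print
# the sentinel 1000000000000000019884624838656 (roughly 1e30).
```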

This means that when converting the tokenizer's max length to TensorFlow, most values are assumed to fit in a 32-bit int, but the near-infinite sentinel overflows it (it does not even fit in `tf.int64`), so the value needs a wider representation or has to be clamped to a practical maximum for the conversion not to fail.
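
To make the failure mode concrete, here is a small sketch (not code from this repo) showing that the sentinel overflows both the 32-bit and the 64-bit integer range, which is why a plain cast to a TensorFlow integer tensor cannot work:

```python
import numpy as np

# The value transformers uses when no maximum length is configured (~1e30).
SENTINEL = 1000000000000000019884624838656

print(SENTINEL > np.iinfo(np.int32).max)  # True: does not fit in int32
print(SENTINEL > np.iinfo(np.int64).max)  # True: does not fit in int64 either
# tf.constant(SENTINEL, dtype=tf.int64) would therefore fail, so the value
# has to be clamped (or treated as "unlimited") before conversion.
```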


Initially, the tokenizer's `model_max_length` was set dynamically, but it is now hard-coded to 510 tokens. This should be changed to reflect each tokenizer's actual maximum length (see the sketch below).
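
A minimal sketch of what resolving the length dynamically could look like, assuming a hypothetical `resolve_max_length` helper and a fallback of 512 tokens (both are illustrative choices, not the library's actual API):

```python
import tensorflow as tf
from transformers import AutoTokenizer

# transformers reports roughly 1e30 when no maximum length is configured.
NO_LIMIT_SENTINEL = int(1e30)
FALLBACK_MAX_LENGTH = 512  # assumed default; choose per model


def resolve_max_length(tokenizer) -> tf.Tensor:
    """Return the tokenizer's model_max_length as an int64 scalar,
    clamping the "unlimited" sentinel to a usable fallback."""
    max_len = tokenizer.model_max_length
    if max_len >= NO_LIMIT_SENTINEL:
        # The sentinel does not fit in any TensorFlow integer dtype.
        max_len = FALLBACK_MAX_LENGTH
    return tf.constant(max_len, dtype=tf.int64)


tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(resolve_max_length(tok))  # tf.Tensor(512, shape=(), dtype=int64)
```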