dccuchile / beto

BETO - Spanish version of the BERT model
Creative Commons Attribution 4.0 International

Config file missing max_len #16

Closed: matirojasg closed this issue 3 years ago

matirojasg commented 3 years ago

Hi, there's something strange with this model when using the transformers library:

In [4]: from transformers import AutoTokenizer

In [5]: tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")

In [6]: tokenizer.model_max_length
Out[6]: 1000000000000000019884624838656

In [7]: tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

In [8]: tokenizer.model_max_length
Out[8]: 512

So it returns a wrong value for model_max_length (that huge number is transformers' VERY_LARGE_INTEGER placeholder, int(1e30), used when no maximum length is configured), while for another model like BERTurk it returns the correct value.

The easiest fix would be to extend tokenizer_config.json with a "max_len": 512 entry :)
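
For anyone hitting this before the hosted config is updated, a client-side workaround (a sketch, not part of the original report) is to pass model_max_length explicitly; AutoTokenizer.from_pretrained forwards extra keyword arguments to the tokenizer constructor:

In [9]: tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased", model_max_length=512)

In [10]: tokenizer.model_max_length
Out[10]: 512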


josecannete commented 3 years ago

Hi @matirojasg,

Thank you for reporting this issue.

It should be resolved now on both models.

Also, all the configs were updated, adding support for both fast and legacy tokenizers and for both frameworks, PyTorch and TensorFlow.

Regards
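
Assuming the updated tokenizer_config.json is already live on the Hub, a quick way to verify the fix in a fresh session is to reload the tokenizer while bypassing any previously cached config (force_download=True re-fetches the files); 512 is the expected value after the fix:

In [1]: from transformers import AutoTokenizer

In [2]: tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased", force_download=True)

In [3]: tokenizer.model_max_length
Out[3]: 512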