dccuchile / beto

BETO - Spanish version of the BERT model
Creative Commons Attribution 4.0 International

Config file missing max_len #16

Closed: matirojasg closed this issue 3 years ago

matirojasg commented 3 years ago

Hi, there's something strange with this model when using the transformers library:

In [4]: from transformers import AutoTokenizer

In [5]: tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")

In [6]: tokenizer.model_max_length
Out[6]: 1000000000000000019884624838656

In [7]: tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

In [8]: tokenizer.model_max_length
Out[8]: 512

So it returns a wrong value for model_max_length (that huge number is transformers' VERY_LARGE_INTEGER placeholder, int(1e30), used when no maximum length is configured), while for another model like BERTurk it returns the correct value.

The easiest fix would be to extend tokenizer_config.json with a "max_len": 512 entry :)
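
For anyone hitting this before the hosted config is updated, a client-side workaround (a sketch, not part of the original report) is to pass model_max_length explicitly; AutoTokenizer.from_pretrained forwards extra keyword arguments to the tokenizer constructor:

In [9]: tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased", model_max_length=512)

In [10]: tokenizer.model_max_length
Out[10]: 512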


josecannete commented 3 years ago

Hi @matirojasg,

Thank you for reporting this issue.

It should be resolved now on both models.

Also, all the configs were updated, adding support for both fast and legacy tokenizers and for both frameworks, PyTorch and TensorFlow.

Regards
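
Assuming the updated tokenizer_config.json is already live on the Hub, a quick way to verify the fix in a fresh session is to reload the tokenizer while bypassing any previously cached config (force_download=True re-fetches the files); 512 is the expected value after the fix:

In [1]: from transformers import AutoTokenizer

In [2]: tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased", force_download=True)

In [3]: tokenizer.model_max_length
Out[3]: 512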