Hey @matirojasg ,
I can confirm this bug, there's something strange with the model:
In [5]: tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
In [6]: tokenizer.model_max_length
Out[6]: 1000000000000000019884624838656
In [7]: tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
In [8]: tokenizer.model_max_length
Out[8]: 512
So it returns a wrong value for model_max_length for this model, whereas for another model like BERTurk it returns the correct value.
I will try to get in contact with the model author and report back here :)
Thank you! :)
@stefan-it I talked to the author, since he is from my university; he is going to change the configuration file.
Hi @matirojasg , that would be awesome! I was searching for contact information of the BETO team 😅
So the easiest way would be to extend the tokenizer_config.json and add a "max_len": 512 option :)
Hi Stefan,
About this issue, I'm going to change the config files in a pull request.
{ "attention_probs_dropout_prob": 0.1, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "max_position_embeddings": 512, "num_attention_heads": 12, "num_hidden_layers": 12, "type_vocab_size": 2, "vocab_size": 31002 }
https://users.dcc.uchile.cl/~jperez/beto/uncased_2M/config.json
This is actually the config file; what key/value do I have to add?
Hi @matirojasg ,
you can basically just use this tokenizer_config.json file (you don't have to change anything in the config.json file):
https://huggingface.co/dbmdz/bert-base-turkish-uncased/blob/main/tokenizer_config.json
This is also for an uncased model and it additionally specifies the max. length (and sets it to 512).
Hope this helps :)
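A rough sketch of what creating such a tokenizer_config.json could look like (the exact keys are an assumption based on the suggestion above; newer transformers versions read "model_max_length", older ones "max_len"):

import json

# Minimal tokenizer_config.json for an uncased model with a 512-token limit.
tokenizer_config = {
    "do_lower_case": True,
    "max_len": 512,  # key suggested above; newer transformers also accept "model_max_length"
}

with open("tokenizer_config.json", "w") as f:
    json.dump(tokenizer_config, f, indent=2)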
Thank you!
Hi. I just ran into the same issue using the 'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext' model. The error went away when I manually set tokenizer.model_max_length = 512. It was set to 1000000000000000019884624838656 by default.
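A sketch of that manual workaround, assuming the tokenizer is loaded directly via transformers:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
)
# The hub config does not cap the sequence length, so set it manually.
tokenizer.model_max_length = 512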
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hello. I have found that I cannot use the TransformerWordEmbeddings class for the Spanish BERT model. This is the code:
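(The original snippet was not preserved in the thread; a minimal reproduction, assuming the standard flair API and the uncased BETO model, would look roughly like this:)

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# Hypothetical reproduction of the reported setup.
embeddings = TransformerWordEmbeddings("dccuchile/bert-base-spanish-wwm-uncased")
sentence = Sentence("Esto es una prueba.")
embeddings.embed(sentence)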
This is the error:
What should I do?