flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

TransformerWordEmbeddings using Spanish BERT doesn't work #2181

Closed: matirojasg closed this issue 2 years ago

matirojasg commented 3 years ago

Hello. I have found that I cannot use the TransformerWordEmbeddings class for the Spanish BERT model.

This is the code:

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

sentence = Sentence('El pasto es verde.')

# load the Spanish BERT (BETO) model
embeddings = TransformerWordEmbeddings('dccuchile/bert-base-spanish-wwm-uncased')
embeddings.embed(sentence)
print(sentence[0].embedding.size())

This is the error:

    324         # Set truncation and padding on the backend tokenizer
    325         if truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE:
--> 326             self._tokenizer.enable_truncation(max_length, stride=stride, strategy=truncation_strategy.value)
    327         else:
    328             self._tokenizer.no_truncation()

OverflowError: int too big to convert

What should I do?

stefan-it commented 3 years ago

Hey @matirojasg ,

I can confirm this bug, there's something strange with the model:

In [5]: tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")

In [6]: tokenizer.model_max_length
Out[6]: 1000000000000000019884624838656

In [7]: tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

In [8]: tokenizer.model_max_length
Out[8]: 512

So it returns a wrong value for model_max_length, whereas another model like BERTurk returns the correct value of 512.

I will try to get in contact with the model author and report back here :)
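In the meantime, a workaround on the transformers side is to pin the limit yourself. This is an untested sketch; both variants just use the standard AutoTokenizer API:

from transformers import AutoTokenizer

# Variant A: override the bogus default after loading
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
tokenizer.model_max_length = 512

# Variant B: pass the limit directly when loading
tokenizer = AutoTokenizer.from_pretrained(
    "dccuchile/bert-base-spanish-wwm-uncased", model_max_length=512
)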

matirojasg commented 3 years ago

Thank you! :)

matirojasg commented 3 years ago

@stefan-it I talked to the author, since he is from my university; he is going to change the configuration file.

stefan-it commented 3 years ago

Hi @matirojasg , that would be awesome! I was searching for contact information of the BETO team 😅

So the easiest way would be to extend the tokenizer_config.json and add a "max_len": 512 option :)
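For example, the extended tokenizer_config.json could look roughly like this (just a sketch; the "do_lower_case": true entry is an assumption for an uncased model, only the "max_len" entry is the actual fix):

{
  "do_lower_case": true,
  "max_len": 512
}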

matirojasg commented 3 years ago

Hi Stefan,

About this issue, I'm going to change the config files in a pull request.

{ "attention_probs_dropout_prob": 0.1, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "max_position_embeddings": 512, "num_attention_heads": 12, "num_hidden_layers": 12, "type_vocab_size": 2, "vocab_size": 31002 }

https://users.dcc.uchile.cl/~jperez/beto/uncased_2M/config.json

This is the current config file. Which key/value do I have to add?


stefan-it commented 3 years ago

Hi @matirojasg ,

you can basically just use this tokenizer_config.json file (you don't have to change anything in the config.json file):

https://huggingface.co/dbmdz/bert-base-turkish-uncased/blob/main/tokenizer_config.json

This is also for an uncased model and it additionally specifies the max. length (and sets it to 512).

Hope this helps :)

matirojasg commented 3 years ago

Thank you!


sokol11 commented 3 years ago

Hi. I just ran into the same issue using the 'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext' model. The error went away when I manually set tokenizer.model_max_length = 512; it was set to 1000000000000000019884624838656 by default.
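In Flair, the same workaround looks roughly like this (a sketch, assuming TransformerWordEmbeddings exposes the underlying Hugging Face tokenizer as .tokenizer):

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext')
# override the bogus default of 1000000000000000019884624838656
embeddings.tokenizer.model_max_length = 512

sentence = Sentence('Aspirin inhibits platelet aggregation.')
embeddings.embed(sentence)
print(sentence[0].embedding.size())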

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.