dbmdz / berts

DBMDZ BERT, DistilBERT, ELECTRA, GPT-2 and ConvBERT models
MIT License

Wrong dimension of the bert-base-italian-xxl vocabularies #7

Closed · f-wole closed this issue 4 years ago

f-wole commented 4 years ago

Hi, thanks again for these models! I was trying to use the bert-base-italian-xxl models, but I noticed a discrepancy between the vocabulary size given in the config.json file (32102) and the actual size of the vocabulary file (31102). Is it possible that the wrong vocabulary was uploaded?
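For reference, here's a minimal sketch reproducing the mismatch (the model id assumes the cased variant; the uncased one behaves the same):

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "dbmdz/bert-base-italian-xxl-cased"  # cased variant as an example

config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(config.vocab_size)  # 32102, as declared in config.json
print(len(tokenizer))     # 31102, the number of entries in the vocab file
```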

stefan-it commented 4 years ago

Hi @f-wole

thanks for that hint! The vocab file is correct, but the config file states a wrong vocab size. I'll fix that now :)

stefan-it commented 4 years ago

Update on that: unfortunately, I used the vocab size value of 32102 in the configuration when training the model. Fixing it properly would require re-training the model, which is currently beyond my resources.

However, the model is working, and I also ran all evaluations with the exact configuration that is deployed on the model hub.

f-wole commented 4 years ago

Yes, I saw that the model expects a vocabulary size of 32102 from the dimension of the word embedding matrix: `embeddings.word_embeddings.weight` has shape `torch.Size([32102, 768])`.
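That shape can be read directly off the loaded model, e.g. (again assuming the cased variant):

```python
from transformers import AutoModel

# Cased variant as an example; the uncased model behaves the same.
model = AutoModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")

# One embedding row per token id the model can accept:
print(model.embeddings.word_embeddings.weight.shape)  # torch.Size([32102, 768])
```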

So are you suggesting it would be possible to use bert-base-italian-xxl with a vocabulary of size 31102?

stefan-it commented 4 years ago

It is possible; I ran evaluations with the example script in the Hugging Face Transformers library for both NER and PoS tagging.
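To illustrate why this is safe: the tokenizer only ever produces ids up to 31101, so the extra embedding rows are never indexed and inference works fine despite the mismatch. A minimal sketch (the cased variant is just an example):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "dbmdz/bert-base-italian-xxl-cased"  # cased variant as an example

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Token ids stay in [0, 31101], so rows 31102..32101 of the
# embedding matrix are never looked up.
inputs = tokenizer("Il tempo oggi è bello.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
```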

I just updated the README to mention the vocab and config size mismatch :)

Thanks again for finding this!