Issues with tokenizer altering chars

certainlyio / nordic_bert

Pre-trained Nordic models for BERT

Creative Commons Attribution 4.0 International

Issues with tokenizer altering chars #13

Closed by christianbv 4 years ago

christianbv commented 4 years ago

Hey,

I've downloaded the Norwegian model and used it with the Hugging Face package, but I have a problem with the BertTokenizer class: weirdly enough, it changes all "å" characters to "a", whereas "æ" and "ø" remain the same. Any idea why this may be so?

Thanks, Christian

kasperjunge commented 4 years ago

I have not tried the Norwegian one, but it doesn't seem to be a problem with the Danish pre-trained model? 🤔

kasperjunge commented 4 years ago

Actually, I just noticed that the word "ubetrådt" is tokenized as ['ubet', '##rad', '##t']. But it does a good job with the other æ, ø, and å's.

christianbv commented 4 years ago

I managed to fix it, I just forgot to close the issue. Just use the fast tokenizer (BertTokenizerFast) and disable accent stripping (strip_accents=False). The Danish model is not crooked and has a different vocab file, so I guess it is not a problem there :)
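
For anyone wondering why only "å" is affected: a minimal sketch, assuming the tokenizer's accent stripping works the way the original BERT code does (NFD-normalize the text, then drop characters in Unicode category "Mn"). The helper name strip_accents_like_bert is made up for this illustration and is not part of any library. "å" decomposes into "a" plus a combining ring, so the ring is dropped, while "æ" and "ø" have no canonical decomposition and pass through unchanged.

```python
import unicodedata


def strip_accents_like_bert(text: str) -> str:
    """Hypothetical helper: NFD-decompose, then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# "å" splits into "a" + U+030A (combining ring above); "æ" and "ø" do not decompose.
for word in ["å", "æ", "ø", "ubetrådt"]:
    print(repr(word), "->", repr(strip_accents_like_bert(word)))
```

With Hugging Face transformers, the fix described above would presumably be a call along the lines of BertTokenizerFast.from_pretrained(model_dir, strip_accents=False), where model_dir is a placeholder for wherever the downloaded model lives. As far as I know, when strip_accents is left unset it follows the lowercasing setting, which would explain why an uncased model strips the ring by default.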