christianbv closed this issue 4 years ago
I have not tried the Norwegian one, but it is not a problem with the Danish pre-trained model. 🤔
Actually, I just noticed that the word "ubetrådt" is tokenized as ['ubet', '##rad', '##t'], so the "å" is lost there. But it does a good job with the other æøå's.
I managed to fix it, I just forgot to close the issue: use the fast tokenizer (BertTokenizerFast) and disable accent stripping (strip_accents). The Danish model is not crooked and has a different vocab file, so I guess it is not a problem there :)
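For anyone wondering why only "å" is affected: a minimal sketch, using only the Python standard library, of what BERT-style accent stripping does. It assumes the tokenizer decomposes the text with Unicode NFD normalization and then drops the combining marks. The helper name strip_accents_sketch is made up for illustration and is not part of any library.

```python
import unicodedata


def strip_accents_sketch(text: str) -> str:
    """Illustrative helper (hypothetical name): NFD-decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # "Mn" is the Unicode category for non-spacing (combining) marks.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# "å" decomposes into "a" + COMBINING RING ABOVE, so the ring is dropped.
# "æ" and "ø" have no canonical decomposition, so they pass through unchanged.
for word in ["ubetr\u00e5dt", "\u00e6\u00f8\u00e5"]:
    print(word, "->", strip_accents_sketch(word))
```

This matches the behaviour described in this thread ("å" becomes "a" while "æ" and "ø" are kept), which is why turning off accent stripping in the tokenizer avoids the problem.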
Hey,
I've downloaded the Norwegian model and used it with the Hugging Face package, but I have some problems with the BertTokenizer class: weirdly enough, it changes all "å" characters to "a", whereas the characters "æ" and "ø" remain the same. Any idea why this may be so?
Thanks, Christian