Closed. ADD-eNavarro closed this issue 2 years ago.
I have a PR to fix this here: https://github.com/NMZivkovic/BertTokenizers/pull/3
@NMZivkovic Would it be possible to merge my PR? It's a small one-line fix in the tokenizer.
PR is merged. The new version (1.10.0) contains this fix. Thanks for contributing!
I have tried 1.10.0 and the word "últimamente" still gets tokenized as [UNK]. I guess DanMMSFT's PR solved an issue (thanks, btw!), but not the one causing my problem. Besides, the fact that Python tokenizes "últimamente" as ['última', '##mente'] speaks for itself: there are no two-character pieces there. So I stick with my first impression: it must be something like a difference in the vocabulary.
Hi again. In my experimentation I have found that the tokens coming out of BertTokenizers are not exactly the same as in Python. One example is the Spanish word "últimamente". I run this code in Python:
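Roughly this (a minimal sketch; the exact checkpoint is my assumption, using `bert-base-multilingual-cased` since the word is Spanish):

```python
# Minimal sketch. The checkpoint name is an assumption
# ("bert-base-multilingual-cased"); any multilingual BERT vocabulary
# containing "última" and "##mente" would split the word the same way.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tokenizer.tokenize("últimamente"))
```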
And my result is: ['última', '##mente']
But then in C# I write:
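Roughly this (again a sketch; the concrete tokenizer class is my assumption, `BertMultilingualTokenizer` being one of the classes the package exposes):

```csharp
// Minimal sketch. The tokenizer class is an assumption
// (BertMultilingualTokenizer); the package exposes one class per
// supported vocabulary, each with a Tokenize() method.
using System;
using BERTTokenizers;

var tokenizer = new BertMultilingualTokenizer();

// Tokenize returns (Token, VocabularyIndex, SegmentIndex) tuples.
foreach (var token in tokenizer.Tokenize("últimamente"))
{
    Console.WriteLine(token.Token);
}
```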
And my result is "[UNK]".
I'm guessing the vocabulary file in your package may not be the same (outdated?) as the one on Hugging Face. Please take a look.