NMZivkovic / BertTokenizers

Open source project for BERT Tokenizers in C#.
MIT License

Multilingual model tokenization differs from Python #4

Closed ADD-eNavarro closed 2 years ago

ADD-eNavarro commented 2 years ago

Hi again. In my experimentation I have found that the tokens produced by BertTokenizers are not exactly the same as in Python. One example is the Spanish word "últimamente". I run this code in Python:

from transformers import (BertTokenizer)
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=False)
print(tokenizer.tokenize("últimamente"))

And my result is: ['última', '##mente']

But then in C# I write:

BertMultilingualTokenizer tokenizer = new BertMultilingualTokenizer();
var t = tokenizer.Tokenize("últimamente");
Console.WriteLine($"{t[1].Item1}");

And my result is "[UNK]".

I'm guessing the vocabulary file in your package may not be the same as (or may be outdated compared to) the one on Hugging Face. Please take a look into it.
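For context on why a single word collapses entirely to [UNK]: my understanding (a sketch of the greedy longest-match-first WordPiece algorithm from the BERT paper, not this repo's actual code) is that if any remaining part of a word fails to match a vocabulary piece, the whole word is replaced by [UNK], not just the unmatched part:

```python
def wordpiece(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first WordPiece (hypothetical minimal sketch).
    # Non-initial pieces carry the "##" continuation prefix.
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            # One unmatched stretch poisons the whole word.
            return [unk]
        tokens.append(cur)
        start = end
    return tokens

# With both pieces present, we get the expected Python-side result:
print(wordpiece("últimamente", {"última", "##mente"}))  # ['última', '##mente']
# If the vocabulary is missing a matching piece, the entire word becomes [UNK]:
print(wordpiece("últimamente", {"ult", "##ima"}))       # ['[UNK]']
```

So a vocabulary mismatch between the C# package and Hugging Face would produce exactly this symptom.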

DanMMSFT commented 2 years ago

I have a PR to fix this here: https://github.com/NMZivkovic/BertTokenizers/pull/3

@NMZivkovic Would it be possible to merge my PR? It's a small one line fix on the tokenizer.

NMZivkovic commented 2 years ago

PR is merged. The new version (1.10.0) contains this fix. Thanks for contributing!

ADD-eNavarro commented 2 years ago

I have tried 1.10 and the word "últimamente" still gets tokenized as [UNK]. I guess DanMMSFT's PR fixed an issue (thanks, btw!), but not the one causing my problem. Besides, the expected tokenization of "últimamente" as ['última', '##mente'] speaks for itself: there are no 2-character pieces there. So I stick with my first impression: it must be something like a difference in the vocabulary.
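One concrete thing worth checking along these lines (an assumption to verify, not a confirmed diagnosis): Unicode normalization. The accented "ú" can be stored either as one precomposed code point (NFC) or as "u" plus a combining accent (NFD). The two render identically but compare as different strings, so a vocabulary stored in one form never matches input text in the other, and every accented word falls back to [UNK]:

```python
import unicodedata

# "últimamente" with a precomposed ú (U+00FA) vs. its decomposed form
# ('u' followed by the combining acute accent U+0301).
nfc = "\u00faltimamente"
nfd = unicodedata.normalize("NFD", nfc)

print(len(nfc), len(nfd))  # 11 12
print(nfc == nfd)          # False: visually identical, byte-wise different
```

Comparing the normalization form of the package's vocabulary file against the one Python's BertTokenizer uses might narrow this down.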