Hi, I use different tokenizers for different languages:

Helsinki-NLP/opus-mt-en-de
Helsinki-NLP/opus-mt-en-he
Helsinki-NLP/opus-mt-en-ru
Helsinki-NLP/opus-mt-en-es

I see that the English parts of the vocabularies are different. For example,
tokenizer_he.tokenize("housekeeper") outputs ['▁housekeeper']
and
tokenizer_es.tokenize("housekeeper") outputs ['▁house', 'keeper']
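Here is a minimal snippet that reproduces the two outputs, assuming the tokenizers are loaded with AutoTokenizer.from_pretrained:

```python
from transformers import AutoTokenizer

# Load the tokenizers for two of the Marian MT models listed above
tokenizer_he = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-he")
tokenizer_es = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

# The same English word is segmented differently by each tokenizer
print(tokenizer_he.tokenize("housekeeper"))  # ['▁housekeeper']
print(tokenizer_es.tokenize("housekeeper"))  # ['▁house', 'keeper']
```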
I want to know the reason for this difference. Were the tokenizers trained on different datasets?
Thank you
Bar