Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models

different sizes of dictionaries in different models #85

Open bariluz93 opened 1 year ago

bariluz93 commented 1 year ago

Hi, I use different tokenizers for different languages:

Helsinki-NLP/opus-mt-en-de
Helsinki-NLP/opus-mt-en-he
Helsinki-NLP/opus-mt-en-ru
Helsinki-NLP/opus-mt-en-es

I see that the English sides of the vocabularies differ. For example, tokenizer_he.tokenize("housekeeper") outputs ['▁housekeeper'], while tokenizer_es.tokenize("housekeeper") outputs ['▁house', 'keeper'].
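A minimal way to reproduce this (assuming the Hugging Face transformers and sentencepiece packages are installed; the tokenizer_he / tokenizer_es names match the ones above):

```python
from transformers import AutoTokenizer

# Each Helsinki-NLP/opus-mt-* checkpoint ships its own tokenizer.
tokenizer_he = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-he")
tokenizer_es = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

print(tokenizer_he.tokenize("housekeeper"))  # ['▁housekeeper']
print(tokenizer_es.tokenize("housekeeper"))  # ['▁house', 'keeper']
```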

I want to know the reason for this difference. Were the models trained on different datasets? Thank you, Bar

jorgtied commented 1 year ago

Yes, each model actually has its own SentencePiece model, trained on each side of the bitext used for training.
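This can be checked directly: since each language pair's SentencePiece model is trained on its own bitext, the subword vocabularies (including the English side) and even their sizes differ between models. A minimal sketch, again assuming transformers and sentencepiece are installed:

```python
from transformers import AutoTokenizer

# Compare the vocabulary sizes of the four models mentioned above.
for pair in ("en-de", "en-he", "en-ru", "en-es"):
    tok = AutoTokenizer.from_pretrained(f"Helsinki-NLP/opus-mt-{pair}")
    # vocab_size reflects the model's own SentencePiece vocabulary,
    # so it generally differs from pair to pair.
    print(pair, tok.vocab_size)
```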