Helsinki-NLP / Opus-MT

Open neural machine translation models and web services
MIT License

missing keys for some low-resource language pairs #19

Open · sshleifer opened this issue 4 years ago

sshleifer commented 4 years ago

On the ber-es transformer, if I run:

spm_encode --model source.spm <<< "Bessif kanay."

I get:

▁Be ssif ▁kan ▁ay .

But ▁Be is not in opus.spm32k-spm32k.vocab.yml, so my Python tokenizer raises a KeyError when it encounters these tokens.

This doesn't change if I run preprocess.sh first. When I run the pieced sequence through marian_decoder, I get a good translation with no error.
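A minimal Python sketch of the failing lookup (the file names, input sentence, and output pieces are taken from the ber-es model above; the snippet is an illustration, not the exact tokenizer code):

import sentencepiece as spm
import yaml

# Load the source-side SentencePiece model and the Marian vocabulary
# shipped with the ber-es transformer.
sp = spm.SentencePieceProcessor()
sp.Load("source.spm")
with open("opus.spm32k-spm32k.vocab.yml", encoding="utf-8") as f:
    vocab = yaml.safe_load(f)  # piece -> id mapping

pieces = sp.EncodeAsPieces("Bessif kanay.")
print(" ".join(pieces))  # ▁Be ssif ▁kan ▁ay .

# A plain dict lookup fails on pieces that spm produces but the
# vocabulary file never lists:
ids = [vocab[p] for p in pieces]  # raises KeyError: '▁Be'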

This happens for other (model, character) combos as well; here is a list of (pair, missing key) entries from a random sample of models I tested.

{'ha-en': '|',
 'ber-es': '▁Be',
 'pis-fi': '▁|',
 'es-mt': '|',
 'fr-he': '₫',
 'niu-sv': 'OGI',
 'fi-fse': '▁rentou',
 'fi-mh': '|',
 'hr-es': '|',
 'fr-ber': '▁devr',
 'ase-en': 'olos',
 'sv-uk': '|'}

Is this expected? Should my encoder use the <unk> id in these cases?
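If falling back to <unk> is the right answer, the change on the Python side would be small; a sketch, assuming the vocabulary defines the usual <unk> entry (the reserved symbols can vary per model) and taking vocab and pieces as in the snippet above:

def piece_ids(vocab, pieces):
    # Map pieces missing from the vocab to the unknown-token id;
    # '<unk>' as the key is an assumption about the model's reserved symbols.
    unk_id = vocab["<unk>"]
    return [vocab.get(p, unk_id) for p in pieces]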

jorgtied commented 4 years ago

This is indeed strange. I don't really know what is happening in those cases and need to investigate. The only reason I can think of is that the sentencepiece model was trained on different data than the data I use for creating the vocabulary. That can indeed happen, since I keep the sentencepiece model constant even when I augment the data, but the basic data set should always be included. I have no immediate answer on that ...
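One way to check that hypothesis would be to diff the SentencePiece model's own piece inventory against the vocabulary file; if the two were built from different data, the difference should be non-empty. A sketch, continuing from the first snippet (sp and vocab already loaded), not verified against the actual build pipeline:

# Pieces the SentencePiece model can emit but the Marian vocab never lists.
spm_pieces = {sp.IdToPiece(i) for i in range(sp.GetPieceSize())}
missing = spm_pieces - set(vocab)
print(len(missing), sorted(missing)[:10])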