Helsinki-NLP / Opus-MT

Open neural machine translation models and web services
MIT License

missing keys for some low-resource language pairs #19

Open · sshleifer opened this issue 4 years ago

sshleifer commented 4 years ago

On the ber-es transformer, if I run:

spm_encode --model source.spm <<< "Bessif kanay."

I get:

▁Be ssif ▁kan ▁ay .

But ▁Be is not in opus.spm32k-spm32k.vocab.yml, so my Python tokenizer raises a KeyError when it encounters these tokens.

This doesn't change if I run preprocess.sh first. When I run the pieced sequence through marian_decoder, I get a good translation with no error.
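A minimal Python sketch of the failing lookup (the file names, input sentence, and output pieces are taken from the ber-es model above; the snippet is an illustration, not the exact tokenizer code):

import sentencepiece as spm
import yaml

# Load the source-side SentencePiece model and the Marian vocabulary
# shipped with the ber-es transformer.
sp = spm.SentencePieceProcessor()
sp.Load("source.spm")
with open("opus.spm32k-spm32k.vocab.yml", encoding="utf-8") as f:
    vocab = yaml.safe_load(f)  # piece -> id mapping

pieces = sp.EncodeAsPieces("Bessif kanay.")
print(" ".join(pieces))  # ▁Be ssif ▁kan ▁ay .

# A plain dict lookup fails on pieces that spm produces but the
# vocabulary file never lists:
ids = [vocab[p] for p in pieces]  # raises KeyError: '▁Be'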

This happens for other (model, character) combos as well; here is a list of (pair, missing key) entries from a random sample of models I tested.

{'ha-en': '|',
 'ber-es': '▁Be',
 'pis-fi': '▁|',
 'es-mt': '|',
 'fr-he': '₫',
 'niu-sv': 'OGI',
 'fi-fse': '▁rentou',
 'fi-mh': '|',
 'hr-es': '|',
 'fr-ber': '▁devr',
 'ase-en': 'olos',
 'sv-uk': '|'}

Is this expected? Should my encoder use the <unk> id in these cases?
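If falling back to <unk> is the right answer, the change on the Python side would be small; a sketch, assuming the vocabulary defines the usual <unk> entry (the reserved symbols can vary per model) and taking vocab and pieces as in the snippet above:

def piece_ids(vocab, pieces):
    # Map pieces missing from the vocab to the unknown-token id;
    # '<unk>' as the key is an assumption about the model's reserved symbols.
    unk_id = vocab["<unk>"]
    return [vocab.get(p, unk_id) for p in pieces]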

jorgtied commented 4 years ago

This is indeed strange. I don't really know what is happening in those cases and need to investigate. The only reason I can think of is that the sentencepiece model was trained on different data than the data I use for creating the vocabulary. That can indeed happen, since I keep the sentencepiece model constant even when I augment the data, but the basic data set should always be included. I have no immediate answer on that ...
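One way to check that hypothesis would be to diff the SentencePiece model's own piece inventory against the vocabulary file; if the two were built from different data, the difference should be non-empty. A sketch, continuing from the first snippet (sp and vocab already loaded), not verified against the actual build pipeline:

# Pieces the SentencePiece model can emit but the Marian vocab never lists.
spm_pieces = {sp.IdToPiece(i) for i in range(sp.GetPieceSize())}
missing = spm_pieces - set(vocab)
print(len(missing), sorted(missing)[:10])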