Helsinki-NLP / Tatoeba-Challenge

Prediction errors using pre-trained eng-kor model #21

Open amesval opened 2 years ago

amesval commented 2 years ago

Hi everyone. I would like to run inference and replicate the reported BLEU score for the English-to-Korean translation model (https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/eng-kor). I downloaded the files from that page and installed marian-nmt on Ubuntu 20.04.3, including protobuf so that SentencePiece can be used, as required in https://marian-nmt.github.io/docs/. I ran preprocess.sh, passed its output to marian-decoder to get the translations, and finally ran postprocess.sh; the commands are sketched below. The results were unexpected: the output contained no Korean characters at all.

Am I doing something wrong?
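
For reference, this is roughly what I ran (file names such as input.eng are placeholders, and I am assuming that marian-decoder is on PATH and that both scripts read stdin and write stdout):

# apply SentencePiece segmentation to the English input
./preprocess.sh eng source.spm < input.eng > input.prep

# translate with the downloaded model
marian-decoder -c decoder.yml < input.prep > output.raw

# undo the segmentation on the Korean output
./postprocess.sh < output.raw > output.kor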

jorgtied commented 2 years ago

Ah, I forgot to include the vocab files that are mentioned in decoder.yml. Thanks for pointing me in that direction. The *.vocab.yml file is not the correct one here, because this model comes with separate vocabularies for the source and target languages; you can see this in the decoder.yml file. The *.vocab files mentioned there are missing, but you can use the spm files directly. Edit decoder.yml to look like this:

relative-paths: true
models:
  - opusTCv20210807+bt.spm32k-spm32k.transformer-align.model1.npz.best-perplexity.npz
vocabs:
  - source.spm
  - target.spm
beam-size: 6
normalize: 1
word-penalty: 0
mini-batch: 1
maxi-batch: 1
maxi-batch-sort: src
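
The two entries under vocabs are applied to the source and target sides in that order, which is why a single shared *.vocab.yml does not fit this model.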

Then run something like this:

echo "This is a test." | ./preprocess.sh eng source.spm | marian-decoder -c decoder.yml

Does that work for you?
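
To replicate the reported BLEU score, the full round trip would then look roughly like this (test.eng and ref.kor are placeholders for the test set and its reference translations; sacrebleu as the scorer is only one option):

./preprocess.sh eng source.spm < test.eng \
  | marian-decoder -c decoder.yml \
  | ./postprocess.sh > hyp.kor

# score the detokenized hypotheses against the references
sacrebleu ref.kor < hyp.kor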