Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License
318 stars · 40 forks

Bad translation using marian-decoder #32

Open koren-v opened 3 years ago

koren-v commented 3 years ago

Hi, I've downloaded models from the following directory: https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models/ru-en

When I try some of them, I often get translations like "▁Y O O O O O O O O O O O O O O O O O O O O" or "I 'm b@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@". I then tried downloading the model from the Hugging Face site but got pretty similar outputs, whereas running it through the Hugging Face framework gives good translations. Probably something is wrong with the config. I launch it with the Marian library, for example:

 echo "привет" | ./marian-decoder -c /path/to/opus_models/opus-2019-12-05-ru-en/decoder.yml

So what could be going wrong?
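For context on the second garbled output above: the `@@ ` markers are standard BPE subword joiners, so the model is emitting raw subword pieces rather than words (and, because the input was never BPE-encoded, it keeps repeating the same piece). The usual way to merge such pieces back into words is a single `sed` substitution; a minimal sketch with an invented sample string:

```shell
# BPE detokenization sketch: "@@ " marks a subword that continues
# into the next token; deleting "@@ " rejoins the pieces into words.
echo "I 'm b@@ ro@@ ken" | sed 's/@@ //g'
# → I 'm broken
```

This only repairs well-formed BPE output; the degenerate repetition in the issue still points to missing preprocessing, not a postprocessing bug.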

jorgtied commented 3 years ago

This is the old BPE-based model. Try https://object.pouta.csc.fi/OPUS-MT-models/ru-en/opus-2020-02-26.zip (sorry that the released models are sorted by having the most recent one furthest down ....)
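For anyone following along, fetching and unpacking a released model is just a download plus unzip. A hedged sketch (the target directory name `ru-en-model` is my own choice, and the exact file list inside the zip may vary):

```shell
# Download the newer SentencePiece-based ru-en model release.
MODEL_URL="https://object.pouta.csc.fi/OPUS-MT-models/ru-en/opus-2020-02-26.zip"
ZIP="$(basename "$MODEL_URL")"   # opus-2020-02-26.zip
wget -q "$MODEL_URL"
# Unpack into its own directory; expect decoder.yml, the model
# weights, the *.spm vocabularies, and the pre/postprocess scripts.
unzip -o "$ZIP" -d ru-en-model
```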

koren-v commented 3 years ago

> This is the old BPE-based model. Try https://object.pouta.csc.fi/OPUS-MT-models/ru-en/opus-2020-02-26.zip (sorry that the released models are sorted by having the most recent one furthest down ....)

@jorgtied Hugging Face has a link with the same name (I mean opus-2020-02-26.zip), so it's probably the same model. Maybe I'm missing the preprocessing and postprocessing stages because I didn't use the preprocess.sh and postprocess.sh scripts? If so, can you please explain how to use them?

jorgtied commented 3 years ago

But your command-line call suggests that you are using an older model, opus-2019-12-05-ru-en. The output also looks like it comes from the BPE model. And yes, you certainly need to use the preprocess and postprocess scripts!

koren-v commented 3 years ago

@jorgtied Could you please show an example of how to use these scripts? I didn't get it from the description. (I've tried a newer model but got unexpected results because I didn't use preprocessing.)

jorgtied commented 3 years ago

For an English-German model, something like:

echo "Hello world" | ./preprocess.sh deu source.spm | marian-decoder -c decoder.yml --cpu-threads 4 | ./postprocess.sh

but make sure that mosesdecoder and spm_encode are installed and can be found by the preprocess script. Otherwise you can probably also skip the moses-scripts and just encode with spm_encode:

echo "Hello world" | spm_encode --model source.spm | ~/projappl/marian-dev/build/marian-decoder -c decoder.yml --cpu-threads 4 | sed 's/ //g;s/▁/ /g'
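As a sanity check of that final `sed` step (sample pieces invented for illustration): it first deletes the spaces between SentencePiece pieces, then turns each `▁` word-boundary marker back into a real space, which leaves a leading space you may want to trim:

```shell
# SentencePiece detokenization: drop inter-piece spaces, then
# map the "▁" word-boundary marker back to an ordinary space.
echo "▁Hello ▁world !" | sed 's/ //g;s/▁/ /g'
# → " Hello world!" (leading space; append s/^ *// to trim it)
```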

koren-v commented 3 years ago

@jorgtied Thanks! For now, I've tried first tokenizing the sentence with the Hugging Face pretrained tokenizer, which I suppose uses the same source.spm and target.spm, and finally got a good translation. Sorry for my stupid questions, but what exactly do I need to install? I mean the sources of both mosesdecoder and spm_encode? I've just installed spm_encode from the repo using vcpkg but can't find any matching paths to paste into ./preprocess.sh
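In case it helps later readers stuck at the same point, a hedged sketch of getting both tools from source (these are the standard build steps from each project's README; the install prefix and PATH handling are assumptions — adjust them to wherever your preprocess.sh actually looks):

```shell
# SentencePiece provides the spm_encode / spm_decode binaries.
git clone https://github.com/google/sentencepiece.git
( cd sentencepiece && mkdir -p build && cd build \
  && cmake .. && make -j"$(nproc)" \
  && sudo make install && sudo ldconfig )

# The Moses tokenizer/detokenizer scripts referenced by the
# pre/postprocess scripts live in the mosesdecoder repository.
git clone https://github.com/moses-smt/mosesdecoder.git

# Assumed default install prefix: make sure spm_encode is on PATH
# so preprocess.sh can find it.
export PATH="$PATH:/usr/local/bin"
```

A vcpkg install can work too, but the scripts expect the binaries on PATH (or at hard-coded paths inside them), so the simplest fix is usually to point PATH at wherever `spm_encode` actually landed.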