Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models

Multilingual preprocessing #10

Closed sshleifer closed 4 years ago

sshleifer commented 4 years ago

For this multilingual model (en-ROMANCE), I am having trouble running test-set inputs with language codes through Marian. Specifically, if I add the codes after running spm_encode, things are fine. If I try to use the preprocessing logic in preprocess.sh, things break.

test set: >>pt<< Don't spend so much time watching TV.

Using preprocess.sh, I get either

▁> > pt < < ▁Don ' t ▁spend ▁so ▁much ▁time ▁watching ▁TV .

or

>>pt<< ▁> > pt < < ▁Don ' t ▁spend ▁so ▁much ▁time ▁watching ▁TV .

but both of those cause marian_decoder to fail with an error.

The correct input seems to be

>>pt<< ▁Don ' t ▁spend ▁so ▁much ▁time ▁watching ▁TV .

That is, with the language code added after spm_encode is run. Is that correct?

jorgtied commented 4 years ago

Yes, the last line is correct. You should run preprocess.sh without the language flag; the flag is added after tokenisation, so yes, it should come after spm_encode (see the sketch below). preprocess.sh should work if you have plain text as input without the language token. Does that work?
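For reference, a minimal sketch of that order, assuming the standard SentencePiece command-line tool and placeholder file names (source.spm, input.txt are not from this repo); only the plain text is encoded, and the target-language token is prepended afterwards:

# encode the plain sentences with the source SentencePiece model
spm_encode --model=source.spm < input.txt > input.sp
# prepend the target-language token to each already-encoded line
sed 's/^/>>pt<< /' input.sp > input.ready

The resulting lines look like the third example above and can then be passed to the decoder.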

sshleifer commented 4 years ago

Yes, thanks!