Open ZJaume opened 1 year ago
Thanks for pointing this out. I should adjust the post-processing script. Does spm_decode do something else that I was not aware of? I never really understood why it would need a separate program with the model as an argument. Maybe I am missing something important?
I don't know if it does anything else, but I personally prefer to use it. It seems that it also removes trailing and duplicate spaces, see https://github.com/google/sentencepiece/issues/650.
I am using Tatoeba-MT-models/gmq-eng/opusTCv20210807+bt_transformer-big_2022-03-09 to translate the WMT21 test set for Icelandic. The postprocess script does not take into account the leading spaces inserted by SentencePiece.
These sentences from WMT21 test set Icelandic:
Are tokenized like this (with the preprocess.sh script)
and then the postprocess script generates leading spaces when it replaces the ▁. I think this is due to SentencePiece having learned the corresponding tokens only with the ▁ at the beginning. If I use spm_decode -m model.spm this does not happen because SP takes care of it.
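To make the effect concrete, here is a minimal sketch in plain Python. It assumes the post-processing amounts to dropping the spaces between pieces and replacing each ▁ with a space; I have not copied the actual script. The function names and the sample pieces are made up for illustration, and the trimming step only mimics the leading, trailing and duplicate space handling that spm_decode appears to do. It is not a replacement for spm_decode.

```python
import re

SP_SPACE = "\u2581"  # the "▁" marker SentencePiece puts at word boundaries


def naive_detok(line: str) -> str:
    """Drop the spaces between pieces, then turn every marker into a space."""
    return line.replace(" ", "").replace(SP_SPACE, " ")


def trimmed_detok(line: str) -> str:
    """Same replacement, then collapse repeated spaces and strip both ends."""
    text = naive_detok(line)
    return re.sub(r" {2,}", " ", text).strip()


# Hypothetical pieces, not taken from the WMT21 test set.
pieces = "\u2581Hello \u2581wor ld ."

print(repr(naive_detok(pieces)))    # ' Hello world.'  (leading space)
print(repr(trimmed_detok(pieces)))  # 'Hello world.'
```

Because the first piece of a sentence normally starts with ▁, the plain replacement always leaves a space at the start of the line unless it is stripped afterwards.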