Helsinki-NLP / Opus-MT

Open neural machine translation models and web services
MIT License

SentencePiece normalization #79

Open ZJaume opened 1 year ago

ZJaume commented 1 year ago

On the topic of other things SentencePiece does: it has some features that could replace the pre-/post-process.sh scripts. By default it applies NFKC normalization, and this can be customized. The default normalization already covers some of what preprocess.sh does, for example:

echo "2" | spm_encode --model isen.student.base/vocab.spm
▁2
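For reference, the effect of plain NFKC can be checked with Python's standard `unicodedata` module. This is only a sketch: SentencePiece applies its own precompiled rule set, so its output may differ in details, and the sample characters below are made up for illustration rather than taken from the model above.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply standard Unicode NFKC normalization (not SentencePiece's own rules)."""
    return unicodedata.normalize("NFKC", text)


# Illustrative inputs: fullwidth digit two, "fi" ligature, superscript two.
samples = ["\uFF12", "\uFB01", "\u00B2"]
for s in samples:
    print(f"U+{ord(s):04X} -> {nfkc(s)!r}")
```

Compatibility characters like these are folded into their plain equivalents, which is the kind of cleanup a separate preprocessing script would otherwise have to do.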

If the user needs to add or change normalization rules, the rule files at https://github.com/google/sentencepiece/tree/master/data can be used as a starting point: modify one, pass it to the spm_train step, and forget about separate preprocessing.
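A minimal sketch of what that could look like, assuming the rule file format used in that directory (hex code points of the source sequence, a tab, then hex code points of the target sequence) and the `--normalization_rule_tsv` option of spm_train. The quote-mapping rules, `custom_rules.tsv`, `corpus.txt` and the `custom` model prefix are all hypothetical placeholders. In practice the extra rules would probably be added to a copy of one of the existing files rather than used alone.

```python
from pathlib import Path


def format_rule(source: str, target: str) -> str:
    """Return one rule line: hex code points of source, a tab, hex code points of target."""
    src = " ".join(f"{ord(c):X}" for c in source)
    tgt = " ".join(f"{ord(c):X}" for c in target)
    return f"{src}\t{tgt}"


# Hypothetical extra rules: map typographic quotes to ASCII quotes.
extra_rules = [("\u2019", "'"), ("\u201C", '"'), ("\u201D", '"')]

rule_file = Path("custom_rules.tsv")  # hypothetical file name
rule_file.write_text(
    "\n".join(format_rule(s, t) for s, t in extra_rules) + "\n",
    encoding="utf-8",
)

# Print the training command; corpus and prefix names are placeholders.
print(
    "spm_train --input=corpus.txt --model_prefix=custom "
    f"--vocab_size=8000 --normalization_rule_tsv={rule_file}"
)
```

The script only writes the rule file and prints the command, so it can be inspected before anything is trained.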