Preprocessing of training data

Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models

MIT License

Preprocessing of training data #68

Open alexvs-sysoev opened 2 years ago

alexvs-sysoev commented 2 years ago

Hi! I am trying to reproduce the training results of some models. Normalization is used everywhere as a preprocessing step. As I understand it, a script is used for this: normilize.sh. Is anything else applied besides this? For example, punctuation normalization or specific techniques for different languages?

jorgtied commented 2 years ago

The basic preprocessing of training data is defined in https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/lib/preprocess.mk. The scripts in https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/scripts/cleanup are meant for language-specific processing, but right now they don't do much.