Open alexvs-sysoev opened 2 years ago
The basic preprocessing of training data is defined in https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/lib/preprocess.mk. The scripts in https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/scripts/cleanup are meant for language-specific processing, but right now they don't do much.
Hi! I am trying to reproduce the training results of some models. Normalization is used everywhere as a preprocessing step. As I understand it, the script normilize.sh is used for this. Is anything else applied besides this, for example punctuation normalization or specific techniques for different languages?