Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License

Corpus clean up and normalization #3

Closed santhoshtr closed 4 years ago

santhoshtr commented 4 years ago

(This is a question, please redirect me if this is not the right place to ask)

I observed that the test and training data sets could be greatly improved with automatic, language-specific cleanup and normalization. For example, consider this MT output for en-ml: "എന് റെ വീട് ഇന്ത്യയിലാണ്." Here the space inside "എന് റെ" is unwanted; the intended form is the single word "എന്റെ". This is a known issue in much of the Malayalam content found on the web, and I found these kinds of issues in both the training and test data.

If I want to fix this, where exactly do I need to add the cleanup code?
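To make the issue concrete, here is a minimal sketch of the kind of normalization being described, assuming the problem is a stray space between a chandrakkala (the Malayalam virama, U+0D4D) and the consonant that follows it. The function name and the exact regex are illustrative, not taken from the repository:

```python
import re

def normalize_ml(text: str) -> str:
    """Delete a spurious space between a chandrakkala (virama, U+0D4D)
    and a following Malayalam consonant (U+0D15-U+0D39), so that
    'എന് റെ' becomes the intended conjunct form 'എന്റെ'."""
    return re.sub('\u0D4D +(?=[\u0D15-\u0D39])', '\u0D4D', text)

print(normalize_ml('എന് റെ വീട് ഇന്ത്യയിലാണ്.'))
# → എന്റെ വീട് ഇന്ത്യയിലാണ്.
```

The lookahead restricts the fix to consonants, so legitimate word boundaries such as the one in "വീട് ഇന്ത്യയിലാണ്" (where the next character is an independent vowel) are left untouched.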

jorgtied commented 4 years ago

The best way would be to implement Python libraries that handle the language-specific cleanup. I could then call those methods from another script that filters any bitext involving one of those languages, applying the corresponding cleanup function. This would be easy to integrate into Makefile.data, where all kinds of pre-processing happen. There is already a script (bitext-match-lang.py) that does language identification as another pre-processing step. That helps quite a lot, and I have started to train some new models with the improved filtering.
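The dispatch scheme described here could be sketched as a small registry mapping language codes to cleanup functions, applied side-by-side over a parallel corpus. All names below are hypothetical, and the 'ml' cleaner is a trivial whitespace-collapsing stand-in rather than a real Malayalam normalizer:

```python
# Hypothetical registry of language-specific cleanup functions.
# A real Malayalam entry would do script-specific normalization;
# here it just collapses runs of whitespace as a placeholder.
CLEANERS = {
    'ml': lambda s: ' '.join(s.split()),
}

def clean_bitext(pairs, src_lang, trg_lang):
    """Apply language-specific cleanup to each side of a parallel
    corpus; sides without a registered cleaner pass through as-is."""
    src_clean = CLEANERS.get(src_lang, lambda s: s)
    trg_clean = CLEANERS.get(trg_lang, lambda s: s)
    for src, trg in pairs:
        yield src_clean(src), trg_clean(trg)

pairs = [('a  b', 'c  d')]
print(list(clean_bitext(pairs, 'en', 'ml')))
# → [('a  b', 'c d')]
```

Keeping the cleanup functions in a plain dictionary like this would let a Makefile-driven pipeline invoke one generic filter script per language pair rather than a separate script per language.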

We are also working on a package called opus-filter that will make it easier to select proper data from larger, noisy data sets. It is already released but still needs some tweaking to make it applicable in the general case.

santhoshtr commented 4 years ago

Thanks. I spent some time on it, but the workings of Makefile.data and the related files were difficult to understand. Do you mind if I share a cleanup script here so you can help add it to the codebase?

This is a sed script: https://gist.github.com/santhoshtr/1d2143ed5a4987b31c8c1a2c17564263

Ideally this script needs to run on the raw parallel text before any other processing.

jorgtied commented 4 years ago

Yes, those makefiles are very much research-in-progress material. I'll be happy to help integrate your cleanup scripts. Whatever additional scripts or libraries you create, I can find ways to integrate them into the data processing pipeline.

jorgtied commented 4 years ago

The sed script is now included, and the makefile has routines to pick up cleanup scripts when they are available.