Closed. santhoshtr closed this 4 years ago.
The best way would be to implement some Python libraries that do language-specific cleanup. That way, I could call those functions from another script that applies the matching cleanup function to any bitext that includes one of those languages. This would then be easy to integrate into Makefile.data, where all kinds of pre-processing happen. There is already a script (bitext-match-lang.py) that does language identification as another pre-processing step. That already helps quite a lot, and I have started to train some new models after the improved filtering.
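The idea above could be sketched roughly as follows. This is only a minimal illustration, not the code used in the pipeline: the names `CLEANUPS`, `register`, `clean_pair` and `clean_bitext` are hypothetical, the `ml` rule is a placeholder (whitespace collapsing only), and the input is assumed to be one tab-separated `source<TAB>target` pair per line.

```python
import re
from typing import Callable, Dict, Iterable, Iterator, Tuple

# Hypothetical registry: language code -> cleanup function.
CLEANUPS: Dict[str, Callable[[str], str]] = {}


def register(lang: str):
    """Decorator that registers a cleanup function for one language code."""
    def wrap(func: Callable[[str], str]) -> Callable[[str], str]:
        CLEANUPS[lang] = func
        return func
    return wrap


@register("ml")
def cleanup_ml(text: str) -> str:
    # Placeholder rule: collapse runs of spaces/tabs and trim the ends.
    # Real Malayalam-specific rules would be added here.
    return re.sub(r"[ \t]+", " ", text).strip()


def clean_pair(src_lang: str, trg_lang: str, src: str, trg: str) -> Tuple[str, str]:
    """Apply the registered cleanup to each side; unknown languages pass through."""
    src_fn = CLEANUPS.get(src_lang, lambda s: s)
    trg_fn = CLEANUPS.get(trg_lang, lambda s: s)
    return src_fn(src), trg_fn(trg)


def clean_bitext(src_lang: str, trg_lang: str, lines: Iterable[str]) -> Iterator[str]:
    """Yield cleaned pairs, assuming 'source<TAB>target' on every input line."""
    for line in lines:
        src, _, trg = line.rstrip("\n").partition("\t")
        yield "\t".join(clean_pair(src_lang, trg_lang, src, trg))
```

A wrapper script could then feed `sys.stdin` to `clean_bitext("en", "ml", ...)` and print the results, so that a makefile rule only has to pipe the bitext through it. Languages without a registered function are left untouched, which keeps the filter safe to run on every language pair.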
We are also working on a package called opus-filter that will make it easier to select proper data from larger noisy data sets. It is already released but needs some more tweaking before it is easy to apply in the general case.
Thanks. I spent some time on it, but the workings of Makefile.data and the related files were difficult to understand. Do you mind if I share a cleanup script here and ask for help adding it to the codebase?
This is a sed script: https://gist.github.com/santhoshtr/1d2143ed5a4987b31c8c1a2c17564263
Ideally, this script needs to run on the raw parallel text before any other processing.
Yes, those makefiles are very much research-in-progress material. I'll be happy to help with integrating your cleanup scripts. Whatever additional scripts / libraries you create I can then find ways of integrating them in the data processing pipeline.
The sed script is included now and the makefile includes some routines to read cleanup scripts if available.
(This is a question, please redirect me if this is not the right place to ask)
I observed that the test and training data sets can be greatly improved by automatic, language-specific cleanup and normalization. For example, consider this MT output for en-ml: "എന് റെ വീട് ഇന്ത്യയിലാണ്." Here, the space inside "എന് റെ" is unwanted. This is a known issue in most of the Malayalam content found on the web. I found this kind of issue in both the training and the test data.
If I want to fix this, where exactly do I need to add the cleanup code?
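To make the en-ml example above concrete, here is a minimal Python sketch of such a fix. It is an illustration under assumptions, not the rule set from any existing script: `ML_FIXES` and `fix_malayalam_spacing` are hypothetical names, and the table contains only the one broken form quoted in this thread. A blanket rule that removes every space after "ന്" would also merge legitimately separate words, so the sketch uses an explicit list of known broken forms instead.

```python
# Illustrative table of known broken spellings -> corrected forms.
# Only the example from this thread is listed; a real table would be longer.
ML_FIXES = {
    "എന് റെ": "എന്റെ",
}


def fix_malayalam_spacing(text: str) -> str:
    """Replace each known broken form with its corrected spelling."""
    for broken, fixed in ML_FIXES.items():
        text = text.replace(broken, fixed)
    return text
```

Because the replacement is a plain substring match, it would need to run on the raw text before tokenization or subword segmentation splits the words any further.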