Are posted test sets preprocessed?

Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models

MIT License


Are posted test sets preprocessed? #9

Closed by sshleifer 3 years ago

sshleifer commented 4 years ago

In the posted test sets, such as https://object.pouta.csc.fi/OPUS-MT-models/jap-en/opus-2020-01-09.test.txt:

Has the source been run through preprocess.sh?
Have the system translations and gold references been run through postprocess.sh (I assume yes, given the lack of _)?
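
One rough way to check the first question on a downloaded file is to look for subword markers in the text. The sketch below assumes that the `_` mentioned above refers to the SentencePiece word-boundary symbol "▁" (U+2581) and that pre-processed text would contain it on most lines; neither assumption is confirmed in this thread. The function name and the threshold are made up for illustration.

```python
SPM_MARKER = "\u2581"  # "▁", assumed here to be the marker left by pre-processing


def looks_sentencepiece_encoded(lines, threshold=0.5):
    """Return True if at least `threshold` of the non-empty lines contain the marker.

    This is only a heuristic: it says nothing about other pre-processing
    steps (normalisation, filtering) that leave no visible marker.
    """
    non_empty = [ln for ln in lines if ln.strip()]
    if not non_empty:
        return False
    marked = sum(SPM_MARKER in ln for ln in non_empty)
    return marked / len(non_empty) >= threshold


if __name__ == "__main__":
    print(looks_sentencepiece_encoded(["\u2581Hello \u2581world", "\u2581Good \u2581morn ing"]))
    print(looks_sentencepiece_encoded(["Hello world."]))
```

Running it separately over the source, gold and system lines would show which of the three, if any, still carry markers.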

sshleifer commented 4 years ago

Also, some test sets (150+) have empty source entries with populated gold and system translations, e.g. in https://object.pouta.csc.fi/OPUS-MT-models/en-st/opus-2020-01-08.test.txt
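
A small script can list the affected records in a downloaded test file. The sketch below assumes a hypothetical layout in which every record occupies exactly four lines (source, gold, system translation, blank separator); the actual layout of the posted files is not described in this thread and should be checked before relying on it. The function name is also made up for illustration.

```python
def find_empty_sources(text, lines_per_record=4):
    """Return indices of records whose source line is empty but whose
    gold and system lines are populated.

    Assumes fixed-size records: source, gold, system, then a blank line.
    """
    lines = text.split("\n")
    empty = []
    # Stop early enough that each record still has three content lines.
    for i in range(0, len(lines) - 2, lines_per_record):
        source, gold, system = lines[i:i + 3]
        if not source.strip() and gold.strip() and system.strip():
            empty.append(i // lines_per_record)
    return empty


if __name__ == "__main__":
    sample = (
        "src one\ngold one\nsys one\n\n"
        "\ngold two\nsys two\n\n"
        "src three\ngold three\nsys three\n"
    )
    print(find_empty_sources(sample))
```

Fixed-size records are used instead of splitting on blank lines because an empty source line would otherwise be indistinguishable from a record separator.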

jorgtied commented 4 years ago

This might not be very consistent at the moment. I changed the procedure for generating that file, and the test files for recent models should not be pre-processed. Older models might still have pre-processed files. Sorry for the inconvenience.

jorgtied commented 4 years ago

The problem with empty lines most likely comes from noise in the data, especially for extremely small data sets. Most models are trained in batch runs with automatically selected train/dev/test splits. The risk is that some strange data files end up being used for validation and testing. This is a problem, but it is difficult to improve because there is no human-validated test data for most language pairs.

jorgtied commented 4 years ago

Now I realise that the tokenisation issue can also come from the JW300 corpus. Unfortunately, I do not have a proper raw version of that corpus without tokenisation. This is most likely the problem you see for low-resource languages, where test sets seem to be tokenised.
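
To get a rough idea of which test sets are affected, one could measure how often punctuation is preceded by whitespace, which is typical of tokenised text. This is only a heuristic sketch, not something used in the OPUS-MT pipeline: the function name and punctuation set are chosen for illustration, and some languages (French, for example) legitimately put a space before certain marks.

```python
import re

PUNCT = re.compile(r"[.,;:!?]")           # punctuation marks to count
SPACED_PUNCT = re.compile(r"\s[.,;:!?]")  # the same marks preceded by whitespace


def spaced_punctuation_ratio(lines):
    """Return the fraction of punctuation marks that follow whitespace.

    Values near 1.0 suggest tokenised text; values near 0.0 suggest raw text.
    Returns 0.0 when the input contains no punctuation at all.
    """
    total = sum(len(PUNCT.findall(ln)) for ln in lines)
    if total == 0:
        return 0.0
    spaced = sum(len(SPACED_PUNCT.findall(ln)) for ln in lines)
    return spaced / total


if __name__ == "__main__":
    print(spaced_punctuation_ratio(["Hello , world !", "How are you ?"]))
    print(spaced_punctuation_ratio(["Hello, world!", "How are you?"]))
```

Any cut-off between "tokenised" and "raw" would have to be chosen per language pair by inspecting a few files by hand.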