Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License

How to download training data? #14

Closed sshleifer closed 3 years ago

sshleifer commented 4 years ago

It seems like `make data` is looking for `/projappl/nlpl/data/OPUS/*/latest/xml/en-ro.xml.gz`. I can fix the path, but I think I will still need to download `en-ro.xml.gz`. Could you provide instructions for how to do that?

I found the opustools command `opus_express -s en -t ro` — is that the data the models were trained on?
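For reference, a fuller `opus_express` invocation might look like the sketch below. This is only an illustration based on the OpusTools documentation — the exact flag names, defaults, and output file names depend on the installed OpusTools version, so check `opus_express --help` before relying on them.

```shell
# Sketch: fetch en-ro parallel data from OPUS with opus_express.
# Assumes OpusTools is installed (pip install opustools); flag names
# here follow the OpusTools docs and may differ between versions.
opus_express -s en -t ro \
    --test-set test.en-ro \
    --dev-set dev.en-ro \
    --train-set train.en-ro
```

`opus_express` samples held-out dev/test sets while keeping the remaining sentence pairs for training, which is why it is a reasonable stand-in for a hand-built data pipeline.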

jorgtied commented 4 years ago

The scripts are still very much hard-coded to our computing environment. I'll try to make them more generic and generally useful as soon as possible. Stay tuned ... OpusTools can definitely be used for retrieving the data. I didn't use `opus_express` to make the data sets, but that would actually be a good way of doing it. So far, I rely on local copies of OPUS that I can retrieve quickly. I can also tell you that I am working on an improved way of handling language codes and the variation that you can find in OPUS. This causes quite some confusion, and I hope to improve the situation in a future release.

luofuli commented 4 years ago

Can we use the dataset from https://github.com/Helsinki-NLP/Tatoeba-Challenge ?

jorgtied commented 3 years ago

Now the scripts support fetching the data directly from OPUS. The opus-tools python package needs to be installed. Data from the Tatoeba-Challenge can easily be used by adding the suffix `-tatoeba` to the build targets. See this for more info: https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/doc/TatoebaChallenge.md
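Putting the steps above together, a minimal workflow might look like the following sketch. The language-pair variables and the exact target names are assumptions from the OPUS-MT-train Makefile conventions — consult doc/TatoebaChallenge.md for the authoritative target list.

```shell
# Sketch of the workflow described above; target and variable names
# are assumptions and should be checked against the repository docs.

# 1. Install OpusTools so the Makefile can fetch data from OPUS.
pip install opustools

# 2. Build the data sets, appending -tatoeba to use the
#    Tatoeba-Challenge releases instead of raw OPUS.
make SRCLANGS=en TRGLANGS=ro data-tatoeba

# 3. Train with the corresponding -tatoeba training target.
make SRCLANGS=en TRGLANGS=ro train-tatoeba
```

The `-tatoeba` suffix switches the data source without changing the rest of the pipeline, which is why it can be bolted onto the existing build targets.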