Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License

Dataset used for training the models #58

Closed maroxtn closed 3 years ago

maroxtn commented 3 years ago

I am wondering why the training data is not linked in the README. For the sake of my research, I need to know the dataset that the model was trained on, particularly for the ar-en and en-ar models.

If it is available somewhere and I missed it, somebody please kindly refer me to it. Thanks!

zhiqihuang commented 1 year ago

Hi, I'm facing the same problem. Did you figure out what data they used to train the models? According to the HF model card, it is OPUS data, but OPUS is a collection of datasets. How do I find the data for a specific language pair, e.g., en-fi?

jorgtied commented 1 year ago

It's true that the datasets are not well documented. Newer models use the compilation from the Tatoeba MT Challenge: https://github.com/Helsinki-NLP/Tatoeba-Challenge/. Resources for a specific language pair in OPUS can be found with the OPUS API: https://opus.nlpl.eu/opusapi/
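As a minimal sketch of the lookup described above: the OPUS API can be queried with source/target language parameters and returns JSON describing matching corpora. The exact parameter names (`source`, `target`, `preprocessing`, `version`) are assumptions based on the public API; check https://opus.nlpl.eu/opusapi/ for the authoritative schema.

```python
# Sketch: query the OPUS API for corpora covering one language pair.
# Parameter names are assumptions -- verify against https://opus.nlpl.eu/opusapi/
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OPUS_API = "https://opus.nlpl.eu/opusapi/"


def opus_query_url(source: str, target: str,
                   preprocessing: str = "moses",
                   version: str = "latest") -> str:
    """Build an OPUS API query URL for a language pair."""
    params = {"source": source, "target": target,
              "preprocessing": preprocessing, "version": version}
    return OPUS_API + "?" + urlencode(params)


def list_corpora(source: str, target: str) -> dict:
    """Fetch the JSON corpus listing for a pair (requires network access)."""
    with urlopen(opus_query_url(source, target)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Example: list what OPUS has for en-fi (uncomment to hit the API).
    # for entry in list_corpora("en", "fi").get("corpora", []):
    #     print(entry)
    print(opus_query_url("en", "fi"))
```

This only builds the query URL locally; the network call in `list_corpora` is optional and left commented out in the example.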