maroxtn closed this issue 3 years ago
Hi, I'm facing the same problem. Did you figure out what data they used to train the models? According to the HF model card, it is OPUS data, but as far as I understand, OPUS is a collection of datasets. How do I find the data for a specific language pair, e.g., en-fi?
It's true that the data sets are not well documented. Newer models use the compilation from the Tatoeba MT challenge: https://github.com/Helsinki-NLP/Tatoeba-Challenge/. Finding resources for a specific language pair from OPUS can be done with the opusapi: https://opus.nlpl.eu/opusapi/
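For anyone who wants to script the lookup, here is a minimal sketch of querying the opusapi for one language pair. The query parameter names (`source`, `target`, `preprocessing`, `version`) and the `corpora` key in the JSON response are my assumptions about the API; please check them against the opusapi page linked above. The helper names `build_query` and `list_corpora` are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the link above.
OPUS_API = "https://opus.nlpl.eu/opusapi/"


def build_query(source, target, preprocessing="moses", version="latest"):
    """Build the query URL for one language pair.

    The parameter names are assumptions about the opusapi and should be
    verified against its documentation.
    """
    params = {
        "source": source,
        "target": target,
        "preprocessing": preprocessing,
        "version": version,
    }
    return OPUS_API + "?" + urlencode(params)


def list_corpora(source, target, **kwargs):
    """Fetch the corpus entries for a language pair (needs network access).

    Assumes the response is a JSON object with a "corpora" list; returns an
    empty list if that key is missing.
    """
    with urlopen(build_query(source, target, **kwargs), timeout=30) as resp:
        return json.load(resp).get("corpora", [])
```

Calling `list_corpora("en", "fi")` should then return one entry per corpus available for en-fi, and `list_corpora("ar", "en")` the same for ar-en. Note that this only lists what OPUS offers for the pair; it does not tell you which subset a particular model was actually trained on.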
I am wondering why the training data is not linked in the README. For the sake of my research, I need to know the dataset that the model was trained on, particularly for the ar-en and en-ar models.
If it is available somewhere and I missed it, could somebody kindly point me to it? Thanks!