Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models

question about SPM and BPE #104

Closed · jemesome closed this 1 month ago

jemesome commented 1 month ago

Noob here, running option 2 (using Docker to install the Tornado web server with several language models) as listed in the README here. I ran the Dockerfile after cloning from here.

What is the difference between the models in OPUS-MT-train and the models in the Tatoeba-Challenge? Looking at the benchmark on the test set for the Spanish-to-English model, the ES-EN model trained on the Tatoeba dataset has a lower metric compared to the corresponding model in the Tatoeba-Challenge (Spa-eng).

Are these two the same model trained on different datasets? Is the Tatoeba-based model newer and better than the one in OPUS-MT-train, as the benchmark metric suggests?

jorgtied commented 1 month ago

Most of the Tatoeba-Challenge models are better than the earlier OPUS-MT models. The training data is a bit cleaner and sometimes a bit bigger. The models themselves are also sometimes bigger than the original OPUS-MT models (transformer-big instead of transformer-base). But there are different variants, so you need to check the model card / README.
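
If it helps to compare the two generations in practice, here is a minimal sketch that loads both Spanish-English models through the Hugging Face transformers library and translates the same sentence with each. The Hub IDs are assumptions on my part: `Helsinki-NLP/opus-mt-es-en` for the older OPUS-MT-train release and `Helsinki-NLP/opus-mt-tc-big-es-en` for the Tatoeba-Challenge transformer-big variant; check the model cards for the exact names and recommended usage.

```python
# Sketch: compare the older OPUS-MT es-en model with the (assumed)
# Tatoeba-Challenge transformer-big release on the same input.
# Requires: pip install transformers sentencepiece torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


def translate(model_id: str, sentences: list[str]) -> list[str]:
    # Load the tokenizer and seq2seq model for the given Hub ID.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    batch = tokenizer(sentences, return_tensors="pt", padding=True)
    generated = model.generate(**batch, max_new_tokens=128)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


sample = ["La economía española creció más de lo esperado el año pasado."]
for model_id in [
    "Helsinki-NLP/opus-mt-es-en",         # original OPUS-MT (transformer-base)
    "Helsinki-NLP/opus-mt-tc-big-es-en",  # assumed Tatoeba-Challenge ID (transformer-big)
]:
    print(model_id, "->", translate(model_id, sample))
```

The outputs will usually differ only slightly on easy sentences; the benchmark tables in each model's README are the more reliable way to compare variants.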