Helsinki-NLP / Tatoeba-Challenge


dataset of Helsinki-NLP/opus-mt-en-zh #20

Closed QzzIsCoding closed 1 year ago

QzzIsCoding commented 2 years ago

Hi, thanks for your model. I have two questions about the training datasets of opus-mt-en-zh.

https://github.com/Helsinki-NLP/Tatoeba-Challenge/blob/master/data/README-v2021-08-07.md lists:

- English - Chinese (eng-zho): 10390 | 43075 | 129323178
- Middle English (1100-1500) - Chinese (enm-zho)

On this page there are two English-Chinese datasets. Which one is the training dataset of opus-mt-en-zh? And when fine-tuning the model, do I need to add ">>cmn_Hans<< " before each train_src sentence?

jorgtied commented 2 years ago

Both are used, but I guess that enm-zho is very small and will not influence the model very much. For zho, the models cover various language variants and are trained with target-language tokens. So yes, you need to add a prefix to use the model. But I can also imagine that fine-tuning without it would probably work; the model would then learn to translate without the prefix token.
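The prefixing step described above can be sketched as a small preprocessing helper. This is a minimal illustration, not part of the repository: the function name is hypothetical, and only the token string ">>cmn_Hans<<" comes from the thread.

```python
# Sketch: prepend the target-language token to each English source sentence
# before tokenizing it for opus-mt-en-zh fine-tuning.
# The helper name `add_lang_token` is hypothetical; the token ">>cmn_Hans<<"
# selects Simplified-script Mandarin among the zho variants the model supports.

def add_lang_token(src_lines, token=">>cmn_Hans<<"):
    """Return source sentences with the target-language token prefixed."""
    return [f"{token} {line}" for line in src_lines]

train_src = [
    "Thank you for your help.",
    "Where is the train station?",
]

for line in add_lang_token(train_src):
    print(line)
# Each printed line now starts with ">>cmn_Hans<< ", ready to be fed to
# the tokenizer as train_src.
```

The prefixed lines would then go through the model's tokenizer as usual; the token is consumed as an ordinary vocabulary item that steers the decoder toward the chosen target variant.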