Helsinki-NLP / Tatoeba-Challenge


Where is the testset data? #26

Closed: David-hg closed this issue 2 years ago

David-hg commented 2 years ago

Hi, I am training an eng-spa model and I would like to compare it with the models that are already available. The problem is that the only test set file I can find is this one, which is not split up the way the results reported for this model are. Where can I find the test set split up like that?

jorgtied commented 2 years ago

Check here: https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/data/release/test (or the tar-ball with all test sets: https://object.pouta.csc.fi/Tatoeba-Challenge-devtest/test.tar)
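For anyone working from the tar-ball, here is a minimal Python sketch for finding the files that belong to one language pair once the archive has been downloaded. The internal layout of `test.tar` is not described in this thread, so the sketch only assumes that the pair label (for example `eng-spa`) appears somewhere in the file path. The function name `list_pair_files` and the sample file names are hypothetical.

```python
import io
import tarfile


def list_pair_files(tar_source, lang_pair):
    """Return sorted names of regular files whose path contains lang_pair.

    tar_source may be a filesystem path or an open binary file object.
    Matching on a substring is an assumption about how files are named.
    """
    if isinstance(tar_source, str):
        archive = tarfile.open(name=tar_source, mode="r:*")
    else:
        archive = tarfile.open(fileobj=tar_source, mode="r:*")
    with archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and lang_pair in member.name
        )


# Small in-memory archive with made-up file names, so the sketch can be
# tried without downloading anything.
demo = io.BytesIO()
with tarfile.open(fileobj=demo, mode="w") as tar:
    for name in ["test/sample.eng-spa.txt", "test/sample.deu-eng.txt"]:
        payload = b"placeholder line\n"
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
demo.seek(0)

print(list_pair_files(demo, "eng-spa"))
```

With a real download, the call would be something like `list_pair_files("test.tar", "eng-spa")`, assuming the archive was saved under that name in the current directory.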

David-hg commented 2 years ago

Thanks for the quick reply. I now have all the Tatoeba test sets, but most models are also evaluated on other test sets, such as newstest or tico19. Is there a common version of those that can be used? I found the tico19 test set in OPUS, but it contains 3.1k sentences instead of the 2.1k listed in the benchmark for the model I mentioned in my first post.

jorgtied commented 2 years ago

Ah - I see - you mean the other test sets. Look here: https://github.com/Helsinki-NLP/OPUS-MT-testsets/

David-hg commented 2 years ago

Thank you very much, this was exactly what I was looking for!