Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License

adding links to the source datasets in benchmarks #17

Open stas00 opened 4 years ago

stas00 commented 4 years ago
  1. Currently it's hard to tell which datasets were used for the benchmark results posted here: https://huggingface.co/Helsinki-NLP/opus-mt-ru-en (and for the other models under your account).

After quite some digging I derived these:

  2. There is also an ambiguity about the "year" used in the dataset names in the benchmark.
    [...]
    |newstest2015-enru.ru.en |30.4 |0.568|
    |newstest2016-enru.ru.en |30.1 |0.565|
    [...]
    |newstest2019-ruen.ru.en |31.4 |0.576|
    |Tatoeba.ru.en |61.1 |0.736|

Is newstest2016-enru.ru.en referring to wmt16, or to the news crawl corpus that includes data from 2016 (i.e. wmt17)?
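To make the question concrete, here is a minimal sketch that splits a benchmark name into its parts. It assumes the names follow the pattern `<testset><year>-<pair>.<src>.<tgt>`, which is inferred only from the rows quoted above and not from any documented convention. `parse_benchmark_name` and `BenchmarkName` are hypothetical names, not part of OPUS-MT-train. The sketch only extracts the year as written in the name; it does not settle which WMT release that year corresponds to.

```python
import re
from typing import NamedTuple, Optional


class BenchmarkName(NamedTuple):
    """Parts of a benchmark row name (hypothetical structure)."""
    testset: str          # e.g. "newstest" or "Tatoeba"
    year: Optional[int]   # year as written in the name, if any
    pair: Optional[str]   # language pair of the test set file, e.g. "enru"
    src: str              # source language of the evaluated direction
    tgt: str              # target language of the evaluated direction


# Assumed pattern: <testset><year>-<pair>.<src>.<tgt>; year and pair are optional.
_NAME_RE = re.compile(
    r"^(?P<testset>[A-Za-z]+)"
    r"(?P<year>\d{4})?"
    r"(?:-(?P<pair>[a-z]+))?"
    r"\.(?P<src>[a-z]+)\.(?P<tgt>[a-z]+)$"
)


def parse_benchmark_name(name: str) -> BenchmarkName:
    """Split a name such as 'newstest2016-enru.ru.en' into its fields."""
    match = _NAME_RE.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognized benchmark name: {name!r}")
    year = match.group("year")
    return BenchmarkName(
        testset=match.group("testset"),
        year=int(year) if year is not None else None,
        pair=match.group("pair"),
        src=match.group("src"),
        tgt=match.group("tgt"),
    )


if __name__ == "__main__":
    for row in ("newstest2016-enru.ru.en", "Tatoeba.ru.en"):
        print(parse_benchmark_name(row))
```

Parsed this way, the name carries a year and a translation direction, but nothing that identifies the source dataset release, which is why a link to the source would help.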

Thank you.

P.S. I originally posted about this here and was advised to file an issue in this repository instead.

jorgtied commented 3 years ago

The test sets are copied in https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/testsets. Is that good enough?

stas00 commented 3 years ago

Oh, sorry, I missed your reply.

That's great, but it would help a lot to have a link from README.md to that part of the repository. Repos typically don't include data, so I had no idea to look there.

Thank you!