adding links to the source datasets in benchmarks

stas00 commented 4 years ago

Currently it's hard to tell which datasets were used for the benchmark results posted here: https://huggingface.co/Helsinki-NLP/opus-mt-ru-en (and the other models from your user).

After quite some digging I derived these:

all but last entry: http://opus.nlpl.eu/WMT-News.php and maybe the original http://www.statmt.org/wmt19/
last entry: http://opus.nlpl.eu/Tatoeba.php and maybe the original https://tatoeba.org/eng/

I hope this is correct.

There is also an ambiguity about the "year" used in the dataset names in the benchmark.

[...]
|newstest2015-enru.ru.en |30.4 |0.568|
|newstest2016-enru.ru.en |30.1 |0.565|
[...]
|newstest2019-ruen.ru.en |31.4 |0.576|
|Tatoeba.ru.en |61.1 |0.736|

Does newstest2016-enru.ru.en refer to wmt16, or to the news crawl corpus that includes data from 2016 (i.e. wmt17)?
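
To make the question concrete, here is a minimal sketch that splits the benchmark row names into their visible parts. The pattern is only my guess from the rows quoted above (`<set name><year>-<pair>.<src>.<tgt>`), and `parse_benchmark_name` is a hypothetical helper, not something from this repo. It only extracts the year token; it cannot tell whether that year is the WMT edition or the year of the crawled news, which is exactly the ambiguity.

```python
import re

# Assumed pattern, inferred only from the rows quoted in this issue:
#   <set name><year>-<pair>.<src>.<tgt>   e.g. newstest2016-enru.ru.en
# Rows without a year or pair (e.g. Tatoeba.ru.en) reduce to <set name>.<src>.<tgt>.
NAME_RE = re.compile(
    r"^(?P<testset>[A-Za-z]+?)"   # set name, e.g. "newstest" or "Tatoeba"
    r"(?P<year>\d{4})?"           # optional four-digit year token
    r"(?:-(?P<pair>[a-z]{4}))?"   # optional pair tag, e.g. "enru"
    r"\.(?P<src>[a-z]{2})"        # source language of this benchmark row
    r"\.(?P<tgt>[a-z]{2})$"       # target language of this benchmark row
)


def parse_benchmark_name(name):
    """Split a benchmark row name into its parts (hypothetical helper)."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized benchmark name: {name!r}")
    return match.groupdict()


if __name__ == "__main__":
    for row in ("newstest2016-enru.ru.en", "newstest2019-ruen.ru.en", "Tatoeba.ru.en"):
        print(row, parse_benchmark_name(row))
```

Even with the name split like this, the year field still needs a documented meaning, hence the request for links to the source datasets.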

Thank you.

p.s. I originally posted about it here and was advised to file an issue here instead.

jorgtied commented 4 years ago

The test sets are copied in https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/testsets Is that good enough?

stas00 commented 4 years ago

oh, sorry I missed your reply.

it's great, but it'd help a lot to have a link from README.md to that part of github, since repos typically don't include data, so I had no idea to look there.

Thank you!

Helsinki-NLP / OPUS-MT-train

adding links to the source datasets in benchmarks #17