facebookresearch / mlqe

We release a dataset based on Wikipedia sentences and the corresponding translations in 6 different languages, along with scores (on a scale of 1 to 100) generated through human evaluation that represent the quality of the translations.

Paper: Unsupervised Quality Estimation for Neural Machine Translation
Creative Commons Attribution Share Alike 4.0 International

Can you tell me the exact training corpus for one of your provided NMT models? #6

Closed maohbao closed 3 years ago

maohbao commented 3 years ago

Hi Fomicheva,

I want to know the exact corpus that you used to train the en-zh NMT model linked below, as well as the fairseq parameters that you used during training. Thank you very much!

https://www.quest.dcs.shef.ac.uk/wmt20_files_qe/models_en-zh.tar.gz

mfomicheva commented 3 years ago

Hi,

The corpora we used for training the En-Zh system are (links are provided in the README):

The data was tokenized with Moses for En and with the jieba tokenizer for Zh. For BPE we used subword-nmt with -s 40000. You can also download a tokenized version of the data from http://www.statmt.org/wmt20/quality-estimation-task.html.
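For reference, a minimal sketch of that preprocessing pipeline. File names and the mosesdecoder path are placeholders, and whether BPE was learned jointly or per language is not specified here; this sketch learns it per language:

```bash
# Tokenize English with the Moses tokenizer (assumes a local mosesdecoder checkout)
perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
    < train.en > train.tok.en

# Tokenize Chinese with jieba's command-line interface, using a space as delimiter
python -m jieba -d ' ' train.zh > train.tok.zh

# Learn a BPE model with subword-nmt (-s 40000, as above) and apply it
subword-nmt learn-bpe -s 40000 < train.tok.en > bpe.codes.en
subword-nmt apply-bpe -c bpe.codes.en < train.tok.en > train.bpe.en
```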

Regarding NMT training, we used the following fairseq parameters: --arch transformer_wmt_en_de --share-decoder-input-output-embed --optimizer adam --adam-betas (0.9, 0.98) --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 --lr 0.0005 --clip-norm 0.0 --dropout 0.3 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --weight-decay 0.0 --max-tokens 4096

This was done with version 0.8.0 of fairseq. You may get different results if you use a different version.
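Assembled into a runnable command, those flags would look roughly as follows. The fairseq-preprocess step and the directory names are assumptions for illustration, not part of the original reply, and note that --adam-betas needs shell quoting:

```bash
# Binarize the BPE'd parallel data (file and directory names are placeholders)
fairseq-preprocess --source-lang en --target-lang zh \
    --trainpref train.bpe --validpref valid.bpe \
    --destdir data-bin/en-zh

# Train with the reported hyperparameters (fairseq 0.8.0)
fairseq-train data-bin/en-zh \
    --arch transformer_wmt_en_de --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
    --lr 0.0005 --clip-norm 0.0 --dropout 0.3 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --weight-decay 0.0 --max-tokens 4096
```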

maohbao commented 3 years ago

Hi mfomicheva,

Thank you very much for the detailed reply! One more question about the corpus used to train the en-zh model:

Did you use the corpus from the link below? https://www.quest.dcs.shef.ac.uk/wmt20_files_qe/training_en-zh.tar.gz (listed on the page http://www.statmt.org/wmt20/quality-estimation-task.html)

I ask because you said the "*Additional* parallel data" are:

Thank you very much!

mao

mfomicheva commented 3 years ago

Hi, yes, we did. The data in this link and the corpora listed under "Additional parallel data" are the same. Sorry for the confusion. I will fix this in the README.

maohbao commented 3 years ago

Thank you!