facebookresearch / mlqe

We release a dataset based on Wikipedia sentences and the corresponding translations in 6 different languages, along with scores (on a scale of 1 to 100) generated through human evaluation that represent the quality of the translations.

Paper: Unsupervised Quality Estimation for Neural Machine Translation
Creative Commons Attribution Share Alike 4.0 International

Can you tell me the exact training corpus for one of your provided NMT models? #6

Closed maohbao closed 3 years ago

maohbao commented 3 years ago

Hi Fomicheva,

I want to know the exact corpus that you used to train the en-zh NMT model linked below, as well as the fairseq parameters that you used during training. Thank you very much!

https://www.quest.dcs.shef.ac.uk/wmt20_files_qe/models_en-zh.tar.gz

mfomicheva commented 3 years ago

Hi,

The corpora we used for training the En-Zh system are (links are provided in the README):

The data was tokenized with Moses for En and with the jieba tokenizer for Zh. For BPE we used subword-nmt with -s 40000. You can also download a tokenized version of the data from http://www.statmt.org/wmt20/quality-estimation-task.html.
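For reference, a minimal sketch of that preprocessing pipeline. File names and the mosesdecoder path are placeholders, and whether BPE was learned jointly or per language is not specified here; this sketch learns it per language:

```bash
# Tokenize English with the Moses tokenizer (assumes a local mosesdecoder checkout)
perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
    < train.en > train.tok.en

# Tokenize Chinese with jieba's command-line interface, using a space as delimiter
python -m jieba -d ' ' train.zh > train.tok.zh

# Learn a BPE model with subword-nmt (-s 40000, as above) and apply it
subword-nmt learn-bpe -s 40000 < train.tok.en > bpe.codes.en
subword-nmt apply-bpe -c bpe.codes.en < train.tok.en > train.bpe.en
```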

Regarding NMT training, we used the following fairseq parameters: --arch transformer_wmt_en_de --share-decoder-input-output-embed --optimizer adam --adam-betas (0.9, 0.98) --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 --lr 0.0005 --clip-norm 0.0 --dropout 0.3 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --weight-decay 0.0 --max-tokens 4096

This was done with version 0.8.0 of fairseq. You may get different results if you use a different version.
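Assembled into a runnable command, those flags would look roughly as follows. The fairseq-preprocess step and the directory names are assumptions for illustration, not part of the original reply, and note that --adam-betas needs shell quoting:

```bash
# Binarize the BPE'd parallel data (file and directory names are placeholders)
fairseq-preprocess --source-lang en --target-lang zh \
    --trainpref train.bpe --validpref valid.bpe \
    --destdir data-bin/en-zh

# Train with the reported hyperparameters (fairseq 0.8.0)
fairseq-train data-bin/en-zh \
    --arch transformer_wmt_en_de --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
    --lr 0.0005 --clip-norm 0.0 --dropout 0.3 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --weight-decay 0.0 --max-tokens 4096
```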

maohbao commented 3 years ago

Hi mfomicheva,

Thank you very much for the detailed reply! One more question about the corpus used to train the en-zh model:

Did you use the corpus from the link below? https://www.quest.dcs.shef.ac.uk/wmt20_files_qe/training_en-zh.tar.gz (listed on the page http://www.statmt.org/wmt20/quality-estimation-task.html)

I ask because you said the "*Additional* parallel data" are:

Thank you very much!

mao

mfomicheva commented 3 years ago

Hi, yes, we did. The data in this link and the corpora listed under "Additional parallel data" are the same. Sorry for the confusion. I will fix this in the README.

maohbao commented 3 years ago

Thank you!