Hi,
The corpora we used for training the En-Zh system are the ones listed under "Additional parallel data" (links are provided in the README).
The data was tokenized with Moses for En and with the jieba tokenizer for Zh. For BPE we used subword-nmt with `-s 40000`. You can also download a tokenized version of the data from http://www.statmt.org/wmt20/quality-estimation-task.html.
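For reference, a minimal sketch of that preprocessing pipeline could look like the following. Only the Moses/jieba tokenization and `subword-nmt -s 40000` come from the description above; the file names, the path to mosesdecoder, and learning BPE separately per language are assumptions for illustration.

```bash
# Tokenize English with the Moses tokenizer (path to mosesdecoder is assumed)
perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < train.en > train.tok.en

# Tokenize Chinese with jieba's command-line interface, using a space as the delimiter
python -m jieba -d ' ' < train.zh > train.tok.zh

# Learn and apply BPE with subword-nmt, 40000 merge operations per side
subword-nmt learn-bpe -s 40000 < train.tok.en > codes.en
subword-nmt learn-bpe -s 40000 < train.tok.zh > codes.zh
subword-nmt apply-bpe -c codes.en < train.tok.en > train.bpe.en
subword-nmt apply-bpe -c codes.zh < train.tok.zh > train.bpe.zh
```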
Regarding NMT training, these are the fairseq options we used: `--arch transformer_wmt_en_de --share-decoder-input-output-embed --optimizer adam --adam-betas '(0.9, 0.98)' --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 --lr 0.0005 --clip-norm 0.0 --dropout 0.3 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --weight-decay 0.0 --max-tokens 4096`
This was done with fairseq version 0.8.0. It is possible you would get different results with a different version.
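Put together, a training run with those options might look roughly like the sketch below. The flags are the ones quoted above; the data-bin path, the `--trainpref`/`--validpref` file prefixes, `--workers`, and `--save-dir` are assumptions added for illustration.

```bash
# Binarize the BPE-segmented data (file prefixes and destination dir are assumed)
fairseq-preprocess --source-lang en --target-lang zh \
    --trainpref train.bpe --validpref valid.bpe \
    --destdir data-bin/en-zh --workers 8

# Train the transformer with the options listed above (fairseq 0.8.0)
fairseq-train data-bin/en-zh \
    --arch transformer_wmt_en_de --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 --lr 0.0005 \
    --clip-norm 0.0 --dropout 0.3 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --weight-decay 0.0 --max-tokens 4096 \
    --save-dir checkpoints/en-zh
```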
Hi mfomicheva,
Thank you very much for the detailed reply! One more question about the corpus used to train the en-zh model:
Did you use the corpus from the link below? https://www.quest.dcs.shef.ac.uk/wmt20_files_qe/training_en-zh.tar.gz (it is listed on http://www.statmt.org/wmt20/quality-estimation-task.html)
I ask because you said the "Additional parallel data" are:
Thank you very much!
mao
Hi, yes, we did. The data in this link and the corpora listed under "Additional parallel data" are the same. Sorry for the confusion. I will fix this in the README.
Thank you!
Hi Fomicheva,
I would like to know the exact corpus that you used to train the en-zh NMT model in the link below, as well as the fairseq parameters that you used during training. Thank you very much!
https://www.quest.dcs.shef.ac.uk/wmt20_files_qe/models_en-zh.tar.gz