facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Data loading in multi-lingual translation #2410

Open wangyong1122 opened 4 years ago

wangyong1122 commented 4 years ago

Hello, I am using the multi-lingual translation task and have found some issues:

  1. When I use round_robin_dataset and multi_corpus_sampled_dataset with more than 20 language pairs, data loading takes much longer inside filter_by_size and batch_by_size. I find that in multi_corpus_sampled_dataset the total cost of the calls to num_tokens grows as O(N^2) (see the sketch after this list). Could you provide some optimization for these two datasets?

  2. Data loading fails (RuntimeError: received 0 items of ancdata) when I use more language pairs, e.g. 28 language pairs. I find there is a daemon thread that implements the buffering mechanism; when I disable the buffering, data loading succeeds. I cannot figure out why this happens. Could you fix this issue? (A common workaround is sketched below.) Thank you very much.
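
A minimal sketch for item 1, assuming the quadratic cost comes from re-resolving the owning sub-dataset and its size on every num_tokens call: pre-computing a flat size array (and resolving global indices with a binary search over cumulative lengths) keeps each per-index call made by filter_by_size / batch_by_size at O(1). This is illustrative Python, not fairseq's actual MultiCorpusSampledDataset; the class and method names are hypothetical.

import bisect
import numpy as np

class CachedSizeLookup:
    """Toy stand-in for a multi-corpus wrapper: each sub-dataset exposes
    __len__ and num_tokens, and a global sample index is resolved to
    (dataset_id, local_index)."""

    def __init__(self, datasets):
        self.datasets = list(datasets)
        # Cumulative boundaries so a global index is found by binary search
        # instead of a linear scan over all sub-datasets.
        self.cum_lengths = np.cumsum([len(d) for d in self.datasets]).tolist()
        # One O(N) pass caches every sample size up front.
        self.sizes = np.concatenate(
            [[d.num_tokens(i) for i in range(len(d))] for d in self.datasets]
        )

    def map_index(self, index):
        ds_id = bisect.bisect_right(self.cum_lengths, index)
        local = index - (self.cum_lengths[ds_id - 1] if ds_id > 0 else 0)
        return ds_id, local

    def num_tokens(self, index):
        # O(1) lookup from the cached array, so iterating num_tokens over
        # all N indices stays O(N) overall.
        return int(self.sizes[index])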
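
For item 2, "received 0 items of ancdata" is a generic PyTorch DataLoader symptom of running out of file descriptors while worker processes exchange tensors; with dozens of language pairs many more shards and handles stay open. The snippet below shows common PyTorch-level workarounds, not a fairseq-specific fix.

import resource
import torch.multiprocessing as mp

# Option 1: raise this process's soft limit on open file descriptors to the
# hard limit (the Python equivalent of running `ulimit -n` before training).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# Option 2: pass tensors between DataLoader workers through the file system
# rather than via file-descriptor passing, which sidesteps the fd limit.
mp.set_sharing_strategy("file_system")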

tangyuq commented 4 years ago

We have a new multilingual task: translation_multi_simple_epoch. With it you can do the following:

Training

lang_pairs=<language pairs to be trained, e.g. "en-cs,cs-en">
path_2_data=<path to the binarized data>
lang_list=<a file containing the list of languages, one per line>

fairseq-train $path_2_data \
  --encoder-normalize-before --decoder-normalize-before \
  --arch transformer --layernorm-embedding \
  --task translation_multi_simple_epoch \
  --sampling-method "temperature" \
  --sampling-temperature 1.5 \
  --encoder-langtok "src" \
  --decoder-langtok \
  --lang-dict "$lang_list" \
  --lang-pairs "$lang_pairs" \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
  --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
  --lr-scheduler inverse_sqrt --lr 3e-05 --min-lr -1 --warmup-updates 2500 --max-update 40000 \
  --dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
  --max-tokens 1024 --update-freq 2 \
  --save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints \
  --seed 222 --log-format simple --log-interval 2

Generate


model=<multilingual model>
source_lang=<source language>
target_lang=<target language>
TOKENIZER=<path to a customized tokenizer for decoding evaluation; it can be cat for sacrebleu>
fairseq-generate $path_2_data \
  --path $model \
  --task translation_multi_simple_epoch \
  --gen-subset test \
  --source-lang $source_lang \
  --target-lang $target_lang \
  --sacrebleu --remove-bpe 'sentencepiece' \
  --max-sentences 32 \
  --encoder-langtok "src" \
  --decoder-langtok \
  --lang-dict "$lang_list" \
  --lang-pairs "$lang_pairs" > ${source_lang}_${target_lang}

cat ${source_lang}_${target_lang} | grep -P "^H" |sort -V |cut -f 3- | sed 's/__[a-zA-Z0-9_\-]\+__ //g' |$TOKENIZER $target_lang > ${source_lang}_${target_lang}.hyp
cat ${source_lang}_${target_lang} | grep -P "^T" |sort -V |cut -f 2- | sed 's/__[a-zA-Z0-9_\-]\+__ //g' |$TOKENIZER $target_lang > ${source_lang}_${target_lang}.ref
sacrebleu -tok 'none' -s 'none' ${source_lang}_${target_lang}.ref < ${source_lang}_${target_lang}.hyp

wangyong1122 commented 4 years ago

@tangyuq Thank you very much for your kind feedback.

thpun commented 4 years ago

Good to see a new multilingual translation task. So with translation_multi_simple_epoch, it looks like we don't need to use the multilingual transformer to run the task. My experience is that the multilingual transformer's checkpoint is much larger than that of any single transformer, even when all parameters are shared.

@tangyuq Can you explain more about the difference between the original multilingual translation task and this new task?

tangyuq commented 4 years ago

Thank you for your interest.

I've uploaded a few examples along with a description here. The key new features are summarized below:

It can use any transformer model that consumes the same batch data structure as the translation task. The drawback is that it currently only supports a single model shared across all directions; however, it would not be difficult to modify it to support a different transformer per direction.

wangyong1122 commented 4 years ago

@tangyuq Hi, Yuqing, when would you release the models from https://arxiv.org/pdf/2008.00401.pdf? Thank you very much.

wangyong1122 commented 3 years ago

Hi, I find that there is a mismatch between mBART and the multilingual task. The source input of mBART is {XX XX XX </s> <LangID>}, whereas the multilingual model uses {<LangID> XX XX XX </s>}. @tangyuq
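
For reference, a toy sketch of the two source-side layouts being compared; the literal token strings are placeholders chosen for illustration, not fairseq's exact symbols, and fairseq applies the language tokens inside its task/dataset code.

# Toy illustration of the two source-side orderings described above.

def mbart_source(tokens, lang_tok="[en_XX]", eos="</s>"):
    # mBART-style source: XX XX XX </s> <LangID>
    return tokens + [eos, lang_tok]

def multilingual_source(tokens, lang_tok="__en__", eos="</s>"):
    # translation_multi_simple_epoch with --encoder-langtok "src":
    # <LangID> XX XX XX </s>
    return [lang_tok] + tokens + [eos]

src = ["Hello", "world"]
print(mbart_source(src))         # ['Hello', 'world', '</s>', '[en_XX]']
print(multilingual_source(src))  # ['__en__', 'Hello', 'world', '</s>']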

stale[bot] commented 2 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

zwhe99 commented 2 years ago

Hi, I find that there is a mismatch between mBART and the multilingual task. The source input of mBART is {XX XX XX </s> <LangID>}, whereas the multilingual model uses {<LangID> XX XX XX </s>}. @tangyuq

The lang tok styles of mbart25 and mbart50 seem to be different as well? The lang tok of mbart25 is [Lang]; however, mbart50 uses the multilingual setting, i.e., __Lang__ (source code). Hi @tangyuq, do you have any comments on this mismatch?