facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Multilingual Transformer with shared decoder #371

Closed MaksymDel closed 5 years ago

MaksymDel commented 5 years ago

Hi!

If we share decoder parameters in the multilingual transformer, we need to tell the shared decoder which language to decode into.

This could be done by embedding the target language ID and passing it directly to the decoder.

Alternatively, one could prepend this language tag to the actual sentence (so that it becomes, e.g., the first token of the sentence).
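
For concreteness, here is a minimal PyTorch sketch of the first option (a learned language embedding added to the decoder's input embeddings). The class and all names here are hypothetical illustrations, not fairseq internals:

```python
import torch
import torch.nn as nn

class LangAwareEmbedding(nn.Module):
    """Hypothetical sketch (not fairseq code): add a learned
    target-language embedding to every decoder input embedding."""

    def __init__(self, vocab_size: int, num_langs: int, dim: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.lang_emb = nn.Embedding(num_langs, dim)

    def forward(self, tokens: torch.Tensor, lang_id: int) -> torch.Tensor:
        # tokens: (batch, seq_len); lang_id selects the target language
        lang = self.lang_emb(torch.tensor(lang_id, device=tokens.device))
        return self.tok_emb(tokens) + lang  # broadcasts over batch and time

emb = LangAwareEmbedding(vocab_size=32000, num_langs=4, dim=512)
out = emb(torch.randint(0, 32000, (2, 7)), lang_id=1)  # shape: (2, 7, 512)
```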

How is it done in fairseq's multilingual transformer?

Thank you, Maksym

madaanpulkit commented 5 years ago

How do you use it for multiple target languages? The example only covers multiple sources and one target language.

pipibjc commented 5 years ago

We added --decoder-langtok support in #620. You can specify --decoder-langtok for both training and inference. It feeds the target language token as the first token to the decoder.
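
Conceptually, the effect on the decoder input looks like this (a toy sketch, not fairseq's actual code path; the `__lang__` token format is an assumption for illustration):

```python
# Toy illustration of --decoder-langtok: the target-language token is fed
# to the decoder as its first input token instead of a plain BOS marker.
def decoder_input(tgt_tokens, tgt_lang, use_langtok):
    first = f"__{tgt_lang}__" if use_langtok else "<bos>"
    return [first] + tgt_tokens

print(decoder_input(["je", "parle"], "fr", use_langtok=False))
# ['<bos>', 'je', 'parle']
print(decoder_input(["je", "parle"], "fr", use_langtok=True))
# ['__fr__', 'je', 'parle']
```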

madaanpulkit commented 5 years ago

> We added --decoder-langtok support in #620. You can specify --decoder-langtok for both training and inference. It feeds the target language token as the first token to the decoder.

@pipibjc can you please add an example for the many-to-many multilingual translation case? Right now the example only covers the many-to-one scenario.

pipibjc commented 5 years ago

@madaanpulkit sure, I have a draft example of how to train a many-to-many multilingual translation model, but I need to clean it up a bit. I will update the example page shortly.

madaanpulkit commented 5 years ago

A draft would work for the time being (pulkit.madaan@ymail.com). Thanks for the quick replies.

pipibjc commented 5 years ago

Here is an example that uses the binarized data from the multilingual example. It just demonstrates how to specify the command line correctly, without tuning the hyper-parameters:

Training:

```bash
CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt17.de_fr.en.bpe16k/ \
  --max-epoch 50 \
  --ddp-backend=no_c10d \
  --task multilingual_translation --arch multilingual_transformer_iwslt_de_en \
  --share-decoders --share-decoder-input-output-embed \
  --optimizer adam --adam-betas '(0.9, 0.98)' \
  --lr 0.0005 --lr-scheduler inverse_sqrt --min-lr '1e-09' \
  --warmup-updates 4000 --warmup-init-lr '1e-07' \
  --label-smoothing 0.1 --criterion label_smoothed_cross_entropy \
  --dropout 0.3 --weight-decay 0.0001 \
  --save-dir checkpoints/multilingual_transformer \
  --max-tokens 4000 \
  --update-freq 8 \
  --max-update 20 --log-format json \
  --lang-pairs de-en,fr-en,en-fr,en-de --encoder-langtok tgt
```

Inference:

```bash
CUDA_VISIBLE_DEVICES=0 python generate.py data-bin/iwslt17.de_fr.en.bpe16k/ \
  --task multilingual_translation \
  --path checkpoints/multilingual_transformer/checkpoint_best.pt \
  --source-lang en --target-lang fr --gen-subset valid \
  --lang-pairs de-en,fr-en,en-fr,en-de --encoder-langtok tgt
```

madaanpulkit commented 5 years ago

> Here is an example that uses the binarized data from the multilingual example. […]

@pipibjc thanks for the help. Any particular reason behind using --encoder-langtok rather than --decoder-langtok?

pipibjc commented 5 years ago

I have experimented with both --encoder-langtok tgt and --decoder-langtok on the many-to-many case, but I didn't find any difference. I used --encoder-langtok tgt in the example just because the original paper suggested doing so.
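
To make the difference concrete, here is a toy sketch of where the language token lands under each flag for an en→fr pair (the `__fr__` token format is an assumption for illustration, not taken from the fairseq source):

```python
src = ["hello", "world"]
tgt = ["bonjour", "monde"]

# --encoder-langtok tgt: the *encoder* input gets the target-language
# token prepended to the source sentence; the decoder input is unchanged.
encoder_side = ["__fr__"] + src  # ['__fr__', 'hello', 'world']

# --decoder-langtok: the source is unchanged; the *decoder* instead
# receives the target-language token as its first input token.
decoder_side = ["__fr__"] + tgt  # ['__fr__', 'bonjour', 'monde']
```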

madaanpulkit commented 5 years ago

> I have experimented with both --encoder-langtok tgt and --decoder-langtok on the many-to-many case, but I didn't find any difference. I used --encoder-langtok tgt in the example just because the original paper suggested doing so.

I tried both, and --encoder-langtok tgt worked better for me.

masonreznov commented 3 years ago

Hello @pipibjc. As the example is many-to-one, it is intuitive to fix --tgt-dict during preprocessing. However, in a many-to-many scenario every language is both a possible source and a possible target. So how is --share-decoders enabled in this setting?