For training, a single Pipe() module is created for the Transformer encoder-decoder model. So you need to set --pipeline-balance and --pipeline-devices in the training command, instead of --pipeline-encoder-balance, --pipeline-encoder-devices, --pipeline-decoder-balance, and --pipeline-decoder-devices.
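For example, the training command from the question below could keep all of its other flags and swap the per-module arguments for merged ones. A minimal sketch: the merged balance '[9,6,1]' on devices '[0,1,0]' is my assumption, folding the encoder's 8 partitions and the decoder's first device-0 partition together; the exact split depends on how many modules your model exposes.

# Sketch only: '[9,6,1]' / '[0,1,0]' are assumed merged values, not verified.
# The optimizer/eval flags from the original command are elided here.
fairseq-train data-bin/iwslt14.tokenized.de-en \
  --arch transformer_iwslt_de_en_pipeline_parallel \
  --pipeline-model-parallel \
  --pipeline-balance '[9,6,1]' \
  --pipeline-devices '[0,1,0]' \
  --pipeline-chunks 1 \
  --distributed-world-size 2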
For inference/generation, two Pipe() modules are created, one for the encoder and one for the decoder, since the encoder and decoder are called separately during generation. So in that case, you need to set --pipeline-encoder-balance, --pipeline-encoder-devices, --pipeline-decoder-balance, and --pipeline-decoder-devices instead.
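At generation time, the per-module flags from the question's command would carry over directly. A minimal sketch, assuming a trained checkpoint at checkpoints/checkpoint_best.pt (the path, beam settings, and balances here are illustrative, not taken from the thread):

# Sketch only: checkpoint path and decoding flags are assumptions.
fairseq-generate data-bin/iwslt14.tokenized.de-en \
  --path checkpoints/checkpoint_best.pt \
  --beam 5 --remove-bpe \
  --pipeline-model-parallel \
  --pipeline-encoder-balance '[8]' \
  --pipeline-encoder-devices '[0]' \
  --pipeline-decoder-balance '[1,6,1]' \
  --pipeline-decoder-devices '[0,1,0]' \
  --pipeline-chunks 1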
Awesome, works now. Thank you very much.
Hi guys,
I was trying to train a transformer model with pipeline parallelism. Is this supposed to work already?
The command I tried (following the translation example):
fairseq-train data-bin/iwslt14.tokenized.de-en \
  --arch transformer_iwslt_de_en_pipeline_parallel \
  --share-decoder-input-output-embed \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
  --dropout 0.3 --weight-decay 0.0001 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --max-tokens 4096 \
  --eval-bleu --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
  --eval-bleu-detok moses --eval-bleu-remove-bpe --eval-bleu-print-samples \
  --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
  --pipeline-model-parallel \
  --pipeline-encoder-balance '[8]' --pipeline-encoder-devices '[0]' \
  --pipeline-decoder-balance '[1,6,1]' --pipeline-decoder-devices '[0,1,0]' \
  --pipeline-chunks 1 --distributed-world-size 2
error: