facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Error when trying to train with pipeline parallelism #2782

Closed thies1006 closed 4 years ago

thies1006 commented 4 years ago

Hi guys,

I was trying to train a transformer model with pipeline parallelism. Is this supposed to work already?

The command I tried (following the translation example):

```
fairseq-train data-bin/iwslt14.tokenized.de-en \
    --arch transformer_iwslt_de_en_pipeline_parallel --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --eval-bleu --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses --eval-bleu-remove-bpe --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --pipeline-model-parallel \
    --pipeline-encoder-balance '[8]' --pipeline-encoder-devices '[0]' \
    --pipeline-decoder-balance '[1,6,1]' --pipeline-decoder-devices '[0,1,0]' \
    --pipeline-chunks 1 --distributed-world-size 2
```

Error:

```

2020-10-23 17:17:08 | INFO | fairseq.tasks.translation | [de] dictionary: 8848 types
2020-10-23 17:17:08 | INFO | fairseq.tasks.translation | [en] dictionary: 6632 types
2020-10-23 17:17:08 | INFO | fairseq.data.data_utils | loaded 7283 examples from: data-bin/iwslt14.tokenized.de-en/valid.de-en.de
2020-10-23 17:17:08 | INFO | fairseq.data.data_utils | loaded 7283 examples from: data-bin/iwslt14.tokenized.de-en/valid.de-en.en
2020-10-23 17:17:08 | INFO | fairseq.tasks.translation | data-bin/iwslt14.tokenized.de-en valid de-en 7283 examples
Traceback (most recent call last):
  File "/secondary/thies/.virtualenvs/pytorch-23102020/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/tertiary/thies/fairseq/fairseq_cli/train.py", line 352, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/tertiary/thies/fairseq/fairseq/distributed_utils.py", line 301, in call_main
    cfg.distributed_training.distributed_world_size,
  File "/secondary/thies/.virtualenvs/pytorch-23102020/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 247, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/secondary/thies/.virtualenvs/pytorch-23102020/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 205, in start_processes
    while not context.join():
  File "/secondary/thies/.virtualenvs/pytorch-23102020/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 166, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/secondary/thies/.virtualenvs/pytorch-23102020/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/tertiary/thies/fairseq/fairseq/distributed_utils.py", line 283, in distributed_main
    main(cfg, **kwargs)
  File "/tertiary/thies/fairseq/fairseq_cli/train.py", line 74, in main
    model = task.build_model(cfg.model)
  File "/tertiary/thies/fairseq/fairseq/tasks/translation.py", line 327, in build_model
    model = super().build_model(args)
  File "/tertiary/thies/fairseq/fairseq/tasks/fairseq_task.py", line 548, in build_model
    model = models.build_model(args, self)
  File "/tertiary/thies/fairseq/fairseq/models/__init__.py", line 56, in build_model
    return ARCH_MODEL_REGISTRY[cfg.arch].build_model(cfg, task)
  File "/tertiary/thies/fairseq/fairseq/model_parallel/models/pipeline_parallel_transformer/model.py", line 277, in build_model
    checkpoint=args.pipeline_checkpoint,
  File "/tertiary/thies/fairseq/fairseq/model_parallel/models/pipeline_parallel_transformer/model.py", line 57, in __init__
    + [encoder.final_layer_norm]
  File "/secondary/thies/.virtualenvs/pytorch-23102020/lib/python3.6/site-packages/torch/nn/modules/module.py", line 796, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'TransformerEncoder' object has no attribute 'embedding_layer'
```
shruti-bh commented 4 years ago

For training, a single Pipe() module is created for the whole Transformer encoder-decoder model, so in the training command you need to set --pipeline-balance and --pipeline-devices instead of --pipeline-encoder-balance, --pipeline-encoder-devices, --pipeline-decoder-balance, and --pipeline-decoder-devices.

For inference/generation, two Pipe() modules are created, one for the encoder and one for the decoder, since the encoder and decoder are called separately during generation. In that case you need to set --pipeline-encoder-balance, --pipeline-encoder-devices, --pipeline-decoder-balance, and --pipeline-decoder-devices instead.
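To make this concrete, here is a minimal sketch of both commands. The balance and device values are illustrative assumptions, not tested settings: for training, the single balance list must sum to the total number of pipeline partitions in the combined model (8 encoder + 8 decoder = 16 in your lists above), and '[0,1]' assumes two local GPUs.

```
# Training: one Pipe() wraps the whole encoder-decoder model,
# so only --pipeline-balance / --pipeline-devices are set.
# '[9,7]' is an illustrative split of the 16 partitions across 2 GPUs.
fairseq-train data-bin/iwslt14.tokenized.de-en \
    --arch transformer_iwslt_de_en_pipeline_parallel --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --pipeline-model-parallel \
    --pipeline-balance '[9,7]' --pipeline-devices '[0,1]' \
    --pipeline-chunks 1 --distributed-world-size 2

# Generation: the encoder and decoder get separate Pipe() modules,
# so the encoder/decoder-specific flags from the original command apply here.
# The checkpoint path is a placeholder.
fairseq-generate data-bin/iwslt14.tokenized.de-en \
    --path checkpoints/checkpoint_best.pt \
    --beam 5 --remove-bpe \
    --pipeline-model-parallel \
    --pipeline-encoder-balance '[8]' --pipeline-encoder-devices '[0]' \
    --pipeline-decoder-balance '[1,6,1]' --pipeline-decoder-devices '[0,1,0]' \
    --pipeline-chunks 1
```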

thies1006 commented 4 years ago

Awesome, works now. Thank you very much.