facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Is layerdrop working only with --ddp-backend no_c10d? #3599

Open davidepatrucco opened 3 years ago

davidepatrucco commented 3 years ago

🐛 Bug

I'm experimenting with training a transformer model with layerdrop, but when I don't pass --ddp-backend no_c10d I get an error (with that flag set, training seems to work).

To Reproduce

WORKING

```
fairseq-train \
    ./ \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 2000 --warmup-init-lr 1e-07 \
    --dropout 0.1 --weight-decay 0.0 \
    --update-freq 32 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 5000 \
    --fp16 \
    --log-interval 1 \
    --skip-invalid-size-inputs-valid-test \
    --save-interval-updates 500 \
    --keep-interval-updates 5 \
    --keep-best-checkpoints 3 \
    --save-dir checkpoints \
    --activation-fn gelu_fast \
    --encoder-layers 12 \
    --decoder-layers 6 \
    --encoder-layerdrop 0.25 --decoder-layerdrop 0.25 --ddp-backend no_c10d
```

NOT WORKING

```
fairseq-train \
    ./ \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 2000 --warmup-init-lr 1e-07 \
    --dropout 0.1 --weight-decay 0.0 \
    --update-freq 32 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 5000 \
    --fp16 \
    --log-interval 1 \
    --skip-invalid-size-inputs-valid-test \
    --save-interval-updates 500 \
    --keep-interval-updates 5 \
    --keep-best-checkpoints 3 \
    --save-dir checkpoints \
    --activation-fn gelu_fast \
    --encoder-layers 12 \
    --decoder-layers 6 \
    --encoder-layerdrop 0.25 --decoder-layerdrop 0.25
```

```
Traceback (most recent call last):
  File "/usr/local/bin/fairseq-train", line 8, in <module>
    sys.exit(cli_main())
  File "/usr/local/lib/python3.6/dist-packages/fairseq_cli/train.py", line 352, in cli_main
    distributed_utils.call_main(args, main)
  File "/usr/local/lib/python3.6/dist-packages/fairseq/distributed_utils.py", line 286, in call_main
    nprocs=args.distributed_num_procs,
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.6/dist-packages/fairseq/distributed_utils.py", line 270, in distributed_main
    main(args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/fairseq_cli/train.py", line 125, in main
    valid_losses, should_stop = train(args, trainer, task, epoch_itr)
  File "/usr/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/fairseq_cli/train.py", line 208, in train
    log_output = trainer.train_step(samples)
  File "/usr/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/fairseq/trainer.py", line 512, in train_step
    raise e
  File "/usr/local/lib/python3.6/dist-packages/fairseq/trainer.py", line 486, in train_step
    ignore_grad=is_dummy_batch,
  File "/usr/local/lib/python3.6/dist-packages/fairseq/tasks/fairseq_task.py", line 416, in train_step
    loss, sample_size, logging_output = criterion(model, sample)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/fairseq/criterions/label_smoothed_cross_entropy.py", line 69, in forward
    net_output = model(**sample["net_input"])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 606, in forward
    if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
```
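To make the failure mode concrete, here is a minimal standalone sketch (not fairseq code; module names and sizes are made up) of why LayerDrop conflicts with the default c10d backend: when a layer is randomly skipped, its parameters receive no gradient that step, and DDP's reducer expects every parameter to participate unless unused-parameter detection is enabled. As I understand it, the no_c10d backend avoids this because it all-reduces gradients itself after the backward pass instead of relying on the c10d reducer.

```python
# Minimal sketch only: a toy LayerDrop stack wrapped in plain PyTorch DDP,
# run as a single "gloo" process just so the wrapper can be constructed.
import os
import random

import torch
import torch.distributed as dist
import torch.nn as nn


class LayerDropStack(nn.Module):
    def __init__(self, num_layers=4, dim=8, layerdrop=0.25):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        self.layerdrop = layerdrop

    def forward(self, x):
        for layer in self.layers:
            # LayerDrop: randomly skip whole layers during training, so their
            # parameters do not contribute to the loss on this step.
            if self.training and random.random() < self.layerdrop:
                continue
            x = layer(x)
        return x


def main():
    # Single-process "distributed" setup, enough to build the DDP wrapper.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = nn.parallel.DistributedDataParallel(
        LayerDropStack(),
        # With the default (False), DDP expects every parameter to receive a
        # gradient on each backward pass and complains once a layer is skipped;
        # True lets the reducer mark the skipped layer's parameters as ready.
        find_unused_parameters=True,
    )

    for _ in range(3):
        loss = model(torch.randn(2, 8)).sum()
        loss.backward()
        model.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```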

Code sample

Expected behavior

Environment

Additional context

alexeib commented 3 years ago

It should work with the fully sharded backend as well. c10d doesn't support dynamic code that may take different branches on different workers; that's a PyTorch limitation.
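For reference, a hedged sketch of the flag choices this answer implies; exact names depend on your fairseq version (newer releases also accept legacy_ddp as an alias for no_c10d, and fully_sharded needs fairscale installed):

```
# Sketch only: same training command as above, varying just the DDP backend.
# Works with layerdrop (legacy gradient all-reduce tolerates skipped layers):
fairseq-train ... --encoder-layerdrop 0.25 --decoder-layerdrop 0.25 --ddp-backend no_c10d
# Alternative suggested in this comment (assumes FSDP support via fairscale):
fairseq-train ... --encoder-layerdrop 0.25 --decoder-layerdrop 0.25 --ddp-backend fully_sharded
```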