I'm experimenting with training a Transformer model with LayerDrop, but when I don't use --ddp-backend no_c10d I get the error below (with this parameter set, training works).
Traceback (most recent call last):
File "/usr/local/bin/fairseq-train", line 8, in <module>
sys.exit(cli_main())
File "/usr/local/lib/python3.6/dist-packages/fairseq_cli/train.py", line 352, in cli_main
distributed_utils.call_main(args, main)
File "/usr/local/lib/python3.6/dist-packages/fairseq/distributed_utils.py", line 286, in call_main
nprocs=args.distributed_num_procs,
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 3 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/usr/local/lib/python3.6/dist-packages/fairseq/distributed_utils.py", line 270, in distributed_main
main(args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/fairseq_cli/train.py", line 125, in main
valid_losses, should_stop = train(args, trainer, task, epoch_itr)
File "/usr/lib/python3.6/contextlib.py", line 52, in inner
return func(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/fairseq_cli/train.py", line 208, in train
log_output = trainer.train_step(samples)
File "/usr/lib/python3.6/contextlib.py", line 52, in inner
return func(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/fairseq/trainer.py", line 512, in train_step
raise e
File "/usr/local/lib/python3.6/dist-packages/fairseq/trainer.py", line 486, in train_step
ignore_grad=is_dummy_batch,
File "/usr/local/lib/python3.6/dist-packages/fairseq/tasks/fairseq_task.py", line 416, in train_step
loss, sample_size, logging_output = criterion(model, sample)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/fairseq/criterions/label_smoothed_cross_entropy.py", line 69, in forward
net_output = model(**sample["net_input"])
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 606, in forward
if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
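The constraint behind this error can be illustrated with a small pure-Python sketch. This is a toy stand-in for DDP's reducer, not the real c10d code, and all names in it are made up:

```python
def reduce_gradients(per_worker_used, all_params):
    """Toy stand-in for DDP's gradient bucketing -- NOT the real c10d code.

    The c10d backend assumes every registered parameter produces a gradient
    on every worker in every step, so gradient buckets can be all-reduced in
    a fixed order. A stochastic branch such as LayerDrop breaks that
    assumption: a dropped layer's parameters get no gradient on that worker.
    """
    for rank, used in enumerate(per_worker_used):
        unused = sorted(set(all_params) - set(used))
        if unused:
            raise RuntimeError(
                f"rank {rank}: parameters {unused} were not used in producing loss"
            )
    # Every worker touched every parameter: the reduction can proceed.
    return {name: "all-reduced" for name in all_params}

params = ["encoder.layers.0.weight", "encoder.layers.1.weight"]

# All workers used every layer: reduction succeeds.
reduce_gradients([params, params], params)

# Worker 1 skipped layer 1 (as LayerDrop may do): reduction fails.
try:
    reduce_gradients([params, params[:1]], params)
except RuntimeError as e:
    print(e)
```

find_unused_parameters=True makes real DDP walk the autograd graph to detect such skipped parameters and mark them as ready, at the cost of extra overhead per iteration.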
Code sample
Expected behavior
Environment
- fairseq Version (e.g., 1.0 or master): master
- PyTorch Version (e.g., 1.0): 1.7
- OS (e.g., Linux): Linux
- How you installed fairseq (pip, source): pip install fairseq
- Build command you used (if compiling from source):
This should work with the fully sharded backend as well. c10d doesn't support dynamic code that may take different branches on different workers; that's a PyTorch limitation.
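To illustrate why LayerDrop takes different branches on different workers, here is a minimal sketch of the sampling step. This is a hypothetical helper, not fairseq's actual implementation:

```python
import random

def sample_layers_to_keep(num_layers, layerdrop, rng):
    """LayerDrop skips each layer independently with probability `layerdrop`.

    In multi-process training each worker draws from its own RNG stream, so
    two workers can execute different subsets of layers in the same step --
    exactly the dynamic control flow the c10d backend cannot handle.
    """
    return [i for i in range(num_layers) if rng.random() >= layerdrop]

# With 12 encoder layers and --encoder-layerdrop 0.25 (as in the commands
# above), workers with different RNG states generally keep different subsets:
worker0 = sample_layers_to_keep(12, 0.25, random.Random(0))
worker1 = sample_layers_to_keep(12, 0.25, random.Random(1))
```

Since a skipped layer's parameters receive no gradient on that worker, c10d's fixed-order bucket reduction deadlocks, while the no_c10d backend tolerates the mismatch.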
🐛 Bug
To Reproduce
WORKING:

fairseq-train \
    ./ \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 2000 --warmup-init-lr 1e-07 \
    --dropout 0.1 --weight-decay 0.0 \
    --update-freq 32 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 5000 \
    --fp16 \
    --log-interval 1 \
    --skip-invalid-size-inputs-valid-test \
    --save-interval-updates 500 \
    --keep-interval-updates 5 \
    --keep-best-checkpoints 3 \
    --save-dir checkpoints \
    --activation-fn gelu_fast \
    --encoder-layers 12 \
    --decoder-layers 6 \
    --encoder-layerdrop 0.25 --decoder-layerdrop 0.25 \
    --ddp-backend no_c10d
NOT WORKING (identical command, but without --ddp-backend no_c10d):

fairseq-train \
    ./ \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 2000 --warmup-init-lr 1e-07 \
    --dropout 0.1 --weight-decay 0.0 \
    --update-freq 32 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 5000 \
    --fp16 \
    --log-interval 1 \
    --skip-invalid-size-inputs-valid-test \
    --save-interval-updates 500 \
    --keep-interval-updates 5 \
    --keep-best-checkpoints 3 \
    --save-dir checkpoints \
    --activation-fn gelu_fast \
    --encoder-layers 12 \
    --decoder-layers 6 \
    --encoder-layerdrop 0.25 --decoder-layerdrop 0.25
Additional context