jogonba2 opened 4 years ago
Hmm, I haven't seen this error before. Can you try adding --ddp-backend=no_c10d
using the latest master branch?
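For concreteness, the suggested flag is simply appended to the training invocation; a minimal sketch, assuming the same train.py call and variables as in the Code section of the issue body below (remaining flags elided):

# Sketch: --ddp-backend=no_c10d selects fairseq's LegacyDistributedDataParallel
# (fairseq performs the gradient all-reduce itself) instead of torch's c10d
# DistributedDataParallel reducer.
CUDA_VISIBLE_DEVICES=0,1 python train.py "$FULL_TASK-bin" \
    --ddp-backend=no_c10d \
    --arch bart_large --task translation \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --optimizer adam --lr $LR --max-tokens $MAX_TOKENS
    # ...plus the remaining flags from the Code section, unchanged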
I also tried adding --ddp-backend=no_c10d with the master branch I used in my experiments (not the latest one), and it threw the same error. I will test it with the latest master branch once the GPUs are available and will edit this comment.
Thank you!
@jogonba2 Any solution? I am facing a similar error :(
Same error after validation, any solution?
❓ Questions and Help
What is your question?
I am fine-tuning BART-Large on a translation task through the fairseq command line, as in BART-Fairseq, but before the second epoch starts the process is terminated with the following error:
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: the launch timed out and was terminated (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f9582a79536 in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x7ae (0x7f9582cbcfbe in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f9582a69abd in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x1d9 (0x7f95cebab619 in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorc$
frame #4: c10d::Reducer::~Reducer() + 0x23a (0x7f95ceba0f6a in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f95ceb7fef2 in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python$
frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f95ce542506 in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x871b9b (0x7f95ceb80b9b in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0x2405b0 (0x7f95ce54f5b0 in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: + 0x2417fe (0x7f95ce5507fe in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0x10f998 (0x562f97760998 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #11: + 0x1ad971 (0x562f977fe971 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #12: + 0x10f998 (0x562f97760998 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #13: + 0x1ad971 (0x562f977fe971 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #14: + 0xfec08 (0x562f9774fc08 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #15: + 0x1100f7 (0x562f977610f7 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #16: + 0x11010d (0x562f9776110d in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #17: + 0x11010d (0x562f9776110d in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #18: + 0x110a97 (0x562f97761a97 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #19: + 0x110b34 (0x562f97761b34 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #20: + 0x1e91b3 (0x562f9783a1b3 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x2966 (0x562f97823d96 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #22: _PyFunction_FastCallKeywords + 0xfb (0x562f977c979b in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x6a0 (0x562f97821ad0 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #24: _PyFunction_FastCallKeywords + 0xfb (0x562f977c979b in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x416 (0x562f97821846 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #26: _PyEval_EvalCodeWithName + 0x2f9 (0x562f977674f9 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #27: _PyFunction_FastCallKeywords + 0x387 (0x562f977c9a27 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x14ce (0x562f978228fe in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #29: _PyEval_EvalCodeWithName + 0x2f9 (0x562f977674f9 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #30: PyEval_EvalCodeEx + 0x44 (0x562f977683c4 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #31: PyEval_EvalCode + 0x1c (0x562f977683ec in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #32: + 0x22f874 (0x562f97880874 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #33: PyRun_StringFlags + 0x7d (0x562f9788baad in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #34: PyRun_SimpleStringFlags + 0x3f (0x562f9788bb0f in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #35: + 0x23ac0d (0x562f9788bc0d in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #36: _Py_UnixMain + 0x3c (0x562f9788bf7c in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #37: __libc_start_main + 0xe7 (0x7f95de6d5b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #38: + 0x1e0122 (0x562f97831122 in /home/ml/users/jgonza38/anaconda3/bin/python)
Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 324, in distributed_main
    main(args, init_distributed=True)
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 117, in main
    valid_losses = train(args, trainer, task, epoch_itr, max_update)
  File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 187, in train
    log_output = trainer.train_step(samples)
  File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/trainer.py", line 412, in train_step
    logging_outputs, sample_size, ooms, ignore=is_dummy_batch,
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/trainer.py", line 685, in _aggregate_logging_outputs
    logging_outputs, extra_stats_to_sum, ignore=ignore
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/trainer.py", line 746, in _fast_stat_sync_sum
    group=self.data_parallel_process_group
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/distributed_utils.py", line 292, in all_reduce_dict
    cpu_data = _all_reduce_dict(cpu_data)
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/distributed_utils.py", line 288, in _all_reduce_dict
    buf = torch.stack(list(data.values())).to(device=device)
RuntimeError: CUDA error: the launch timed out and was terminated
I also used CUDA_LAUNCH_BLOCKING=1 to get a "more detailed error":
Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 324, in distributed_main
    main(args, init_distributed=True)
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 117, in main
    valid_losses = train(args, trainer, task, epoch_itr, max_update)
  File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 187, in train
    log_output = trainer.train_step(samples)
  File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/trainer.py", line 399, in train_step
    raise e
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/trainer.py", line 377, in train_step
    ignore_grad=is_dummy_batch,
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/tasks/fairseq_task.py", line 342, in train_step
    optimizer.backward(loss)
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/optim/fairseq_optimizer.py", line 81, in backward
    loss.backward()
  File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1591914855613/work/torch/lib/c10d/ProcessGroupNCCL.cpp:32, unhandled cuda error, NCCL version 2.4.8
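For reference, CUDA_LAUNCH_BLOCKING=1 forces synchronous kernel launches, so the failure is reported at the Python call that actually triggered it rather than at a later synchronization point. A sketch of how it is set, assuming the same training command as in the Code section below:

# Prefix the environment variable to the existing command; everything after
# train.py stays exactly as in the Code section.
CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0,1 python train.py "$FULL_TASK-bin" \
    --arch bart_large --task translation
    # ...remaining flags unchanged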
My interpretation of the traceback for the first error, "RuntimeError: CUDA error: the launch timed out and was terminated", is that _all_reduce_dict takes too long on the CPU. Could it be related to this? Is there an option to increase the CUDA timeout?
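Not an answer from the thread, but as far as I know the "launch timed out" message usually comes from the driver's display watchdog on GPUs that also drive a screen, and that limit is not adjustable from PyTorch. What can be tuned or inspected on the distributed side is sketched below, using two environment variables that do exist in this PyTorch/NCCL generation (a diagnostic sketch, not a fix):

# Diagnostic knobs before rerunning the command from the Code section:
export NCCL_DEBUG=INFO       # NCCL prints ring/collective setup and errors to stderr
export NCCL_BLOCKING_WAIT=1  # NCCL collectives honor the timeout passed to
                             # torch.distributed.init_process_group (default 30 min)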
Code
CUDA_VISIBLE_DEVICES=0,1 python train.py "$FULL_TASK-bin" \
    --max-epoch $MAX_EPOCHS \
    --max-tokens $MAX_TOKENS \
    --update-freq $UPDATE_FREQ \
    --lr-scheduler polynomial_decay \
    --lr $LR \
    --total-num-update $TOTAL_NUM_UPDATES \
    --warmup-updates $WARMUP_UPDATES \
    --restore-file $BART/model.pt \
    --save-dir $RESULTS_PATH \
    --task translation \
    --source-lang source \
    --target-lang target \
    --truncate-source \
    --layernorm-embedding \
    --share-all-embeddings \
    --share-decoder-input-output-embed \
    --reset-optimizer \
    --reset-dataloader \
    --reset-meters \
    --required-batch-size-multiple 1 \
    --arch bart_large \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --dropout 0.1 \
    --attention-dropout 0.1 \
    --weight-decay 0.01 \
    --optimizer adam \
    --adam-betas "(0.9, 0.999)" \
    --adam-eps 1e-08 \
    --clip-norm 0.1 \
    --no-last-checkpoints \
    --find-unused-parameters;
What have you tried?
What's your environment?
How you installed fairseq (pip, source): pip install --editable ./