facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

BART-Large: RuntimeError: CUDA error: the launch timed out and was terminated #2311

Open · jogonba2 opened this issue 4 years ago

jogonba2 commented 4 years ago

❓ Questions and Help

What is your question?

I am fine-tuning BART-Large on a translation task through the fairseq command line, as in Bart-Fairseq, but before the second epoch starts the process is terminated with the following error:


terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: the launch timed out and was terminated (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f9582a79536 in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x7ae (0x7f9582cbcfbe in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f9582a69abd in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: std::vector<c10d::Reducer::Bucket, std::allocator >::~vector() + 0x1d9 (0x7f95cebab619 in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorc$
frame #4: c10d::Reducer::~Reducer() + 0x23a (0x7f95ceba0f6a in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f95ceb7fef2 in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python$
frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f95ce542506 in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x871b9b (0x7f95ceb80b9b in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0x2405b0 (0x7f95ce54f5b0 in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: + 0x2417fe (0x7f95ce5507fe in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0x10f998 (0x562f97760998 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #11: + 0x1ad971 (0x562f977fe971 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #12: + 0x10f998 (0x562f97760998 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #13: + 0x1ad971 (0x562f977fe971 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #14: + 0xfec08 (0x562f9774fc08 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #15: + 0x1100f7 (0x562f977610f7 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #16: + 0x11010d (0x562f9776110d in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #17: + 0x11010d (0x562f9776110d in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #18: + 0x110a97 (0x562f97761a97 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #19: + 0x110b34 (0x562f97761b34 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #20: + 0x1e91b3 (0x562f9783a1b3 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x2966 (0x562f97823d96 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #22: _PyFunction_FastCallKeywords + 0xfb (0x562f977c979b in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x6a0 (0x562f97821ad0 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #24: _PyFunction_FastCallKeywords + 0xfb (0x562f977c979b in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x416 (0x562f97821846 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #26: _PyEval_EvalCodeWithName + 0x2f9 (0x562f977674f9 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #27: _PyFunction_FastCallKeywords + 0x387 (0x562f977c9a27 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x14ce (0x562f978228fe in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #29: _PyEval_EvalCodeWithName + 0x2f9 (0x562f977674f9 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #30: PyEval_EvalCodeEx + 0x44 (0x562f977683c4 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #31: PyEval_EvalCode + 0x1c (0x562f977683ec in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #32: + 0x22f874 (0x562f97880874 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #33: PyRun_StringFlags + 0x7d (0x562f9788baad in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #34: PyRun_SimpleStringFlags + 0x3f (0x562f9788bb0f in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #35: + 0x23ac0d (0x562f9788bc0d in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #36: _Py_UnixMain + 0x3c (0x562f9788bf7c in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #37: __libc_start_main + 0xe7 (0x7f95de6d5b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #38: + 0x1e0122 (0x562f97831122 in /home/ml/users/jgonza38/anaconda3/bin/python)

Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 324, in distributed_main
    main(args, init_distributed=True)
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 117, in main
    valid_losses = train(args, trainer, task, epoch_itr, max_update)
  File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 187, in train
    log_output = trainer.train_step(samples)
  File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/trainer.py", line 412, in train_step
    logging_outputs, sample_size, ooms, ignore=is_dummy_batch,
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/trainer.py", line 685, in _aggregate_logging_outputs
    logging_outputs, *extra_stats_to_sum, ignore=ignore
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/trainer.py", line 746, in _fast_stat_sync_sum
    group=self.data_parallel_process_group
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/distributed_utils.py", line 292, in all_reduce_dict
    cpu_data = _all_reduce_dict(cpu_data)
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/distributed_utils.py", line 288, in _all_reduce_dict
    buf = torch.stack(list(data.values())).to(device=device)
RuntimeError: CUDA error: the launch timed out and was terminated


I also ran with CUDA_LAUNCH_BLOCKING=1 to get a "more detailed error":


Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 324, in distributed_main
    main(args, init_distributed=True)
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 117, in main
    valid_losses = train(args, trainer, task, epoch_itr, max_update)
  File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 187, in train
    log_output = trainer.train_step(samples)
  File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/trainer.py", line 399, in train_step
    raise e
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/trainer.py", line 377, in train_step
    ignore_grad=is_dummy_batch,
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/tasks/fairseq_task.py", line 342, in train_step
    optimizer.backward(loss)
  File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/optim/fairseq_optimizer.py", line 81, in backward
    loss.backward()
  File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1591914855613/work/torch/lib/c10d/ProcessGroupNCCL.cpp:32, unhandled cuda error, NCCL version 2.4.8


My interpretation of the traceback for the first error ("RuntimeError: CUDA error: the launch timed out and was terminated") is that _all_reduce_dict spends too much time on the CPU side. Could the error be related to this? Is there an option to increase the CUDA timeout?
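
For context, the call that fails corresponds roughly to the following pattern (a simplified sketch of fairseq's distributed_utils.all_reduce_dict inferred from the traceback, not the exact source): the logging stats are stacked into a single tensor, copied to the GPU, and all-reduced across workers.

# Simplified sketch of the _all_reduce_dict path seen in the traceback (not the exact fairseq source).
from collections import OrderedDict
import torch
import torch.distributed as dist

def all_reduce_dict_sketch(data: OrderedDict, device):
    # data is assumed to hold 0-dim CPU tensors (the logging stats).
    # Stack them into one buffer, move it to the GPU, all-reduce across workers,
    # and scatter the reduced values back into a dict.
    buf = torch.stack(list(data.values())).to(device=device)  # the line that raises in the traceback
    dist.all_reduce(buf)
    return OrderedDict(zip(data.keys(), buf.unbind()))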

Code

CUDA_VISIBLE_DEVICES=0,1 python train.py "$FULL_TASK-bin" \
    --max-epoch $MAX_EPOCHS \
    --max-tokens $MAX_TOKENS \
    --update-freq $UPDATE_FREQ \
    --lr-scheduler polynomial_decay \
    --lr $LR \
    --total-num-update $TOTAL_NUM_UPDATES \
    --warmup-updates $WARMUP_UPDATES \
    --restore-file $BART/model.pt \
    --save-dir $RESULTS_PATH \
    --task translation \
    --source-lang source \
    --target-lang target \
    --truncate-source \
    --layernorm-embedding \
    --share-all-embeddings \
    --share-decoder-input-output-embed \
    --reset-optimizer \
    --reset-dataloader \
    --reset-meters \
    --required-batch-size-multiple 1 \
    --arch bart_large \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --dropout 0.1 \
    --attention-dropout 0.1 \
    --weight-decay 0.01 \
    --optimizer adam \
    --adam-betas "(0.9, 0.999)" \
    --adam-eps 1e-08 \
    --clip-norm 0.1 \
    --no-last-checkpoints \
    --find-unused-parameters;

What have you tried?

  1. To reduce the --max-tokens in case it was related to a GPU memory limitation.
  2. To use --distributed-no-spawn (https://github.com/pytorch/fairseq/issues/826)
  3. To use different versions of pytorch (1.4.0 and 1.5.1)
  4. To use different versions of fairseq (the master branch and the latest release of December)
  5. The experiment runs fine on another machine with 2x GeForce RTX 2080 Ti and the same python/pytorch/fairseq environment (a bare two-GPU NCCL check outside fairseq is sketched after this list).
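
Since the same setup works on the other machine, one thing I could still do is rule out the NCCL/GPU setup on this machine by running a bare two-GPU all-reduce outside fairseq. Below is a minimal sketch using only standard PyTorch distributed APIs; the file name nccl_check.py and the hard-coded port are my own placeholders, not anything from fairseq. If this also hangs or errors, the problem is likely in the driver/NCCL setup rather than in fairseq itself.

# nccl_check.py - minimal two-GPU NCCL all-reduce sanity check (diagnostic sketch)
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"  # arbitrary free port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # Same basic pattern as fairseq's stat sync: build a tensor and all-reduce it.
    buf = torch.ones(1024, device=f"cuda:{rank}")
    dist.all_reduce(buf)
    torch.cuda.synchronize()
    print(f"rank {rank}: all_reduce ok, first element = {buf[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # assumes the two GPUs from CUDA_VISIBLE_DEVICES=0,1
    mp.spawn(run, args=(world_size,), nprocs=world_size)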

What's your environment?

myleott commented 4 years ago

Hmm, I haven't seen this error before. Can you try adding --ddp-backend=no_c10d using the latest master branch?

jogonba2 commented 4 years ago

I also tried adding --ddp-backend=no_c10d with the master branch I used in my experiments (not the latest one), and it threw the same error. I will test it with the latest master branch once the GPUs are available, and I will edit this comment.

Thank you!

alsheabi commented 3 years ago

@jogonba2 Any solution? I am facing a similar error :(

chaseleecn commented 3 years ago

Same error after the validation, any solution?

Jxu-Thu commented 3 years ago

Same error after the validation, any solution?

nxcvbc commented 2 years ago

Same error after the validation, any solution?