dropreg / R-Drop


CUDA error: CUBLAS_STATUS_EXECUTION_FAILED #9

Closed · paul-chelarescu closed this issue 3 years ago

paul-chelarescu commented 3 years ago

Hi, after following the instructions here to make the code run for abstractive text summarization, I am running into the following issue:

2021-08-02 18:15:48 | INFO | fairseq_cli.train | task: RDropTranslationTask
2021-08-02 18:15:48 | INFO | fairseq_cli.train | model: BARTModel
2021-08-02 18:15:48 | INFO | fairseq_cli.train | criterion: RegLabelSmoothedCrossEntropyCriterion
2021-08-02 18:15:48 | INFO | fairseq_cli.train | num. model params: 406,290,432 (num. trained: 406,290,432)
2021-08-02 18:15:53 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
2021-08-02 18:15:53 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
2021-08-02 18:15:53 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-08-02 18:15:53 | INFO | fairseq.utils | rank   0: capabilities =  6.0  ; total memory = 15.899 GB ; name = Tesla P100-PCIE-16GB                    
2021-08-02 18:15:53 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-08-02 18:15:53 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2021-08-02 18:15:53 | INFO | fairseq_cli.train | max tokens per GPU = 1024 and batch size per GPU = None
2021-08-02 18:15:53 | INFO | fairseq.trainer | Preparing to load checkpoint /content/bart.large/model.pt
tcmalloc: large alloc 1625169920 bytes == 0x5612fbcaa000 @  0x7f8425b51b6b 0x7f8425b71379 0x7f838e16525e 0x7f838e1669d2 0x7f838ff265f5 0x7f8401ea8c09 0x561256deea65 0x561256daf7b2 0x561256e22e65 0x561256e1e235 0x561256db034b 0x561256dafe59 0x561256ef725d 0x561256e66c3b 0x561256daef01 0x561256ea0c0d 0x561256e230d8 0x561256e1e235 0x561256cefe2c 0x561256e20318 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a
tcmalloc: large alloc 1625169920 bytes == 0x56135ca8c000 @  0x7f8425b51b6b 0x7f8425b71379 0x7f838e16525e 0x7f838e1669d2 0x7f838ff265f5 0x7f8401ea8c09 0x561256deea65 0x561256daf7b2 0x561256e22e65 0x561256e1e235 0x561256db034b 0x561256dafe59 0x561256ef725d 0x561256e66c3b 0x561256daef01 0x561256ea0c0d 0x561256e230d8 0x561256e1e235 0x561256cefe2c 0x561256e20318 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a
2021-08-02 18:16:00 | INFO | fairseq.trainer | NOTE: your device does NOT support faster training with --fp16, please switch to FP32 which is likely to be faster
2021-08-02 18:16:00 | INFO | fairseq.trainer | Loaded checkpoint /content/bart.large/model.pt (epoch 41 @ 0 updates)
2021-08-02 18:16:00 | INFO | fairseq.trainer | loading train data for epoch 1
2021-08-02 18:16:00 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.source
2021-08-02 18:16:00 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.target
2021-08-02 18:16:00 | INFO | fairseq.tasks.translation | /content/cnn-dailymail/cnn_dm-bin/ train source-target 287227 examples
2021-08-02 18:16:00 | WARNING | fairseq.tasks.fairseq_task | 4 samples have invalid sizes and will be skipped, max_positions=(1024, 1024), first few sample ids=[189447, 112053, 286032, 172051]
2021-08-02 18:16:01 | INFO | fairseq.trainer | begin training epoch 1
2021-08-02 18:16:11 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 64.0
2021-08-02 18:16:20 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 32.0
2021-08-02 18:16:29 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 16.0
2021-08-02 18:16:38 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 8.0
2021-08-02 18:16:48 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 4.0
2021-08-02 18:16:57 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 2.0
2021-08-02 18:17:06 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 1.0
2021-08-02 18:17:15 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
2021-08-02 18:17:30 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
Traceback (most recent call last):
  File "/usr/local/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 449, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/content/R-Drop/fairseq_src/fairseq/distributed/utils.py", line 361, in call_main
    main(cfg, **kwargs)
  File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 143, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/usr/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 243, in train
    log_output = trainer.train_step(samples)
  File "/usr/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/content/R-Drop/fairseq_src/fairseq/trainer.py", line 587, in train_step
    raise e
  File "/content/R-Drop/fairseq_src/fairseq/trainer.py", line 561, in train_step
    ignore_grad=is_dummy_batch,
  File "/content/R-Drop/fairseq_src/examples/summeration_rdrop/summeration_rdrop_src/rdrop_translation.py", line 22, in train_step
    loss, sample_size, logging_output = criterion.forward_reg(model, sample, optimizer, 0.7, ignore_grad)
  File "/content/R-Drop/fairseq_src/examples/summeration_rdrop/summeration_rdrop_src/loss/rdrop_cross_entropy_loss.py", line 156, in forward_reg
    optimizer.backward(loss)
  File "/content/R-Drop/fairseq_src/fairseq/optim/fp16_optimizer.py", line 101, in backward
    loss.backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

I am using CUDA 11.4 (I tried 11.0 before), PyTorch 1.8.1, and Python 3.7. I have preprocessed the CNN/DailyMail data as instructed, I am using bart.large, and script/run_train.sh is in its default configuration.
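For reference, here is a minimal sketch of how I checked the device and framework versions (my own check, not part of the R-Drop code); the P100 reports compute capability 6.0, which matches the fp16 warning in the log above:

```python
import torch

# Hedged sanity check of the runtime environment (nothing R-Drop specific).
print(torch.__version__)                    # 1.8.1 in my setup
print(torch.version.cuda)                   # CUDA toolkit the wheel was built against
print(torch.cuda.get_device_name(0))        # Tesla P100-PCIE-16GB
print(torch.cuda.get_device_capability(0))  # (6, 0) -> no tensor cores, slow fp16
```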

If I run without the --fp16 option, training fails in the following way instead:

2021-08-02 18:27:32 | INFO | fairseq_cli.train | task: RDropTranslationTask
2021-08-02 18:27:32 | INFO | fairseq_cli.train | model: BARTModel
2021-08-02 18:27:32 | INFO | fairseq_cli.train | criterion: RegLabelSmoothedCrossEntropyCriterion
2021-08-02 18:27:32 | INFO | fairseq_cli.train | num. model params: 406,290,432 (num. trained: 406,290,432)
2021-08-02 18:27:37 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
2021-08-02 18:27:37 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
2021-08-02 18:27:37 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-08-02 18:27:37 | INFO | fairseq.utils | rank   0: capabilities =  6.0  ; total memory = 15.899 GB ; name = Tesla P100-PCIE-16GB                    
2021-08-02 18:27:37 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-08-02 18:27:37 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2021-08-02 18:27:37 | INFO | fairseq_cli.train | max tokens per GPU = 1024 and batch size per GPU = None
2021-08-02 18:27:37 | INFO | fairseq.trainer | Preparing to load checkpoint /content/bart.large/model.pt
tcmalloc: large alloc 1625169920 bytes == 0x5610c8c0c000 @  0x7f6dd1f4eb6b 0x7f6dd1f6e379 0x7f6d3a56225e 0x7f6d3a5639d2 0x7f6d3c3235f5 0x7f6dae2a5c09 0x560ff2dfaa65 0x560ff2dbb7b2 0x560ff2e2ee65 0x560ff2e2a235 0x560ff2dbc34b 0x560ff2dbbe59 0x560ff2f0325d 0x560ff2e72c3b 0x560ff2dbaf01 0x560ff2eacc0d 0x560ff2e2f0d8 0x560ff2e2a235 0x560ff2cfbe2c 0x560ff2e2c318 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a
tcmalloc: large alloc 1625169920 bytes == 0x56112a1ee000 @  0x7f6dd1f4eb6b 0x7f6dd1f6e379 0x7f6d3a56225e 0x7f6d3a5639d2 0x7f6d3c3235f5 0x7f6dae2a5c09 0x560ff2dfaa65 0x560ff2dbb7b2 0x560ff2e2ee65 0x560ff2e2a235 0x560ff2dbc34b 0x560ff2dbbe59 0x560ff2f0325d 0x560ff2e72c3b 0x560ff2dbaf01 0x560ff2eacc0d 0x560ff2e2f0d8 0x560ff2e2a235 0x560ff2cfbe2c 0x560ff2e2c318 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a
2021-08-02 18:27:42 | INFO | fairseq.trainer | Loaded checkpoint /content/bart.large/model.pt (epoch 41 @ 0 updates)
2021-08-02 18:27:42 | INFO | fairseq.trainer | loading train data for epoch 1
2021-08-02 18:27:43 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.source
2021-08-02 18:27:43 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.target
2021-08-02 18:27:43 | INFO | fairseq.tasks.translation | /content/cnn-dailymail/cnn_dm-bin/ train source-target 287227 examples
2021-08-02 18:27:43 | WARNING | fairseq.tasks.fairseq_task | 4 samples have invalid sizes and will be skipped, max_positions=(1024, 1024), first few sample ids=[189447, 112053, 286032, 172051]
2021-08-02 18:27:44 | INFO | fairseq.trainer | begin training epoch 1
/content/R-Drop/fairseq_src/fairseq/utils.py:345: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
  "amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
2021-08-02 18:28:42 | INFO | train_inner | epoch 001:    100 / 253944 loss=14.455, nll_loss=9.638, ppl=796.58, wps=117.7, ups=1.72, wpb=68.4, bsz=1.1, num_updates=100, lr=6e-06, gnorm=232.648, clip=100, train_wall=58, gb_free=4.4, wall=66
2021-08-02 18:29:37 | INFO | train_inner | epoch 001:    200 / 253944 loss=10.224, nll_loss=6.292, ppl=78.34, wps=125.7, ups=1.81, wpb=69.4, bsz=1.1, num_updates=200, lr=1.2e-05, gnorm=34.896, clip=100, train_wall=55, gb_free=6.8, wall=121
Traceback (most recent call last):
  File "/usr/local/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 449, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/content/R-Drop/fairseq_src/fairseq/distributed/utils.py", line 361, in call_main
    main(cfg, **kwargs)
  File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 143, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/usr/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 243, in train
    log_output = trainer.train_step(samples)
  File "/usr/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/content/R-Drop/fairseq_src/fairseq/trainer.py", line 587, in train_step
    raise e
  File "/content/R-Drop/fairseq_src/fairseq/trainer.py", line 561, in train_step
    ignore_grad=is_dummy_batch,
  File "/content/R-Drop/fairseq_src/examples/summeration_rdrop/summeration_rdrop_src/rdrop_translation.py", line 22, in train_step
    loss, sample_size, logging_output = criterion.forward_reg(model, sample, optimizer, 0.7, ignore_grad)
  File "/content/R-Drop/fairseq_src/examples/summeration_rdrop/summeration_rdrop_src/loss/rdrop_cross_entropy_loss.py", line 156, in forward_reg
    optimizer.backward(loss)
  File "/content/R-Drop/fairseq_src/fairseq/optim/fairseq_optimizer.py", line 99, in backward
    loss.backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
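If it helps to narrow this down, both failures happen inside a strided batched GEMM. A standalone batched matmul exercises the same cuBLAS calls named in the two tracebacks; this is just a sketch for isolating the problem from fairseq/R-Drop, not code from the repo:

```python
import torch

# Standalone sketch (not R-Drop code): batched matmuls hit the same
# cuBLAS strided-batched GEMM paths as the attention layers in the tracebacks.
for dtype in (torch.float16, torch.float32):
    a = torch.randn(16, 64, 128, device="cuda", dtype=dtype)
    b = torch.randn(16, 128, 64, device="cuda", dtype=dtype)
    c = torch.bmm(a, b)        # fp16 -> cublasGemmStridedBatchedEx, fp32 -> cublasSgemmStridedBatched
    torch.cuda.synchronize()   # surface any asynchronous CUDA/cuBLAS error here
    print(dtype, c.shape)
```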

I have also tried the bart.base model, thinking the problem could be the memory requirements (my GPU only has 16 GB), but then I run into dictionary size issues as described here.

Any advice on the above?

paul-chelarescu commented 3 years ago

Update: I believe I have fixed the issue above. I installed PyTorch 1.9 with CUDA 11.1 via pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html, following this link.

Even though it has hit OOMs a couple of times so far, it seems to recover from them and keep training.

Perhaps update the README to list PyTorch 1.9 as a requirement?
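A quick sketch of how I verified that the new wheel actually pairs PyTorch 1.9 with CUDA 11.1 (just standard PyTorch checks, nothing R-Drop specific):

```python
import torch

# Post-install sanity check (assumes the cu111 wheel from the command above).
print(torch.__version__)              # expect 1.9.0+cu111
print(torch.version.cuda)             # expect 11.1
print(torch.cuda.is_available())      # expect True
print(torch.backends.cudnn.version()) # cuDNN bundled with the wheel
```

With the new environment, the run gets past the point where it previously crashed: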

2021-08-02 18:47:18 | INFO | fairseq_cli.train | task: RDropTranslationTask
2021-08-02 18:47:18 | INFO | fairseq_cli.train | model: BARTModel
2021-08-02 18:47:18 | INFO | fairseq_cli.train | criterion: RegLabelSmoothedCrossEntropyCriterion
2021-08-02 18:47:18 | INFO | fairseq_cli.train | num. model params: 406,290,432 (num. trained: 406,290,432)
2021-08-02 18:47:23 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
2021-08-02 18:47:23 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
2021-08-02 18:47:23 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-08-02 18:47:23 | INFO | fairseq.utils | rank   0: capabilities =  6.0  ; total memory = 15.899 GB ; name = Tesla P100-PCIE-16GB                    
2021-08-02 18:47:23 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-08-02 18:47:23 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2021-08-02 18:47:23 | INFO | fairseq_cli.train | max tokens per GPU = 1024 and batch size per GPU = None
2021-08-02 18:47:23 | INFO | fairseq.trainer | Preparing to load checkpoint /content/bart.large/model.pt
tcmalloc: large alloc 1625169920 bytes == 0x561bcc7ea000 @  0x7f068e361b6b 0x7f068e381379 0x7f05c59c026e 0x7f05c59c19e2 0x7f05c7beecb9 0x7f066a6f7759 0x561af2daea65 0x561af2d6f7b2 0x561af2de2e65 0x561af2dde235 0x561af2d7034b 0x561af2d6fe59 0x561af2eb725d 0x561af2e26c3b 0x561af2d6ef01 0x561af2e60c0d 0x561af2de30d8 0x561af2dde235 0x561af2cafe2c 0x561af2de0318 0x561af2dddc35 0x561af2d7073a 0x561af2ddf93b 0x561af2dddc35 0x561af2d7073a 0x561af2ddf93b 0x561af2dddc35 0x561af2d7073a 0x561af2ddf93b 0x561af2dddc35 0x561af2d7073a
tcmalloc: large alloc 1625169920 bytes == 0x561c2ddcc000 @  0x7f068e361b6b 0x7f068e381379 0x7f05c59c026e 0x7f05c59c19e2 0x7f05c7beecb9 0x7f066a6f7759 0x561af2daea65 0x561af2d6f7b2 0x561af2de2e65 0x561af2dde235 0x561af2d7034b 0x561af2d6fe59 0x561af2eb725d 0x561af2e26c3b 0x561af2d6ef01 0x561af2e60c0d 0x561af2de30d8 0x561af2dde235 0x561af2cafe2c 0x561af2de0318 0x561af2dddc35 0x561af2d7073a 0x561af2ddf93b 0x561af2dddc35 0x561af2d7073a 0x561af2ddf93b 0x561af2dddc35 0x561af2d7073a 0x561af2ddf93b 0x561af2dddc35 0x561af2d7073a
2021-08-02 18:47:29 | INFO | fairseq.trainer | Loaded checkpoint /content/bart.large/model.pt (epoch 41 @ 0 updates)
2021-08-02 18:47:29 | INFO | fairseq.trainer | loading train data for epoch 1
2021-08-02 18:47:29 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.source
2021-08-02 18:47:29 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.target
2021-08-02 18:47:29 | INFO | fairseq.tasks.translation | /content/cnn-dailymail/cnn_dm-bin/ train source-target 287227 examples
2021-08-02 18:47:29 | WARNING | fairseq.tasks.fairseq_task | 4 samples have invalid sizes and will be skipped, max_positions=(1024, 1024), first few sample ids=[189447, 112053, 286032, 172051]
2021-08-02 18:47:30 | INFO | fairseq.trainer | begin training epoch 1
/content/R-Drop/fairseq_src/fairseq/utils.py:345: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
  "amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
2021-08-02 18:48:27 | INFO | train_inner | epoch 001:    100 / 253944 loss=14.455, nll_loss=9.638, ppl=796.58, wps=121.6, ups=1.78, wpb=68.4, bsz=1.1, num_updates=100, lr=6e-06, gnorm=232.649, clip=100, train_wall=56, gb_free=4.4, wall=64
2021-08-02 18:49:20 | INFO | train_inner | epoch 001:    200 / 253944 loss=10.226, nll_loss=6.293, ppl=78.39, wps=129.5, ups=1.87, wpb=69.4, bsz=1.1, num_updates=200, lr=1.2e-05, gnorm=35.015, clip=100, train_wall=53, gb_free=6.8, wall=117
2021-08-02 18:50:15 | INFO | train_inner | epoch 001:    300 / 253944 loss=10.418, nll_loss=6.639, ppl=99.69, wps=125.6, ups=1.84, wpb=68.1, bsz=1.1, num_updates=300, lr=1.8e-05, gnorm=31.224, clip=100, train_wall=54, gb_free=3.5, wall=171
2021-08-02 18:51:09 | INFO | train_inner | epoch 001:    400 / 253944 loss=10.046, nll_loss=6.277, ppl=77.55, wps=130.9, ups=1.82, wpb=71.8, bsz=1.2, num_updates=400, lr=2.4e-05, gnorm=30.56, clip=100, train_wall=54, gb_free=3.5, wall=226
2021-08-02 18:52:05 | INFO | train_inner | epoch 001:    500 / 253944 loss=10.067, nll_loss=6.331, ppl=80.51, wps=133.4, ups=1.81, wpb=73.6, bsz=1.2, num_updates=500, lr=3e-05, gnorm=26.353, clip=100, train_wall=55, gb_free=3.9, wall=281
2021-08-02 18:53:00 | INFO | train_inner | epoch 001:    600 / 253944 loss=10.086, nll_loss=6.371, ppl=82.79, wps=127.6, ups=1.81, wpb=70.6, bsz=1.1, num_updates=600, lr=2.98462e-05, gnorm=28.198, clip=100, train_wall=55, gb_free=7, wall=337
2021-08-02 18:53:56 | INFO | train_inner | epoch 001:    700 / 253944 loss=10.208, nll_loss=6.533, ppl=92.61, wps=128.5, ups=1.79, wpb=71.9, bsz=1.1, num_updates=700, lr=2.96923e-05, gnorm=24.811, clip=100, train_wall=55, gb_free=3.5, wall=393
2021-08-02 18:54:21 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 15.90 GiB total capacity; 14.76 GiB already allocated; 55.75 MiB free; 14.88 GiB reserved in total by PyTorch)
2021-08-02 18:54:21 | WARNING | fairseq.trainer | |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 1            |        cudaMalloc retries: 3         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |   15046 MB |   15114 MB |   14505 GB |   14490 GB |
|       from large pool |   15015 MB |   15083 MB |   14253 GB |   14239 GB |
|       from small pool |      31 MB |      31 MB |     251 GB |     251 GB |
|---------------------------------------------------------------------------|
| Active memory         |   15046 MB |   15114 MB |   14505 GB |   14490 GB |
|       from large pool |   15015 MB |   15083 MB |   14253 GB |   14239 GB |
|       from small pool |      31 MB |      31 MB |     251 GB |     251 GB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   15236 MB |   15236 MB |   21906 MB |    6670 MB |
|       from large pool |   15194 MB |   15194 MB |   21516 MB |    6322 MB |
|       from small pool |      42 MB |     208 MB |     390 MB |     348 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |  123909 KB |    4558 MB |   10331 GB |   10331 GB |
|       from large pool |  112863 KB |    4522 MB |   10040 GB |   10040 GB |
|       from small pool |   11046 KB |      36 MB |     290 GB |     290 GB |
|---------------------------------------------------------------------------|
| Allocations           |    2217    |    2218    |    3189 K  |    3187 K  |
|       from large pool |    1081    |    1082    |    1074 K  |    1073 K  |
|       from small pool |    1136    |    1237    |    2115 K  |    2114 K  |
|---------------------------------------------------------------------------|
| Active allocs         |    2217    |    2218    |    3189 K  |    3187 K  |
|       from large pool |    1081    |    1082    |    1074 K  |    1073 K  |
|       from small pool |    1136    |    1237    |    2115 K  |    2114 K  |
|---------------------------------------------------------------------------|
| GPU reserved segments |     420    |     501    |     734    |     314    |
|       from large pool |     399    |     399    |     539    |     140    |
|       from small pool |      21    |     104    |     195    |     174    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      80    |     184    |    1469 K  |    1469 K  |
|       from large pool |      56    |     116    |     666 K  |     666 K  |
|       from small pool |      24    |      68    |     802 K  |     802 K  |
|===========================================================================|

2021-08-02 18:54:21 | WARNING | fairseq.trainer | attempting to recover from OOM in forward/backward pass
2021-08-02 18:54:52 | INFO | train_inner | epoch 001:    801 / 253944 loss=10.156, nll_loss=6.487, ppl=89.73, wps=126, ups=1.8, wpb=70.2, bsz=1.2, num_updates=800, lr=2.95385e-05, gnorm=3.871, clip=100, train_wall=55, gb_free=6, wall=504
2021-08-02 18:56:33 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 170.00 MiB (GPU 0; 15.90 GiB total capacity; 14.32 GiB already allocated; 59.75 MiB free; 14.88 GiB reserved in total by PyTorch)
2021-08-02 18:56:33 | WARNING | fairseq.trainer | |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 2            |        cudaMalloc retries: 6         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |   14665 MB |   14665 MB |   19211 GB |   19197 GB |
|       from large pool |   14659 MB |   14659 MB |   18879 GB |   18865 GB |
|       from small pool |       6 MB |       6 MB |     332 GB |     332 GB |
|---------------------------------------------------------------------------|
| Active memory         |   14665 MB |   14665 MB |   19211 GB |   19197 GB |
|       from large pool |   14659 MB |   14659 MB |   18879 GB |   18865 GB |
|       from small pool |       6 MB |       6 MB |     332 GB |     332 GB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   15232 MB |   15232 MB |   26976 MB |   11744 MB |
|       from large pool |   15216 MB |   15216 MB |   26420 MB |   11204 MB |
|       from small pool |      16 MB |     208 MB |     556 MB |     540 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |  580356 KB |    1288 MB |   13590 GB |   13590 GB |
|       from large pool |  570355 KB |    1278 MB |   13206 GB |   13206 GB |
|       from small pool |   10001 KB |      43 MB |     383 GB |     383 GB |
|---------------------------------------------------------------------------|
| Allocations           |    2217    |    2218    |    4214 K  |    4212 K  |
|       from large pool |    1118    |    1118    |    1415 K  |    1414 K  |
|       from small pool |    1099    |    1237    |    2798 K  |    2797 K  |
|---------------------------------------------------------------------------|
| Active allocs         |    2217    |    2218    |    4214 K  |    4212 K  |
|       from large pool |    1118    |    1118    |    1415 K  |    1414 K  |
|       from small pool |    1099    |    1237    |    2798 K  |    2797 K  |
|---------------------------------------------------------------------------|
| GPU reserved segments |     415    |     510    |     900    |     485    |
|       from large pool |     407    |     407    |     622    |     215    |
|       from small pool |       8    |     104    |     278    |     270    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |     138    |     159    |    1950 K  |    1950 K  |
|       from large pool |     124    |     127    |     887 K  |     887 K  |
|       from small pool |      14    |      87    |    1063 K  |    1063 K  |
|===========================================================================|

2021-08-02 18:56:33 | WARNING | fairseq.trainer | attempting to recover from OOM in forward/backward pass
2021-08-02 18:56:42 | INFO | train_inner | epoch 001:   1002 / 253944 loss=10.157, nll_loss=6.531, ppl=92.46, wps=111.8, ups=1.82, wpb=61.5, bsz=1.1, num_updates=1000, lr=2.92308e-05, gnorm=24.907, clip=100, train_wall=54, gb_free=4.2, wall=559
2021-08-02 18:57:38 | INFO | train_inner | epoch 001:   1102 / 253944 loss=10.207, nll_loss=6.586, ppl=96.06, wps=122, ups=1.8, wpb=68, bsz=1, num_updates=1100, lr=2.90769e-05, gnorm=24.622, clip=100, train_wall=55, gb_free=7.5, wall=615
2021-08-02 18:58:34 | INFO | train_inner | epoch 001:   1202 / 253944 loss=10.424, nll_loss=6.821, ppl=113.07, wps=127.4, ups=1.79, wpb=71.1, bsz=1.2, num_updates=1200, lr=2.89231e-05, gnorm=23.413, clip=100, train_wall=55, gb_free=5.1, wall=671
2021-08-02 18:59:30 | INFO | train_inner | epoch 001:   1302 / 253944 loss=9.583, nll_loss=5.907, ppl=60.01, wps=140.5, ups=1.78, wpb=79.1, bsz=1.2, num_updates=1300, lr=2.87692e-05, gnorm=21.703, clip=100, train_wall=56, gb_free=3.5, wall=727
dropreg commented 3 years ago

Thanks for your attention. I agree with you that "CUDA error: CUBLAS_STATUS_EXECUTION_FAILED" is most likely a version issue. Our environment is: GPU GeForce RTX 3090 (24 GB), NVIDIA driver version 460.67, CUDA version 11.2, torch version 1.8.1. I am not sure whether it works in other environments.
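To compare against this reference setup, the report from PyTorch's built-in collect_env utility would help (a sketch below; this is standard PyTorch tooling, not part of R-Drop):

```python
# Print driver / CUDA / cuDNN / torch details in one report, equivalent to
# running `python -m torch.utils.collect_env` from the shell.
from torch.utils import collect_env

collect_env.main()
```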