Error in evaluate with a RoBERTa model - RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

I trained a RoBERTa model in Colab with train_model.py, but when I try to evaluate it, I get the error below.

Command used:

!python -m evaluate --test data/ES-Spanish/es_dev.conll --out_dir . --gpus 1 --encoder_model xlm-roberta-base --model xlmr_ner/lightning_logs/version_0/checkpoints/xlmr_ner_timestamp_1633002757.1910515_final.ckpt --prefix xlmr_ner_results

Log error:

Downloading: 100% 513/513 [00:00<00:00, 423kB/s]
Downloading: 100% 5.07M/5.07M [00:01<00:00, 3.01MB/s]
Downloading: 100% 9.10M/9.10M [00:01<00:00, 4.75MB/s]
2021-10-01 09:19:34 - INFO - reader - Reading file data/ES-Spanish/es_dev.conll
2021-10-01 09:19:40 - INFO - reader - Finished reading 800 instances from file data/ES-Spanish/es_dev.conll
Downloading: 100% 613/613 [00:00<00:00, 571kB/s]
Downloading: 100% 499M/499M [00:13<00:00, 37.3MB/s]
Some weights of RobertaModel were not initialized from the model checkpoint at BSC-TeMU/roberta-base-bne and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Global seed set to 42
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py:679: LightningDeprecationWarning: trainer.test(test_dataloaders) is deprecated in v1.4 and will be removed in v1.6. Use trainer.test(dataloaders) instead.
  "trainer.test(test_dataloaders) is deprecated in v1.4 and will be removed in v1.6."
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing: 0it [00:00, ?it/s]
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [96,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [97,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [98,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [99,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [100,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [101,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [102,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [103,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [104,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [105,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [106,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [107,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [108,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [109,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [110,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [111,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [112,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [113,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [114,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [115,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [116,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [117,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [118,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [119,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [120,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [121,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [122,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [123,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [124,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [125,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [126,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [84,0,0], thread: [127,0,0] Assertion srcIndex < srcSelectDimSize failed.
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/gdrive/My Drive/SemEval/baseline/multiconer-baseline/evaluate.py", line 16, in <module>
    out = trainer.test(model, test_dataloaders=DataLoader(test_data, batch_size=sg.batch_size, collate_fn=model.collate_batch))
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 705, in test
    results = self._run(model)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 922, in _run
    self._dispatch()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
    self.accelerator.start_evaluating(self)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 95, in start_evaluating
    self.training_type_plugin.start_evaluating(trainer)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 165, in start_evaluating
    self._results = trainer.run_stage()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 997, in run_stage
    return self._run_evaluate()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 1083, in _run_evaluate
    eval_loop_results = self._evaluation_loop.run()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 111, in advance
    dataloader_iter, self.current_dataloader_idx, dl_max_batches, self.num_dataloaders
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 111, in advance
    output = self.evaluation_step(batch, batch_idx, dataloader_idx)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 154, in evaluation_step
    output = self.trainer.accelerator.test_step(step_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 226, in test_step
    return self.training_type_plugin.test_step(*step_kwargs.values())
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 181, in test_step
    return self.model.test_step(*args, **kwargs)
  File "/content/gdrive/My Drive/SemEval/baseline/multiconer-baseline/model/ner_model.py", line 139, in test_step
    output = self.perform_forward_step(batch, mode=self.stage)
  File "/content/gdrive/My Drive/SemEval/baseline/multiconer-baseline/model/ner_model.py", line 153, in perform_forward_step
    embedded_text_input = self.encoder(input_ids=tokens, attention_mask=token_mask)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 825, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 515, in forward
    output_attentions,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 400, in forward
    past_key_value=self_attn_past_key_value,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 330, in forward
    output_attentions,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 187, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/linear.py", line 96, in forward
    return F.linear(input, self.weight, self.bias)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1847, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

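A possible lead from the log itself: the repeated "Assertion srcIndex < srcSelectDimSize failed" lines come from an index-select kernel (the kind used for embedding lookups) receiving an index that is not smaller than the table size. CUDA reports errors asynchronously, so the final CUBLAS_STATUS_ALLOC_FAILED may only be a later symptom of that earlier failure. The command passes xlm-roberta-base, while the log reports weights being loaded from BSC-TeMU/roberta-base-bne, so the tokenizer and the encoder may not share a vocabulary. Below is a minimal sketch in plain Python of the check this suggests. The function name `find_out_of_range_ids` and the sample numbers are hypothetical; in practice the ids would come from the tokenizer output and the size from the encoder's embedding matrix.

```python
from typing import Iterable, List, Tuple


def find_out_of_range_ids(
    batch: Iterable[Iterable[int]], num_embeddings: int
) -> List[Tuple[int, int, int]]:
    """Return (row, column, token_id) for every id outside [0, num_embeddings)."""
    bad = []
    for row, token_ids in enumerate(batch):
        for col, token_id in enumerate(token_ids):
            # An embedding table with num_embeddings rows only accepts
            # indices 0 .. num_embeddings - 1.
            if not 0 <= token_id < num_embeddings:
                bad.append((row, col, token_id))
    return bad


if __name__ == "__main__":
    # Hypothetical numbers for illustration only: a 50,000-row embedding
    # table and a batch in which one id is larger than the table.
    toy_batch = [[0, 120, 4051, 2], [0, 180000, 2]]
    for row, col, token_id in find_out_of_range_ids(toy_batch, num_embeddings=50_000):
        print(f"sequence {row}, position {col}: id {token_id} is out of range")
```

If a check like this reports any ids, the tokenizer and the checkpoint's encoder probably do not match. Running the same batch on CPU, or setting the environment variable CUDA_LAUNCH_BLOCKING=1, usually makes PyTorch point at the failing lookup directly.
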
What am I doing wrong?

Thank you.

amzn / multiconer-baseline, issue #2