Describe the bug
When running the nyu-mll/mnli train set with NLPScholar in train mode using the bert-base-uncased pretrained model, we get a RuntimeError: CUDA error (copied under Observed behavior below). The config file used to produce this error is also linked.
To Reproduce
Steps to reproduce the behavior:
Activate the conda environment containing the libraries needed to train a model with NLPScholar.
Run main.py from NLPScholar with the config file linked below (converted to .txt because GitHub doesn't accept .yaml):
config file: config_23.txt
Expected behavior
Expected the model to finish training and be saved at the intended location.
Observed behavior
Got a CUDA error, shown in the screenshot and copied below. The model was saved at the intended location, but I am unsure whether it trained as expected given the error.
Error text:
Traceback (most recent call last):
File "/home/akhan/courses/cs426_NLP/NLPScholar/main.py", line 30, in <module>
exp.train()
File "/home/akhan/courses/cs426_NLP/NLPScholar/src/trainers/HFTextClassificationTrainer.py", line 118, in train
trainer.train()
File "/home/akhan/anaconda3/envs/nlp/lib/python3.10/site-packages/transformers/trainer.py", line 1948, in train
return inner_training_loop(
File "/home/akhan/anaconda3/envs/nlp/lib/python3.10/site-packages/transformers/trainer.py", line 2289, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/akhan/anaconda3/envs/nlp/lib/python3.10/site-packages/transformers/trainer.py", line 3359, in training_step
self.accelerator.backward(loss, **kwargs)
File "/home/akhan/anaconda3/envs/nlp/lib/python3.10/site-packages/accelerate/accelerator.py", line 2149, in backward
loss = loss / self.gradient_accumulation_steps
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
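Following the hint in the error message, prefixing the run with CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the stack trace points at the op that actually failed instead of a later, unrelated API call. A sketch of the mechanism (the stand-in command below just echoes the variable; the real main.py invocation is assumed from this report):

```shell
# CUDA_LAUNCH_BLOCKING=1 forces synchronous kernel launches, so errors are
# raised at the call that caused them rather than at a later API call.
# Shown here with a stand-in command that just prints the variable:
CUDA_LAUNCH_BLOCKING=1 python3 -c 'import os; print(os.environ["CUDA_LAUNCH_BLOCKING"])'
# The actual run would look like this (invocation assumed, not verified):
# CUDA_LAUNCH_BLOCKING=1 python main.py <config>
```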
Screenshots
Following shows the error in its full glory:
Setup
Used the Colgate Turing server with the nlp conda environment (from the NLP course).
Additional context
Have already reproduced the error on a different device running the same config file.
Have tried varying the percentage of data used via the samplePercent parameter in the config, to no avail. Even as little as 0.01% of the train data (0.01% of ~393k rows) gave the same error, so it probably isn't related to the dataset size.
Please let me know if you need additional information on how to reproduce this error.
Thank you for posting this. There are two issues; I've attached an updated config below. Please reopen the issue if you run into further problems.
1. You are not loading a pretrained model. The pretrained true/false flag here picks out the entire model; since we are adding a classifier on top of BERT, you need to set it to False so that a new classifier is added. If you don't, the existing classifier is used, which is binary.
2. In doing this, you'll find that the numLabel flag should be numLabels.
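To see why the binary head triggers the device-side assert, here is a minimal sketch in plain Python (not NLPScholar or PyTorch code; the function and values are made up for illustration). MNLI has three labels, so a gold label of 2 indexes past the end of a 2-logit classifier head; PyTorch reports this as an IndexError on CPU, but on GPU it surfaces only as the opaque "CUDA error: device-side assert triggered" seen above.

```python
import math

def cross_entropy(logits, gold_label):
    """Softmax cross-entropy for a single example (illustrative only)."""
    exps = [math.exp(x) for x in logits]
    # Indexing the logits by the gold label fails if the label is out of range.
    return -math.log(exps[gold_label] / sum(exps))

binary_head = [0.3, -1.2]          # 2 logits: the reused pretrained binary classifier
three_way_head = [0.3, -1.2, 0.8]  # 3 logits: what a fresh head with numLabels 3 gives

print(cross_entropy(three_way_head, 2))  # fine: MNLI label 2 is in range
try:
    cross_entropy(binary_head, 2)        # MNLI label 2, but only 2 classes
except IndexError:
    print("label 2 is out of range for a binary head")
```

With pretrained set to False and numLabels set to 3, the model gets a fresh 3-way head, so every MNLI gold label is in range and the assert disappears.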