forrestdavis / NLPScholar

Tools for training an NLP Scholar
GNU General Public License v3.0

[BUG/ERROR] Running bert-base-uncased to train on MNLI gives CUDA error #5

Closed: l1gh7vel closed this issue 2 weeks ago

l1gh7vel commented 2 weeks ago

Describe the bug

When training the pretrained bert-base-uncased model on the nyu-mll/mnli train set with NLPScholar in train mode, we get a RuntimeError: CUDA error (copied under Observed behavior below). The config file used to produce this error is linked below.

To Reproduce

Steps to reproduce the behavior:

  1. Activate the conda environment containing the libraries needed to train a model with NLPScholar.
  2. Run main.py from NLPScholar with the config file linked below (converted to .txt because GitHub doesn't accept YAML attachments).

config file:

config_23.txt

Expected behavior

Expected the model to finish training and be saved at the intended location.

Observed behavior

Got a CUDA error, shown in the screenshot and copied below. The model was saved at the intended location, but I am unsure whether it was trained as expected given the error.

Error text:

Traceback (most recent call last):
  File "/home/akhan/courses/cs426_NLP/NLPScholar/main.py", line 30, in <module>
    exp.train()
  File "/home/akhan/courses/cs426_NLP/NLPScholar/src/trainers/HFTextClassificationTrainer.py", line 118, in train
    trainer.train()
  File "/home/akhan/anaconda3/envs/nlp/lib/python3.10/site-packages/transformers/trainer.py", line 1948, in train
    return inner_training_loop(
  File "/home/akhan/anaconda3/envs/nlp/lib/python3.10/site-packages/transformers/trainer.py", line 2289, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/akhan/anaconda3/envs/nlp/lib/python3.10/site-packages/transformers/trainer.py", line 3359, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/home/akhan/anaconda3/envs/nlp/lib/python3.10/site-packages/accelerate/accelerator.py", line 2149, in backward
    loss = loss / self.gradient_accumulation_steps
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
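For reference, here is a minimal debugging sketch (plain PyTorch, not NLPScholar-specific) that follows the hint in the error text: setting CUDA_LAUNCH_BLOCKING before CUDA is initialised makes the traceback point at the failing call, and rerunning the failing step on CPU usually turns the device-side assert into a readable exception.

```python
import os

# Debugging sketch (plain PyTorch; not NLPScholar-specific code).
# CUDA kernels launch asynchronously, so the Python traceback can point at the
# wrong line. Setting this BEFORE torch initialises CUDA makes launches
# synchronous, so the assert is reported at the call that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the env var so it takes effect

# Alternative: run the failing step on CPU. Device-side asserts (e.g. an
# out-of-range class index inside a loss kernel) usually surface on CPU as an
# ordinary, readable Python exception instead of an opaque CUDA assert.
device = torch.device("cpu")
```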

Screenshots

The following screenshot shows the error in its full glory: [image: screenshot of the traceback above]

Setup

Additional context

I have already reproduced the error on a different device with the same config file. I have also tried varying how much data is used via the samplePercent parameter in the config, to no avail: even as little as 0.01% of the train data (0.01% of 393k rows) gives the same error, so it probably isn't related to dataset size. Please let me know if you need additional information on how to reproduce this error.
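For completeness, one size-independent check is to inspect the label space of the training split directly. A hedged sketch with the plain datasets library; nyu-mll/multi_nli is my stand-in for the exact dataset id used in the config:

```python
from datasets import load_dataset

# Sanity-check sketch: the label space of the training split, independent of
# how much data is sampled. "nyu-mll/multi_nli" is an assumed stand-in for the
# MNLI dataset id used in the config.
mnli = load_dataset("nyu-mll/multi_nli", split="train")

print(mnli.features["label"])             # a 3-way ClassLabel: entailment / neutral / contradiction
print(sorted(set(mnli[:1000]["label"])))  # [0, 1, 2] -> the classifier head needs at least 3 outputs
```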

forrestdavis commented 2 weeks ago

Thank you for posting this. There are two issues. I've attached an updated config below. Please reopen the issue if you run into further problems.

  1. You should not be loading the entire pretrained model here. The pretrained True/False setting picks out the entire model, including the classification head. We are adding a classifier on top of BERT, so you need to set it to False so that a new classifier head is added. If you don't, the existing classifier is used, which is binary, while MNLI has three labels (see the sketch after this list).
  2. In doing so, you'll also find that the numLabel flag should be numLabels.
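In case the mechanics help, here is roughly what the fix corresponds to in plain transformers. This is a sketch under my own assumptions about how the model gets built, not NLPScholar's actual code: MNLI has three labels, so the classification head needs three outputs, and with a two-way head, label 2 is out of range for the loss, which on GPU shows up as exactly the device-side assert above.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification

# Sketch only: my assumption about what the fixed config corresponds to in
# plain transformers, not NLPScholar's actual code. A fresh 3-way classification
# head is added on top of the pretrained BERT encoder:
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)

# Why a 2-way head breaks on MNLI: label 2 (contradiction) is out of range.
logits_from_binary_head = torch.randn(4, 2)  # head sized for 2 labels
mnli_labels = torch.tensor([0, 1, 2, 1])     # MNLI uses labels 0, 1, 2
loss_fn = nn.CrossEntropyLoss()
# On CPU this raises a readable IndexError ("Target 2 is out of bounds");
# on GPU the same mistake surfaces as "CUDA error: device-side assert triggered".
# loss_fn(logits_from_binary_head, mnli_labels)  # uncomment to see it fail
```

The attached config below makes the equivalent change on the NLPScholar side through the pretrained and numLabels settings.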

config_23.txt

l1gh7vel commented 2 weeks ago

Thank you so much for the prompt fix!