ThilinaRajapakse / simpletransformers

Transformers for Information Retrieval, Text Classification, NER, QA, Language Modelling, Language Generation, T5, Multi-Modal, and Conversational AI
https://simpletransformers.ai/
Apache License 2.0

RuntimeError: CUDA error: device-side assert triggered while eval #84

avi-jain closed this issue 4 years ago

avi-jain commented 4 years ago

**Describe the bug**
I trained a DistilBERT model for classification. Now when I try to use the model to eval or predict, I get `RuntimeError: CUDA error: device-side assert triggered`.

**To Reproduce**
Steps to reproduce the behavior: train a model using the params below.
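The reported training setup, reformatted as a runnable sketch; the import, the `df_bert_train` DataFrame, and the final `train_model` call are filled in from context:

```python
from simpletransformers.classification import ClassificationModel

# Train a DistilBERT classifier with the reported parameters.
model = ClassificationModel(
    'distilbert',
    'distilbert-base-uncased',
    num_labels=len(df_bert_train['labels'].unique()),
    args={
        'reprocess_input_data': True,
        'overwrite_output_dir': True,
        'max_seq_length': 64,
        'train_batch_size': 16,
        'fp16': False,
        'num_train_epochs': 10,
    },
)
model.train_model(df_bert_train)  # assumed training call
```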


ThilinaRajapakse commented 4 years ago

This is often caused by not setting the `num_labels` parameter to the same value that was used when training the model. Make sure that you use the same value when loading a saved model (see the sketch below).
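For illustration, a minimal sketch of reloading a saved model with a matching `num_labels`; the `outputs/` directory and the `df_bert_train` DataFrame are assumptions based on this thread:

```python
# Hypothetical sketch: reload a saved model with the same num_labels used
# at training time; a mismatch can trigger device-side asserts on the GPU.
from simpletransformers.classification import ClassificationModel

model = ClassificationModel(
    'distilbert',
    'outputs/',  # directory the trained model was saved to (assumed)
    num_labels=len(df_bert_train['labels'].unique()),  # same value as training
)
```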

avi-jain commented 4 years ago

Additional context: I'm running eval in the same context (a Jupyter notebook) in which the model was trained, not loading it again. However, after disabling GPUs and loading the model, eval works.

ThilinaRajapakse commented 4 years ago

Can you show me the line you are executing to perform the evaluation? Also, the terminal where you launched the Jupyter notebook might show additional info about the error.

avi-jain commented 4 years ago

I checked the terminal output. Nothing except for a couple of warnings (`WARNING | WARNING: attempted to send message from fork`). I'll save all my transient variables, load the model after enabling GPUs, and try the eval again. The line is the same as in your example: `result, model_outputs, wrong_predictions = model.eval_model(df_bert_dev, verbose=True)`

Attaching the full stack trace:

```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-26-8eb2d33eb606> in <module>()
----> 1 predictions, raw_outputs = model.predict(["Sentence test"])

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/simpletransformers/classification/classification_model.py in predict(self, to_predict, multi_label)
    563         for batch in tqdm(eval_dataloader, disable=args['silent']):
    564             model.eval()
--> 565             batch = tuple(t.to(device) for t in batch)
    566 
    567             with torch.no_grad():

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/simpletransformers/classification/classification_model.py in <genexpr>(.0)
    563         for batch in tqdm(eval_dataloader, disable=args['silent']):
    564             model.eval()
--> 565             batch = tuple(t.to(device) for t in batch)
    566 
    567             with torch.no_grad():

RuntimeError: CUDA error: device-side assert triggered
```
ThilinaRajapakse commented 4 years ago

Check how much GPU memory is in use and try lowering the `eval_batch_size` in case the evaluation is exceeding the available GPU memory. Rerunning cells inside Jupyter tends to cause memory issues, as it doesn't release memory properly.

Also, it might be better to get the labels from the entire df rather than just the train df, on the off chance that the dev df contains labels that are not in train:

`num_labels=len(df_bert['labels'].unique())`
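For illustration, a minimal sketch combining both suggestions; `df_bert` is assumed to be the full DataFrame from which the train and dev splits were taken, and the `eval_batch_size` value is only an example:

```python
# Hypothetical sketch: derive num_labels from the full DataFrame and lower
# eval_batch_size to ease GPU memory pressure during evaluation.
from simpletransformers.classification import ClassificationModel

model = ClassificationModel(
    'distilbert',
    'distilbert-base-uncased',
    num_labels=len(df_bert['labels'].unique()),  # full df, not just train
    args={'eval_batch_size': 8},  # lower this if evaluation runs out of memory
)
```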
stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ghost commented 4 years ago

@ThilinaRajapakse Hi, just letting you know that I faced the same error (`RuntimeError: CUDA error: device-side assert triggered`) today with the latest version (0.22.0). After trying a lot of things to overcome it, in the end I tried a lower version of simpletransformers (0.20.0) and it worked. I was trying this example: https://towardsdatascience.com/simple-transformers-introducing-the-easiest-bert-roberta-xlnet-and-xlm-library-58bf8c59b2a3

Thanks a lot for the great wrapper :)

ThilinaRajapakse commented 4 years ago

Do you mind running with `use_cuda=False` and showing me the error?

DmLitov4 commented 4 years ago

> @ThilinaRajapakse Hi, just letting you know that I faced the same error (`RuntimeError: CUDA error: device-side assert triggered`) today with the latest version (0.22.0). After trying a lot of things to overcome it, in the end I tried a lower version of simpletransformers (0.20.0) and it worked. I was trying this example: https://towardsdatascience.com/simple-transformers-introducing-the-easiest-bert-roberta-xlnet-and-xlm-library-58bf8c59b2a3
>
> Thanks a lot for the great wrapper :)

I had the same issue. Switching to version 0.20.0 helped me, thanks!

ThilinaRajapakse commented 4 years ago

I can't replicate the issue, guys. Is it only happening with the code from the Medium article?

DmLitov4 commented 4 years ago

> I can't replicate the issue, guys. Is it only happening with the code from the Medium article?

I've used my own dataset (multi-class classification). After downgrading to 0.20.0 (and then to 0.21.4) it works fine on GPU. Sometimes I have to restart the kernel, but otherwise it's OK. But now I have another issue similar to #267: during training and validation the loss is OK, but when I try to predict, I get almost the same probabilities for different texts. I've balanced the dataset, tried different hyperparameters, etc., but nothing helps.

ghost commented 4 years ago

@ThilinaRajapakse I followed the steps in the Medium article, though my dataset was different. I'll try to reproduce it again.

@DmLitov4 I was also running into the same issue of identical probabilities while predicting. I was iterating over the test dataset, and the probability weights and outputs for each data row were the same. But then I passed the test set as a whole to `model.predict` and got different output values. Maybe you are doing the same (see the sketch below). @ThilinaRajapakse can maybe comment on this.
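For illustration, a minimal sketch of the difference described above; `df_bert_test` and its `text` column are assumptions:

```python
# Hypothetical sketch: predict over the whole test set in one call rather
# than looping row by row, which reportedly produced identical outputs.
test_texts = df_bert_test['text'].tolist()

# Whole-list call: one prediction per input text.
predictions, raw_outputs = model.predict(test_texts)

# Row-by-row loop (the pattern that reportedly misbehaved):
# for text in test_texts:
#     prediction, raw_output = model.predict([text])
```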

Thanks.

ThilinaRajapakse commented 4 years ago

The issue with similar predictions was a bug in some of the older versions. I'm guessing the versions you guys downgraded to were afflicted with it as well.

DmLitov4 commented 4 years ago

> The issue with similar predictions was a bug in some of the older versions. I'm guessing the versions you guys downgraded to were afflicted with it as well.

Yes, I've heard about this issue. But I used version 0.21.4 (which doesn't have that problem), and it seems the problem was hiding in the hyperparameters of my model. I changed them a lot, retrained my model, and now it works fine with both XLNet and BERT. Thank you so much for your hard work, it helps us a lot.

ThilinaRajapakse commented 4 years ago

Just to clarify, some of you are still running into issues with the latest versions? :thinking:

You are welcome!

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tyomik-mnemonic commented 3 years ago

Guys, use the CPU (that is, `use_cuda=False`) if you face this situation with the GPU. If the problem is in your data, you will get a more informative exception output and can solve the problem from there. A minimal sketch follows.
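For illustration, a sketch of that debugging approach; the `outputs/` path and the `num_labels` value are placeholders:

```python
# Hypothetical sketch: rerun the failing call on CPU so the real Python
# exception surfaces instead of an opaque CUDA device-side assert.
from simpletransformers.classification import ClassificationModel

model = ClassificationModel(
    'distilbert',
    'outputs/',        # path to the saved model (placeholder)
    num_labels=4,      # must match the value used at training time (placeholder)
    use_cuda=False,    # run on CPU for a readable stack trace
)
predictions, raw_outputs = model.predict(["Sentence test"])
```

Setting the environment variable `CUDA_LAUNCH_BLOCKING=1` before launching Python is another way to make the failing CUDA call report its error synchronously.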