ELOESZHANG / MPCF--3d_object_detection

We propose a method MPCF.
Apache License 2.0

RuntimeError: CUDA error: device-side assert triggered #2

Open lcy199905 opened 3 weeks ago

lcy199905 commented 3 weeks ago

When I train on an RTX 4090 with torch 2.1 and CUDA 12.2, training runs normally at first, but after a few rounds it fails with the error below. What is the reason for this?

File "/my/notebook_work/lcy/mpcf-my-v1/pcdet/models/roi_heads/mpcf_head.py", line 599, in get_box_cls_layer_loss
    tb_dict = {'rcnn_loss_cls': rcnn_loss_cls.item()}
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

PanGao-1 commented 2 weeks ago

I apologize for the late response. I have only trained on CUDA 11.1, 11.3, and 11.7, so my recommendation is to install an older version of PyTorch, such as 1.10.1 with CUDA 11.3. I also strongly suggest setting the batch size to 1, since larger batch sizes may lead to errors. Setting the number of workers to 0 can also give more stable results, especially if you are aiming for high metrics. For example:

pip install torch==1.10.1+cu113 torchvision==0.11.2+cu113 torchaudio==0.10.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install spconv-cu113

If you wish to uninstall spconv, please also uninstall cumm, for example:

pip uninstall spconv-cu113 cumm-cu113
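
After reinstalling, it can help to confirm which versions actually ended up in the environment. The following is only a minimal sketch: the expected versions are copied from the pip commands above, and the helper names (`installed_version`, `check_environment`) are made up for illustration and are not part of MPCF or OpenPCDet. It assumes the installed wheels report their version with the `+cu113` suffix.

```python
from importlib import metadata

# Expected versions, copied from the pip commands above.
# None means "any version is fine, but the package must be installed".
EXPECTED = {
    "torch": "1.10.1+cu113",
    "torchvision": "0.11.2+cu113",
    "spconv-cu113": None,
}


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected=EXPECTED, lookup=installed_version):
    """Return a list of mismatch messages; an empty list means everything matches."""
    problems = []
    for package, wanted in expected.items():
        found = lookup(package)
        if found is None:
            problems.append(f"{package}: not installed")
        elif wanted is not None and found != wanted:
            problems.append(f"{package}: found {found}, expected {wanted}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment matches"]:
        print(line)
```

If the script reports a mismatch for torch or spconv, uninstalling both spconv and cumm as described above before reinstalling is probably the safest route.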
PanGao-1 commented 2 weeks ago

Your issue seems to be related to the batch size setting and the version of PyTorch. I recommend using a batch size of 1 and a version of PyTorch lower than 1.8. You can also try commenting out the line "tb_dict = {'rcnn_loss_cls': rcnn_loss_cls.item()}". This will not affect training or inference.
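
As the error message itself notes, CUDA errors can be reported asynchronously, so the `.item()` line may only be where the failure surfaces rather than where it originates. A possible way to narrow it down is sketched below. This is an assumption, not a confirmed diagnosis for this issue: one common cause of a device-side assert in a classification loss is an input or target that is NaN or outside [0, 1]. The helper `find_invalid_probabilities` and the tensor name `rcnn_cls_probs` in the usage comment are hypothetical.

```python
import math
import os

# Must be set before CUDA is initialised (simplest: before importing torch),
# so that kernel errors are raised at the call that actually caused them.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def find_invalid_probabilities(values):
    """Return (index, value) pairs that are NaN, infinite, or outside [0, 1]."""
    bad = []
    for i, v in enumerate(values):
        if math.isnan(v) or math.isinf(v) or v < 0.0 or v > 1.0:
            bad.append((i, v))
    return bad


# Hypothetical usage inside the training loop, just before the loss is computed
# (rcnn_cls_probs stands for whatever tensor is passed to the loss):
#   values = rcnn_cls_probs.detach().cpu().flatten().tolist()
#   bad = find_invalid_probabilities(values)
#   if bad:
#       print("invalid classification values:", bad[:10])
```

With `CUDA_LAUNCH_BLOCKING=1` the stack trace should point at the failing kernel call, and the check makes it easier to see whether bad values appear only after several rounds of training, for example because the loss has diverged.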