Luo-Z13 / pointobb

[CVPR2024] PointOBB: Learning Oriented Object Detection via Single Point Supervision
MIT License

RuntimeError during training: CUDA error: invalid configuration argument #23

Closed ShirleySyh closed 1 week ago

ShirleySyh commented 3 weeks ago

Hi, thank you for your excellent work! I ran into an error while running your code. When I train on the DOTAv1.0 dataset with the config file "configs2/pointobb/pointobb_r50_fpn_2x_dota10.py", "RuntimeError: CUDA error: invalid configuration argument" always occurs during the first epoch. I think the problem is related to line 352 of PointOBB/mmdet/models/detectors/utils.py, that is, the box_iou_rotated function in ext_module, but I am not sure. Have you ever run into this error? The details of the error and my virtual environment are attached below. Looking forward to your reply. Thank you! pointobb_out.txt

Luo-Z13 commented 2 weeks ago

box_iou_rotated

Hello. First, check whether your CUDA driver and PyTorch versions are compatible. Then print the input tensors of box_iou_rotated to verify that the input bounding boxes are in the correct format. If both checks pass, I believe it might be a hardware issue.
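
The "print the input tensors" step can be made a little more systematic. The following is a minimal sketch, not code from the PointOBB repository: it assumes the inputs follow the usual rotated-box layout of shape (N, 5) with columns (x_center, y_center, width, height, angle), and the helper name `check_rotated_boxes` is hypothetical. It uses only the standard library and accepts either a plain nested list or any object with a `.tolist()` method, such as a tensor.

```python
import math


def check_rotated_boxes(boxes, name="boxes"):
    """Return a list of human-readable problems found in an (N, 5) box array.

    Assumed layout per row: (x_center, y_center, width, height, angle).
    """
    # Tensors and arrays expose .tolist(); plain lists are used as-is.
    rows = boxes.tolist() if hasattr(boxes, "tolist") else list(boxes)
    problems = []

    if len(rows) == 0:
        problems.append(f"{name}: empty input (0 boxes)")
        return problems

    for i, row in enumerate(rows):
        if not isinstance(row, (list, tuple)):
            problems.append(f"{name}[{i}]: expected a row of 5 values, got a scalar")
            continue
        if len(row) != 5:
            problems.append(f"{name}[{i}]: expected 5 values, got {len(row)}")
            continue
        if not all(math.isfinite(v) for v in row):
            problems.append(f"{name}[{i}]: contains NaN or Inf")
            continue
        _, _, w, h, _ = row
        if w <= 0 or h <= 0:
            problems.append(f"{name}[{i}]: non-positive width/height ({w}, {h})")

    return problems
```

Calling it on both arguments just before the box_iou_rotated call, for example `print(check_rotated_boxes(bboxes1.detach(), "bboxes1"))` (the variable name is illustrative), would show whether an empty or malformed input reaches the kernel. An empty list of problems does not rule out an environment mismatch.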

ShirleySyh commented 2 weeks ago

Thank you for your reply! I changed my conda environment to torch 1.13.0 with cuda117, and I have not seen the invalid configuration argument error since. On my machine, every torch version below 1.13.0 hit this problem. With that solved, I now see what looks like a memory leak during training. I use one RTX 3090 with 24 GB and a batch size of 2. The allocated memory keeps increasing during training until a CUDA out of memory error occurs. I am still trying to locate the code that causes the leak. Any suggestions would be a great help. Thank you!
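
One way to narrow down this kind of steadily growing allocation is to sample memory usage at a fixed interval and look at the trend. The sketch below is an assumption-laden illustration, not part of PointOBB: the class name `MemoryTrend` is hypothetical, and the reader callable is injected so the helper itself needs only the standard library. In a PyTorch training loop the reader would typically be `torch.cuda.memory_allocated`.

```python
class MemoryTrend:
    """Sample a memory reading every `every` iterations and estimate its growth."""

    def __init__(self, read_bytes, every=50):
        self.read_bytes = read_bytes  # callable returning current usage in bytes
        self.every = every
        self.samples = []  # list of (iteration, bytes) pairs

    def step(self, iteration):
        # Record a sample only on every `every`-th iteration to keep overhead low.
        if iteration % self.every == 0:
            self.samples.append((iteration, self.read_bytes()))

    def growth_per_iter(self):
        # Average growth in bytes per iteration between first and last sample.
        if len(self.samples) < 2:
            return 0.0
        (i0, b0), (i1, b1) = self.samples[0], self.samples[-1]
        return (b1 - b0) / (i1 - i0)
```

Usage would look like `trend = MemoryTrend(torch.cuda.memory_allocated)` followed by `trend.step(i)` inside the loop. A growth rate that stays clearly positive after the first few hundred iterations suggests that something keeps references to tensors alive. A frequent cause in general PyTorch code is accumulating a loss tensor without calling `.item()`, though nothing in this thread confirms that this is the cause here.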

Luo-Z13 commented 2 weeks ago

Hi, you can refer to https://github.com/Luo-Z13/pointobb/blob/main/environment.yml to check your environment.

ShirleySyh commented 1 week ago

Hi, thank you for your reply. I finally solved the problem. It turned out that the CUDA path in my system environment pointed to a previously downloaded local CUDA installation, version 11.3, which was incompatible with the PyTorch build I had installed. After switching to the correct CUDA version, the code runs successfully. Thanks again for your excellent work!
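
A mismatch like this can be spotted by comparing the CUDA toolkit found on the system path with the CUDA version the installed PyTorch was built against. The following is a rough sketch under stated assumptions: the function names `parse_nvcc_release` and `toolkit_report` are hypothetical, the script assumes `nvcc --version` prints a line containing "release X.Y", and it reads `torch.version.cuda` only if PyTorch can be imported.

```python
import os
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract 'X.Y' from nvcc --version output, or None if not found."""
    m = re.search(r"release\s+(\d+)\.(\d+)", text)
    return f"{m.group(1)}.{m.group(2)}" if m else None


def toolkit_report():
    """Collect the CUDA-related settings that a compiled extension would see."""
    report = {
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
        "CUDA_PATH": os.environ.get("CUDA_PATH"),
        "nvcc": shutil.which("nvcc"),  # first nvcc on PATH, if any
        "nvcc_release": None,
        "torch_cuda": None,
    }
    if report["nvcc"]:
        try:
            out = subprocess.run(
                [report["nvcc"], "--version"],
                capture_output=True, text=True, timeout=10,
            ).stdout
            report["nvcc_release"] = parse_nvcc_release(out)
        except (OSError, subprocess.SubprocessError):
            pass  # nvcc exists but could not be run; leave release as None
    try:
        import torch
        report["torch_cuda"] = torch.version.cuda  # CUDA version torch was built with
    except Exception:
        pass  # torch is missing or failed to import
    return report


if __name__ == "__main__":
    for key, value in toolkit_report().items():
        print(f"{key}: {value}")
```

If `nvcc_release` and `torch_cuda` disagree (for example 11.3 versus 11.7), extensions compiled against the toolkit on the path may not match the PyTorch runtime, which is consistent with the situation described above.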