OpenDriveLab / UniAD

[CVPR 2023 Best Paper Award] Planning-oriented Autonomous Driving
Apache License 2.0

RuntimeError: cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR, when calling `cusolverDnCreate(handle)` #54

Open 86kkd opened 1 year ago

86kkd commented 1 year ago

```
Exception has occurred: RuntimeError
cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR, when calling cusolverDnCreate(handle)
  File "/workspaces/UniAD-main/projects/mmdet3d_plugin/uniad/detectors/uniad_track.py", line 270, in velo_update
    g2l_r = torch.linalg.inv(l2g_r2).type(torch.float)
  File "/workspaces/UniAD-main/projects/mmdet3d_plugin/uniad/detectors/uniad_track.py", line 643, in _forward_single_frame_inference
    ref_pts = self.velo_update(
  File "/workspaces/UniAD-main/projects/mmdet3d_plugin/uniad/detectors/uniad_track.py", line 748, in simple_test_track
    frame_res = self._forward_single_frame_inference(
  File "/workspaces/UniAD-main/projects/mmdet3d_plugin/uniad/detectors/uniad_e2e.py", line 292, in forward_test
    result_track = self.simple_test_track(img, l2g_t, l2g_r_mat, img_metas, timestamp)
  File "/workspaces/UniAD-main/projects/mmdet3d_plugin/uniad/detectors/uniad_e2e.py", line 83, in forward
    return self.forward_test(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 799, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspaces/UniAD-main/projects/mmdet3d_plugin/uniad/apis/test.py", line 90, in custom_multi_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/workspaces/UniAD-main/tools/test.py", line 231, in main
    outputs = custom_multi_gpu_test(model, data_loader, args.tmpdir,
  File "/workspaces/UniAD-main/tools/test.py", line 261, in <module>
    main()
RuntimeError: cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR, when calling cusolverDnCreate(handle)
```

The detection results of this project look amazing, so I wanted to reproduce them myself. I set up the project environment with Docker. Once the project was running, this error occurred while iterating over the dataloader and running inference with the model. The strange part is that before the error occurred, the code had already finished inference on the first sample from the dataloader and produced results, but the error appeared when iterating to the second sample. I have been stuck on this issue for a long time. If anyone could give me some pointers, I would be immensely grateful.
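For reference, a minimal sketch (not from the original report) that isolates the failing call: on the affected RTX 40-series GPU/driver/CUDA combinations, invoking `torch.linalg.inv` on a CUDA tensor may already trigger the same cuSOLVER error, which can help rule out the rest of the UniAD pipeline. The shapes and loop below are illustrative assumptions.

```python
import torch

# Illustrative assumption: a small batch of 3x3 rotation-like matrices,
# similar in spirit to the l2g_r2 tensor inverted in velo_update().
mats = torch.eye(3).repeat(4, 1, 1).cuda()

for step in range(2):
    # If the cuSOLVER handle cannot be created on this GPU/driver/CUDA combo,
    # this raises: RuntimeError: cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR
    inv = torch.linalg.inv(mats)
    print(f"step {step}: inverse computed on {inv.device}")
```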

86kkd commented 1 year ago

I have read the previous issue. Interestingly, the error only occurs on the 4090 and 4080 (mine), but I have no idea where the bug comes from, since my machine shows that GPU memory, shared memory, and CPU memory are all available.

YTEP-ZHI commented 1 year ago

> I have read the previous issue. Interestingly, the error only occurs on the 4090 and 4080 (mine), but I have no idea where the bug comes from, since my machine shows that GPU memory, shared memory, and CPU memory are all available.

@86kkd Yes, it seems to be a common issue on those devices, but it never occurs on V100 and A100. It would be greatly appreciated if anyone in the community could help with this.

DocAllen commented 1 year ago

torch.linalg.inv() is commonly used in many projects; it seems this calculation should be handed off to the CPU. I also use a single RTX 4090 and hit the same error. Solution: in projects/mmdet3d_plugin/uniad/detectors/uniad_track.py, line 270, change `g2l_r = torch.linalg.inv(l2g_r2).type(torch.float)` to `g2l_r = torch.linalg.inv(l2g_r2.cpu()).type(torch.float).cuda(0)`. For multiple GPUs, I guess you would first get the GPU index of the tensor l2g_r2 and then do: `g2l_r = torch.linalg.inv(l2g_r2.cpu()).type(torch.float).cuda()`. BTW, thanks to the authors for their open-source spirit.
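A minimal, device-aware sketch of this workaround (the helper name `inv_on_cpu` and its placement are my own, not part of the UniAD code): compute the inverse on the CPU and move the result back to whatever device the input tensor lives on, so it also works when the model is not on cuda:0.

```python
import torch

def inv_on_cpu(mat: torch.Tensor) -> torch.Tensor:
    """Invert `mat` on the CPU and return the result on mat's original device.

    Works around cuSOLVER failures (CUSOLVER_STATUS_INTERNAL_ERROR) observed on
    some RTX 40-series setups when torch.linalg.inv runs on the GPU.
    """
    return torch.linalg.inv(mat.cpu()).to(mat.device)

# Hypothetical usage at uniad_track.py line 270:
# g2l_r = inv_on_cpu(l2g_r2).type(torch.float)
```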

StudentWwg commented 2 months ago

> torch.linalg.inv() is commonly used in many projects; it seems this calculation should be handed off to the CPU. I also use a single RTX 4090 and hit the same error. Solution: in projects/mmdet3d_plugin/uniad/detectors/uniad_track.py, line 270, change `g2l_r = torch.linalg.inv(l2g_r2).type(torch.float)` to `g2l_r = torch.linalg.inv(l2g_r2.cpu()).type(torch.float).cuda(0)`. For multiple GPUs, I guess you would first get the GPU index of the tensor l2g_r2 and then do: `g2l_r = torch.linalg.inv(l2g_r2.cpu()).type(torch.float).cuda()`. BTW, thanks to the authors for their open-source spirit.

This solution works. But I now face a new problem: 'RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!'. So I switched to running the evaluation on a single GPU, and it works.
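That device mismatch is what one would expect when the workaround hard-codes `.cuda(0)` while each distributed worker keeps its tensors on a different GPU. A hedged sketch of a device-preserving variant follows (variable names mirror the snippet above; whether this fully resolves the multi-GPU evaluation is untested):

```python
import torch

# Hypothetical l2g_r2 living on whichever GPU this worker was assigned.
device = "cuda:1" if torch.cuda.device_count() > 1 else "cuda:0"
l2g_r2 = torch.eye(3, device=device)

# Hard-coding the target device breaks workers that are not on cuda:0:
# g2l_r = torch.linalg.inv(l2g_r2.cpu()).type(torch.float).cuda(0)

# Device-preserving variant: move the result back to the input's own device.
g2l_r = torch.linalg.inv(l2g_r2.cpu()).type(torch.float).to(l2g_r2.device)
print(g2l_r.device)  # matches l2g_r2.device on every worker
```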