OpenDriveLab / ViDAR

[CVPR 2024 Highlight] Visual Point Cloud Forecasting
https://arxiv.org/abs/2312.17655
Apache License 2.0

RuntimeError: CUDA error: no kernel image is available for execution on the device #28

Open qiuqc1 opened 1 month ago

qiuqc1 commented 1 month ago

Versions: PyTorch 1.10.1, CUDA 11.1, chamferdist 1.0.0

```
Traceback (most recent call last):
  File "./tools/test.py", line 266, in <module>
    main()
  File "./tools/test.py", line 237, in main
    outputs = custom_multi_gpu_test(model, data_loader, args.tmpdir,
  File "/ml-engine/code/bb92172ed69b1dd0c567f677210a74af3015236f/projects/mmdet3d_plugin/bevformer/apis/test.py", line 72, in custom_multi_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/root/Software/anaconda3/envs/py38t19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/Software/anaconda3/envs/py38t19/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/root/Software/anaconda3/envs/py38t19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/ml-engine/code/bb92172ed69b1dd0c567f677210a74af3015236f/projects/mmdet3d_plugin/bevformer/detectors/bevformer.py", line 156, in forward
    return self.forward_test(**kwargs)
  File "/ml-engine/code/bb92172ed69b1dd0c567f677210a74af3015236f/projects/mmdet3d_plugin/bevformer/detectors/vidar.py", line 469, in forward_test
    e2e_predictor_utils.compute_chamfer_distance_inner(
  File "/ml-engine/code/bb92172ed69b1dd0c567f677210a74af3015236f/projects/mmdet3d_plugin/bevformer/utils/e2e_predictor_utils.py", line 183, in compute_chamfer_distance_inner
    return compute_chamfer_distance(inner_pred_pcd, inner_gt_pcd)
  File "/ml-engine/code/bb92172ed69b1dd0c567f677210a74af3015236f/projects/mmdet3d_plugin/bevformer/utils/e2e_predictor_utils.py", line 166, in compute_chamfer_distance
    loss_src, loss_dst, _ = chamfer_distance(
  File "/root/Software/anaconda3/envs/py38t19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/Software/anaconda3/envs/py38t19/lib/python3.8/site-packages/chamferdist/chamfer.py", line 77, in forward
    source_nn = knn_points(
  File "/root/Software/anaconda3/envs/py38t19/lib/python3.8/site-packages/chamferdist/chamfer.py", line 280, in knn_points
    p1_dists, p1_idx = _knn_points.apply(
  File "/root/Software/anaconda3/envs/py38t19/lib/python3.8/site-packages/chamferdist/chamfer.py", line 176, in forward
    idx, dists = _C.knn_points_idx(p1, p2, lengths1, lengths2, K, version)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

Have you ever encountered this problem? I trained with 8 A100s on a server cluster, and the error occurred during the final evaluation. The first 24 epochs ran fine: the training loss decreased normally and the corresponding checkpoint files were written. But when I then tested the epoch-24 .pth file separately, it failed with the error above. Using the same image, I can run the test normally on a 4090. Why does this happen?
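For reference, this error message usually means a CUDA extension was compiled without kernels for the running GPU's compute capability (the A100 is compute capability 8.0, i.e. sm_80); that diagnosis is my assumption here, not something confirmed in the thread. A quick check on the cluster node, using only standard PyTorch APIs:

```python
# Quick check: does the installed PyTorch build ship kernels for this GPU's
# compute capability? CUDA extensions built against it usually target the
# same architectures.
import torch

major, minor = torch.cuda.get_device_capability(0)        # A100 -> (8, 0)
print("device capability:", f"sm_{major}{minor}")
print("built arch list  :", torch.cuda.get_arch_list())    # e.g. ['sm_60', ..., 'sm_80']
print("torch / cuda     :", torch.__version__, torch.version.cuda)
```

If sm_80 is missing from the arch list, or chamferdist was built on a machine with a different GPU, rebuilding the extension with the right architecture (for example, setting `TORCH_CUDA_ARCH_LIST="8.0"` before `pip install .`, the general `torch.utils.cpp_extension` mechanism) would be the usual fix; again, this is a suggestion, not something verified in this issue.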

tomztyang commented 1 month ago

Seems like something is wrong with the chamferdist package. Maybe try installing the chamferdist package following the 4docc instructions?
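Not from this thread, but a minimal standalone test along these lines (standard chamferdist usage) would confirm whether a reinstalled chamferdist actually runs on the A100 node, independent of the full eval pipeline:

```python
# Minimal standalone test of the chamferdist CUDA extension on the current GPU.
# If the package was built without sm_80 kernels, this reproduces the same
# "no kernel image is available" RuntimeError as the full eval run.
import torch
from chamferdist import ChamferDistance

device = torch.device("cuda:0")
source = torch.rand(1, 1024, 3, device=device)
target = torch.rand(1, 1024, 3, device=device)

dist = ChamferDistance()(source, target)
print("chamfer distance OK:", dist.item())
```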

qiuqc1 commented 1 month ago

I reinstalled chamferdist in the Docker environment and checked all the required packages against the link you gave me. There was a conflict on my side: the numpy and setuptools versions were inconsistent with what chamferdist requires. I have now uninstalled the unrelated package, so there should be no more conflicts in the environment. But when I used this newly built image to run the eval code today, I still got the same error along the same path. It does not seem to be an environment problem. Still investigating.
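One more comparison that might help narrow it down (my suggestion, not something raised in the thread): dump PyTorch's environment report on both the cluster node and the 4090 machine and diff the two outputs, since "no kernel image" can also stem from a driver/runtime mismatch rather than the Python packages themselves.

```python
# Print PyTorch's full environment report (torch/CUDA versions, driver, GPU
# model, build settings); run this on both machines and diff the results.
from torch.utils.collect_env import main as collect_env

collect_env()
```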