facebookresearch / pytorch3d

PyTorch3D is FAIR's library of reusable components for deep learning with 3D data
https://pytorch3d.org/

RuntimeError: CUDA error: no kernel image is available for execution on the device #1648

Closed harborsarah closed 11 months ago

harborsarah commented 12 months ago

Hi,

I would like to use the `chamfer_distance()` loss function, and I installed the package via `conda install pytorch3d -c pytorch3d` on Linux. My Python version is 3.8, CUDA 11.1, torch 1.10.1. I also tried downloading the source code of the released version (0.7.2) and running `pip install -e .`. Afterwards, when I run my code, this error appears:

```
File "/home/sfusion/users/huawei/depthestimation/loss.py", line 45, in forward
    loss, _ = chamfer_distance(x=input_points, y=target_points, y_lengths=target_lengths)
File "/home/sfusion/virtualenvs/pytorch38/lib/python3.8/site-packages/pytorch3d-0.7.2/pytorch3d/loss/chamfer.py", line 158, in chamfer_distance
    x_nn = knn_points(x, y, lengths1=x_lengths, lengths2=y_lengths, norm=norm, K=1)
File "/home/sfusion/virtualenvs/pytorch38/lib/python3.8/site-packages/pytorch3d-0.7.2/pytorch3d/ops/knn.py", line 187, in knn_points
    p1_dists, p1_idx = _knn_points.apply(
File "/home/sfusion/virtualenvs/pytorch38/lib/python3.8/site-packages/pytorch3d-0.7.2/pytorch3d/ops/knn.py", line 72, in forward
    idx, dists = _C.knn_points_idx(p1, p2, lengths1, lengths2, norm, K, version)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

Do you have any idea about how to solve this? Thanks a lot for your help.

bottler commented 12 months ago

What type of GPU do you have / what compute capability?

Did you build on the same machine as you are trying to run on? This error can occur if you build for one compute capability (typically, the default for a build is the compute capability of the GPU on the build machine) and then run on a GPU with a different compute capability.
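One way to check for such a mismatch is the sketch below (an assumption-laden diagnostic, not an official PyTorch3D tool: it reports the architectures that the installed PyTorch binary itself was compiled for, which only hints at what the separately built PyTorch3D extension targets):

```python
import torch

# Report each visible GPU's compute capability and the list of
# architectures the installed PyTorch binary was compiled for.
# A device whose sm_XY is missing from the compiled list is a
# strong hint that a similar mismatch affects the extension build.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> sm_{major}{minor}")
    print("PyTorch compiled for:", torch.cuda.get_arch_list())
else:
    print("No CUDA device visible")
```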

harborsarah commented 12 months ago

> What type of GPU do you have / what compute capability?
>
> Did you build on the same machine as you are trying to run on? This error can occur if you build for one compute capability (typically, the default for a build is the compute capability of the GPU on the build machine) and then run on a GPU with a different compute capability.

I am running on an HPC server where there are different types of GPUs. So the problem might be that the code is built on, e.g., an Nvidia A30 but the actual run happens on a Tesla P40?

harborsarah commented 12 months ago

> What type of GPU do you have / what compute capability?
>
> Did you build on the same machine as you are trying to run on? This error can occur if you build for one compute capability (typically, the default for a build is the compute capability of the GPU on the build machine) and then run on a GPU with a different compute capability.

Do you think that if I install the package on a GPU with a smaller compute capability, it can run on a GPU with a bigger compute capability? Or must the GPU type be fixed in this case?

bottler commented 12 months ago

I think there is quite a lot of compatibility in one direction but not the other - something like: you can build for an older GPU and run on a newer one. But for best performance you should do the right build. Any single build can target multiple compute capabilities. When we build our release packages for PyTorch3D we build for a whole range of compute capabilities, and you might as well do the same, at the cost of a slower build.
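That one-directional compatibility can be sketched as a toy check (simplified and illustrative only: per the CUDA model, a cubin built with `code=sm_XY` is binary-compatible with GPUs of the same major architecture and an equal or higher minor version, while PTX embedded via `code=compute_XY` can be JIT-compiled on any GPU of equal or higher capability):

```python
def can_run(build_targets, device_cap):
    """Toy model of CUDA fatbinary compatibility.

    build_targets: list of ("sm" | "compute", (major, minor)) pairs,
    mirroring -gencode=arch=compute_XY,code=sm_XY / code=compute_XY.
    device_cap: (major, minor) of the GPU at runtime.
    """
    dev = device_cap
    for kind, cap in build_targets:
        # cubin (sm_XY): binary-compatible only within the same major arch.
        if kind == "sm" and cap[0] == dev[0] and cap[1] <= dev[1]:
            return True
        # PTX (compute_XY): JIT-compiled forward onto any newer GPU.
        if kind == "compute" and cap <= dev:
            return True
    return False

# Built only for an A30 (sm_80), run on a Tesla P40 (sm_61): fails.
print(can_run([("sm", (8, 0))], (6, 1)))                      # False
# A build that also embeds compute_50 PTX covers both GPUs.
print(can_run([("sm", (8, 0)), ("compute", (5, 0))], (6, 1)))  # True
```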

If you want to do what our releases do, you can set

```shell
export NVCC_FLAGS="-gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_50,code=compute_50"
```

and then rebuild (i.e. delete the build/ directory and try again).
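The flag string follows a regular pattern, so (as an illustrative helper, not part of any build system) it can be generated from a list of capabilities, plus one `compute_XY` entry whose PTX provides forward compatibility with future GPUs:

```python
def gencode_flags(capabilities, ptx_for=None):
    """Build NVCC -gencode flags from capability strings like "8.6".

    Each capability gets a cubin target (code=sm_XY); ptx_for adds a
    PTX target (code=compute_XY) that newer GPUs can JIT-compile.
    """
    flags = []
    for cap in capabilities:
        sm = cap.replace(".", "")
        flags.append(f"-gencode=arch=compute_{sm},code=sm_{sm}")
    if ptx_for is not None:
        sm = ptx_for.replace(".", "")
        flags.append(f"-gencode=arch=compute_{sm},code=compute_{sm}")
    return " ".join(flags)

# Reproduces the release-style flag string shown above.
print(gencode_flags(["3.5", "5.0", "6.0", "7.0", "7.5", "8.0", "8.6"],
                    ptx_for="5.0"))
```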

harborsarah commented 12 months ago

> I think there is quite a lot of compatibility in one direction but not the other - something like: you can build for an older GPU and run on a newer one. But for best performance you should do the right build. Any single build can target multiple compute capabilities. When we build our release packages for PyTorch3D we build for a whole range of compute capabilities, and you might as well do the same, at the cost of a slower build.
>
> If you want to do what our releases do, you can set
>
> ```shell
> export NVCC_FLAGS="-gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_50,code=compute_50"
> ```
>
> and then rebuild (i.e. delete the build/ directory and try again).

I tried building on an older GPU and running the code on a newer GPU, but this error still occurs, so I think this might not be the correct way. I tried your command, but in my case the error `export: Command not found.` appears. Since the HPC environment is still different from plain Linux, this is probably still not a way of solving it. And I think it is also not possible to build the source for different types of GPUs at the same time, right?

bottler commented 12 months ago

You can build for multiple GPUs at the same time. That's what the `NVCC_FLAGS` setting is doing. You can probably also do it with something like `TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0"`.

`export` may be different depending on the shell you are using. You can try putting it all on one line, like

```shell
TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0" pip install -e .
```

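Setting the variable does differ by shell; the `export: Command not found` error reported earlier suggests a csh-family login shell, where `setenv` replaces `export`. A sketch of the common variants (adapt to your environment):

```shell
# POSIX shells (sh, bash, zsh):
export TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0"

# csh/tcsh (where a bare `export` fails with "Command not found"):
#   setenv TORCH_CUDA_ARCH_LIST "6.0 6.1 7.0 7.5 8.0"

# Shell-agnostic alternative: prefix the variable on the command itself.
#   TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0" pip install -e .
```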
harborsarah commented 11 months ago

> You can build for multiple GPUs at the same time. That's what the `NVCC_FLAGS` setting is doing. You can probably also do it with something like `TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0"`.
>
> `export` may be different depending on the shell you are using. You can try putting it all on one line, like
>
> ```shell
> TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0" pip install -e .
> ```

Understood, this is much clearer. The problem is now solved! Thank you so much for your reply.