CUDA kernel failed: invalid device function

meltzerpete commented 4 years ago

train/test produce the following error:

Validation sanity check: 0it [00:00, ?it/s]CUDA kernel failed : invalid device function
void furthest_point_sampling_kernel_wrapper(int, int, int, const float*, float*, int*) at L:228 in pointnet2_ops/_ext-src/src/sampling_gpu.cu

I have run training and testing successfully on this machine before and cannot work out why it now fails. I do not think I have changed anything about the environment.

output of nvidia-smi:

Sun Jul  5 13:14:30 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   36C    P0    24W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

import torch
torch.version.cuda
> 10.1
torch.cuda.is_available()
> True

versions ($ conda list):

pointnet2                 3.0.0                     <pip>
pointnet2-ops             3.0.0                     <pip>
python                    3.7.7                hcff3b4d_5  
pytorch-lightning         0.7.6                     <pip>

(I have also tried changing to other versions of pytorch-lightning - 0.7.1/0.84).

Any help greatly appreciated.

erikwijmans commented 4 years ago

Please make sure your nvcc version is also CUDA 10.1. You can check with nvcc --version.
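The comparison suggested here can be scripted. The following is a minimal sketch, not part of the repository: the helper names `parse_nvcc_release`, `versions_match`, and `report` are hypothetical, and it assumes `nvcc` is on `PATH` and that the `nvcc --version` output contains a `release X.Y` field, as in the outputs pasted in this thread.

```python
import re
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Return the 'major.minor' release found in `nvcc --version` text, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def versions_match(nvcc_release: Optional[str], torch_cuda: Optional[str]) -> bool:
    """True only when both versions are known and agree on major.minor."""
    if nvcc_release is None or torch_cuda is None:
        return False
    return nvcc_release.split(".")[:2] == torch_cuda.split(".")[:2]


def report() -> bool:
    """Print both versions and return whether they agree (needs nvcc and PyTorch)."""
    import torch  # imported lazily so the helpers above work without PyTorch

    out = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=True
    ).stdout
    nvcc_release = parse_nvcc_release(out)
    print("nvcc release:      ", nvcc_release)
    print("torch.version.cuda:", torch.version.cuda)
    return versions_match(nvcc_release, torch.version.cuda)
```

Calling `report()` in the same environment that builds pointnet2-ops prints both versions side by side; a `False` result would point at the kind of mismatch described in this comment.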

meltzerpete commented 4 years ago

Thanks for the reply. I have:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

I guess this is the problem?

erikwijmans commented 4 years ago

Yeah, things should work if you install a CUDA 10.0 build of PyTorch or can get the 10.1 compilation tools.

meltzerpete commented 4 years ago

I installed 10.1 compilation tools with $ conda install cudatoolkit-dev -c conda-forge, so I now have

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

I also reinstalled the pointnet2-ops package with

$ pip install --user --force-reinstall --ignore-installed --no-binary :all: pointnet2_ops_lib

but I am still getting the same error.

I have also tried installing the pytorch version with cu100 using $ conda install pytorch torchvision cudatoolkit=10.0 -c pytorch but am getting the same error.

lluma commented 3 years ago

I ran into the same problem. What fixed it for me was reinstalling PyTorch, downgrading to version 1.4 built against CUDA 10.0 (matching the version reported by nvcc -V), and then reinstalling the pointnet2-ops package. After that, the error went away.

Python version: 3.6.12
Pytorch version: 1.4.0
Cuda version: 10.0

conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.0 -c pytorch

I hope this helps anyone who runs into the same issue.

wen-yuan-zhang commented 3 years ago

I tried this method and it works correctly, thanks! I think this bug may come from the cudatoolkit version: cudatoolkit=10.0 seems to work, but cudatoolkit=10.1 does not.

wzjscut commented 6 months ago

I have the same problem. Which version of CUDA, nvcc, or torch should I use?

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

* please refor https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html to for Minimum Required Driver Version for CUDA Minor Version Compatibility
driver version is 535.161.07
sys.version : 3.9.18 (main, Sep 11 2023, 13:41:44) [GCC 11.2.0]
torch version : 2.3.0.dev20231227
installed cuda version : 12.1
CUDA Compute Capability: 8.6
Microarchitecture Name: Ampere (3090, cuda >= 11.1, driver >=455.32)
pytorch compiled for : ['sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90']
torch.cuda.is_available : True
torch.backends.cudnn.enabled : True
torch.cuda.get_device_properties(device) : _CudaDeviceProperties(name='NVIDIA GeForce RTX 3090', major=8, minor=6, total_memory=24237MB, multi_processor_count=82)
SYSTEM CUDA_PATH: None
LD_LIBRARY_PATH: /root/Workspace/hdl_loc/devel/lib:/root/Workspace/ws_livox/devel/lib:/opt/ros/noetic/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
torch.tensor([1.0, 2.0]).cuda() : tensor([1., 2.], device='cuda:0')
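The report above compares the GPU's compute capability (8.6) with the architectures PyTorch was compiled for. "invalid device function" is typically raised when a kernel binary contains no code for the GPU's architecture, so the same comparison may be worth repeating for the architectures the extension was built for. Whether that is the cause in this case is an assumption; nothing in this thread confirms it. Below is a minimal sketch with hypothetical helper names (`sm_tag`, `arch_is_compiled`). It ignores PTX forward compatibility, so a `False` result is only a hint.

```python
from typing import Iterable, Tuple


def sm_tag(capability: Tuple[int, int]) -> str:
    """Turn a (major, minor) compute capability into an 'sm_XY' tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def arch_is_compiled(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """True when the device's sm tag appears in the list of compiled architectures."""
    return sm_tag(capability) in set(arch_list)


# Values copied from the report above (RTX 3090, compute capability 8.6).
pytorch_archs = ['sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90']
print(arch_is_compiled((8, 6), pytorch_archs))

# On a live system the two inputs can be read from PyTorch instead:
#   torch.cuda.get_device_capability(0)  -> (major, minor)
#   torch.cuda.get_arch_list()           -> list of 'sm_XY' strings
```

For the values in this report the check passes for PyTorch itself, which is consistent with `torch.tensor([1.0, 2.0]).cuda()` working. Extensions built through `torch.utils.cpp_extension` take their target architectures from the `TORCH_CUDA_ARCH_LIST` environment variable when it is set, so that is one place to look if the extension's list turns out not to include the device.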

HansRen1024 commented 4 months ago

Here is the best solution: https://github.com/mkt1412/GraspGPT_public/issues/8

erikwijmans / Pointnet2_PyTorch

CUDA kernel failed: invalid device function #121