CUDA_ERROR_NO_DEVICE "no CUDA-capable device is detected"

EricLBuehler commented 5 months ago

Hello all,

Thanks for your great work here! When I run using cudarc, I get the error:

called `Result::unwrap()` on an `Err` value: Cuda(Cuda(DriverError(CUDA_ERROR_NO_DEVICE, "no CUDA-capable device is detected")))

Here is my system information:

$ nvidia-smi
Tue Jun 11 23:53:28 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.72                 Driver Version: 536.45       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro M2000M                  On  | 00000000:01:00.0 Off |                  N/A |
| N/A    0C    P8              N/A / 200W |      0MiB /  4096MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        33      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0

$ nvidia-smi --query-gpu=compute_cap --format=csv
compute_cap
5.0

$ echo $CUDA_VISIBLE_DEVICES
0

I would appreciate any help!
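
The output above reports two different CUDA versions: nvidia-smi shows "CUDA Version: 12.2" (normally the highest CUDA version the installed driver supports), while nvcc reports toolkit release 12.5. Below is a minimal sketch, using only the Rust standard library, that pulls both numbers out of pasted tool output and flags the gap. The helper name `version_after` and the sample strings are illustrative assumptions, not part of cudarc, and a mismatch is only a hint worth checking, not a confirmed cause of the error.

```rust
/// Parse the "major.minor" version that follows `marker` in `text`.
/// Hypothetical helper for illustration; not a cudarc API.
fn version_after(text: &str, marker: &str) -> Option<(u32, u32)> {
    // Slice out everything after the marker.
    let rest = &text[text.find(marker)? + marker.len()..];
    // Keep the leading run of digits and dots, e.g. "12.5" from " 12.5, V12.5.40".
    let token: String = rest
        .trim_start()
        .chars()
        .take_while(|c| c.is_ascii_digit() || *c == '.')
        .collect();
    let mut parts = token.split('.');
    let major: u32 = parts.next()?.parse().ok()?;
    let minor: u32 = parts.next()?.parse().ok()?;
    Some((major, minor))
}

fn main() {
    // Sample lines shaped like the nvidia-smi and nvcc output in this post.
    let smi = "| NVIDIA-SMI 535.72   Driver Version: 536.45   CUDA Version: 12.2 |";
    let nvcc = "Cuda compilation tools, release 12.5, V12.5.40";

    let driver = version_after(smi, "CUDA Version:");
    let toolkit = version_after(nvcc, "release");
    println!("driver-reported CUDA version: {:?}", driver);
    println!("toolkit release:              {:?}", toolkit);

    // Tuples compare lexicographically, so (12, 5) > (12, 2).
    if let (Some(d), Some(t)) = (driver, toolkit) {
        if t > d {
            println!("toolkit is newer than the driver-reported CUDA version");
        }
    }
}
```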

coreylowman commented 5 months ago

Is PyTorch able to see the GPU? Also, which CUDA toolkit version is cudarc targeting? (If you are using cuda-version-from-build-system, is it being compiled on this machine?)

coreylowman commented 3 months ago

@EricLBuehler any more information on this issue? Will close in a week if not

EricLBuehler commented 3 months ago

@coreylowman sorry for not getting back! I am running this on my GPU and PyTorch can see it (torch.cuda.is_available() == True).

coreylowman commented 3 months ago

@EricLBuehler do you see any difference between the dynamic loading and dynamic linking features of cudarc? Also curious which toolkit version you are targeting in the cudarc features.

EricLBuehler commented 3 months ago

I am using cuda-version-from-build-system and dynamic-linking. How should I try dynamic loading?

coreylowman commented 3 months ago

If you don't enable the dynamic-linking feature it will use dynamic loading.

🤔 Could you try targeting 12.2 (cuda-12020) instead of version from build system? Just curious if that would change anything.
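
For readers unsure how 12.2 becomes cuda-12020: the two feature names in this thread (cuda-12020 for 12.2, cuda-12050 for 12.5) match the usual CUDA version encoding of major * 1000 + minor * 10. Here is a small sketch of that mapping, with `cuda_feature` as a hypothetical helper name. Whether a given feature actually exists depends on the cudarc release, so check the crate's Cargo.toml before relying on a generated name.

```rust
/// Build a cudarc-style version feature name from a CUDA toolkit version.
/// Assumes the pattern seen in this thread: 12.2 -> "cuda-12020".
fn cuda_feature(major: u32, minor: u32) -> String {
    format!("cuda-{}", major * 1000 + minor * 10)
}

fn main() {
    // Version reported by nvidia-smi in the original post.
    println!("{}", cuda_feature(12, 2)); // prints "cuda-12020"
    // Version reported by nvcc in the original post.
    println!("{}", cuda_feature(12, 5)); // prints "cuda-12050"
}
```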

EricLBuehler commented 3 months ago

Hmm yeah, same error. Current:

cudarc = { version = "0.11.5", features = ["std", "cublas", "cublaslt", "curand", "driver", "nvrtc", "f16", "cuda-12020"], default-features=false }

coreylowman commented 3 months ago

I got nothing off the top of my head. Do you get this error if you git clone cudarc and try to run the unit tests?

cargo test --tests --no-default-features -F std,cuda-12050,driver

Is this running inside a docker container?

If the unit tests fail too, I'd probably drop down to the C++ level and check whether a simple example that links against CUDA finds the GPU. If that C++ example also fails, that at least tells us PyTorch is doing something special that we need to copy.
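
Before going down to C++, a quick look at the process environment can rule out the simpler causes raised in this thread. The sketch below uses only the Rust standard library and does not call cudarc or the CUDA driver. The library paths, the /.dockerenv check, and the /dev/nvidia0 check are assumptions about common Linux setups, and both function names are hypothetical.

```rust
use std::env;
use std::path::Path;

/// Split a CUDA_VISIBLE_DEVICES value into its entries.
/// An empty result means the variable hides every device.
fn parse_visible_devices(value: &str) -> Vec<String> {
    value
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect()
}

/// Return the first candidate path that exists on this machine.
fn first_existing<'a>(candidates: &[&'a str]) -> Option<&'a str> {
    candidates.iter().copied().find(|p| Path::new(p).exists())
}

fn main() {
    match env::var("CUDA_VISIBLE_DEVICES") {
        Ok(v) => println!("CUDA_VISIBLE_DEVICES = {:?}", parse_visible_devices(&v)),
        Err(_) => println!("CUDA_VISIBLE_DEVICES is not set"),
    }

    // Assumed locations of the driver library; adjust for your distribution.
    let libcuda_candidates = [
        "/usr/lib/x86_64-linux-gnu/libcuda.so.1",
        "/usr/lib64/libcuda.so.1",
        "/usr/lib/wsl/lib/libcuda.so.1", // WSL2 installs, if applicable
    ];
    match first_existing(&libcuda_candidates) {
        Some(p) => println!("found driver library at {p}"),
        None => println!("no libcuda.so.1 found in the assumed locations"),
    }

    // Heuristics only: Docker usually creates /.dockerenv inside containers,
    // and a native Linux driver usually exposes /dev/nvidia0.
    println!("/.dockerenv present:  {}", Path::new("/.dockerenv").exists());
    println!("/dev/nvidia0 present: {}", Path::new("/dev/nvidia0").exists());
}
```

If the library is found but the device node is missing, or the program is running in a container without GPU passthrough, that narrows the search before comparing against what PyTorch loads.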

jianshu93 commented 3 months ago

Hi both, I also have a similar error:

DriverError(CUDA_ERROR_INVALID_PTX, "a PTX JIT compilation failed")
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
Aborted

[jzhao399@atl1-1-02-018-25-0 release]$ which nvidia-smi
/usr/bin/nvidia-smi

[jzhao399@atl1-1-02-018-25-0 release]$ nvidia-smi
Wed Jul 17 11:25:54 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:C1:00.0 Off |                    0 |
| N/A   34C    P0             43W /  250W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

With PyTorch this can be solved, but I'm not sure how to solve it here.

Thanks,

Jianshu

coreylowman / cudarc

CUDA_ERROR_NO_DEVICE "no CUDA-capable device is detected" #253