elixir-nx / xla

Pre-compiled XLA extension
Apache License 2.0

CUDNN_STATUS_INTERNAL_ERROR on 12.2? #56

Closed ityonemo closed 8 months ago

ityonemo commented 8 months ago

I had serious trouble with CUDA 11.8 (EXLA 0.6 fails on this platform), so I upgraded to CUDA 12 and wound up with 12.2:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_02:20:44_PDT_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10                     On  | 00000000:06:00.0 Off |                    0 |
|  0%   30C    P8              17W / 150W |     18MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1699      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

This seems to cause CUDNN_STATUS_INTERNAL_ERROR.

Downgrading to CUDA 11.8 and EXLA 0.5 works, but other libraries (e.g. Bumblebee) fail on EXLA 0.5.

seanmor5 commented 8 months ago

Can you send the logs around that internal error?

josevalim commented 8 months ago

And please include your relevant XLA_TARGETs :)

ityonemo commented 8 months ago

Thanks! Which logs should I send?

XLA_TARGET=cuda120

josevalim commented 8 months ago

Thank you, and what is the CUDNN version?

josevalim commented 8 months ago

You can also try building XLA from source and see if you have better luck.

ityonemo commented 8 months ago

Unfortunately, building XLA from source stopped with "Inconsistent CUDA toolkit path: /usr vs /usr/lib", possibly because I switched from 11.8 to 12.x.

ityonemo commented 8 months ago

I actually can't figure out how to find out which cuDNN version I have directly. Some instructions on how to determine this in the README might be helpful. I'll make a PR. Also, a lot of people don't know this, but nvidia-smi can mislead about the CUDA version: it reports the highest CUDA version the driver supports, not the installed toolkit (the only way to know for sure is nvcc -V).
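For anyone scripting this check, here is a minimal sketch of reading the toolkit version from nvcc -V rather than from nvidia-smi. It only assumes the output contains a line like "Cuda compilation tools, release 12.2, V12.2.128", as in the paste above; the function names parse_nvcc_release and installed_toolkit_version are made up for illustration.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Return the toolkit release (e.g. "12.2") found in `nvcc -V` output, or None."""
    # Assumes a line such as: "Cuda compilation tools, release 12.2, V12.2.128"
    match = re.search(r"release\s+(\d+)\.(\d+)", text)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"


def installed_toolkit_version():
    """Run `nvcc -V` if nvcc is on PATH; return None when nvcc is not available."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "-V"], capture_output=True, text=True, check=False)
    return parse_nvcc_release(result.stdout)
```

Calling installed_toolkit_version() returns None on a machine without nvcc, which is itself a useful hint that the toolkit is missing or not on PATH.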

jonatanklosko commented 8 months ago

@ityonemo what OS do you use? On Debian/Ubuntu you can usually find the cuDNN package version with apt-cache policy libcudnn8.
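If the package manager does not help, another option is to read the version macros from the cuDNN header. The sketch below assumes a cuDNN 8 style install, where cudnn_version.h defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL; the two search paths are guesses at common locations, not something confirmed in this thread, and the function names are hypothetical.

```python
import re
from pathlib import Path

# Assumed common header locations; adjust for your install.
DEFAULT_HEADERS = (
    "/usr/include/cudnn_version.h",
    "/usr/local/cuda/include/cudnn_version.h",
)


def parse_cudnn_version(header_text):
    """Return "MAJOR.MINOR.PATCH" from cudnn_version.h contents, or None."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Match lines like "#define CUDNN_MAJOR 8".
        pattern = rf"^\s*#\s*define\s+{name}\s+(\d+)"
        match = re.search(pattern, header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(match.group(1))
    return ".".join(parts)


def find_cudnn_version(candidates=DEFAULT_HEADERS):
    """Return the version from the first header that exists, or None if none is found."""
    for path in map(Path, candidates):
        if path.is_file():
            return parse_cudnn_version(path.read_text(errors="replace"))
    return None
```

A None result from find_cudnn_version() means no header was found at the assumed paths, which may indicate that cuDNN (or at least its development package) is not installed.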

ityonemo commented 8 months ago

Unable to locate package libcudnn8

I guess I don't have cuDNN installed. Or I might have accidentally wiped it when I purged 11.8 =(

OK, thanks. I think we can close this; I will reopen if I install cuDNN and can't get it working.