elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)
Apache License 2.0

Error running Stable Diffusion example #266

Closed by gordoneliel 11 months ago

gordoneliel commented 11 months ago

Elixir version: 1.15.7 - OTP 26.1.2
Livebook version: 0.11.2
CUDA version: 12.3
GPU: RTX 4090

Trying out the Stable Diffusion examples and running into a cuDNN error in the "Text to image" section:

DNN library initialization failed. Look at the errors above for more details.


11:10:31.134 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

11:10:31.134 [info] XLA service 0x7f7cbc4bf190 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

11:10:31.134 [info]   StreamExecutor device (0): NVIDIA GeForce RTX 4090, Compute Capability 8.9

11:10:31.134 [info] Using BFC allocator.

11:10:31.134 [info] XLA backend allocating 22851610214 bytes on device 0 for BFCAllocator.

11:10:31.475 [error] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

11:10:31.475 [error] Memory usage: 1562574848 bytes free, 25390678016 bytes total.

Not sure if it's an OOM. GPU memory usage is 22.440Gi/23.988Gi (from nvtop) at the end of the error, but 24 GB should be enough?
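The figures in the log can be compared directly, which helps when judging whether this is a real out-of-memory condition. A minimal shell sketch, assuming only that `awk` is available; the byte counts are copied from the log lines above and the variable names are illustrative:

```shell
#!/bin/sh
# Byte counts copied from the log lines above.
allocated=22851610214   # "XLA backend allocating ... bytes on device 0"
total=25390678016       # "... bytes total"
free=1562574848         # "... bytes free"

# Share of total device memory reserved by the BFC allocator, in percent.
pct=$(LC_ALL=C awk -v a="$allocated" -v t="$total" \
  'BEGIN { printf "%.1f", 100 * a / t }')

# Memory still free at the time of the error, in GiB.
free_gib=$(LC_ALL=C awk -v f="$free" \
  'BEGIN { printf "%.2f", f / (1024 * 1024 * 1024) }')

echo "reserved by allocator: ${pct}% of device memory"
echo "free at time of error: ${free_gib} GiB"
```

The reservation works out to about 90% of the device, which suggests the high usage shown by nvtop reflects an up-front allocation rather than the model's actual footprint. That reading is an interpretation of the log, not something the thread confirms.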

jonatanklosko commented 11 months ago

Hey @gordoneliel, do other models work for you?

[error] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

This usually means cuDNN is not installed.

A couple of version checks that may be helpful:

# Verify CUDA version
nvcc --version
# Verify cuDNN version, make sure it's installed and that the package matches CUDA version
apt-cache policy libcudnn8 | head -n 3
# Check drivers and CUDA support
nvidia-smi

gordoneliel commented 11 months ago

NVCC:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0

libcudnn8

libcudnn8:
  Installed: (none)
  Candidate: 8.9.5.29-1+cuda12.2

nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:05:00.0  On |                  Off |
|  0%   42C    P8              22W / 450W |    492MiB / 24564MiB |      5%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2544      G   /usr/lib/xorg/Xorg                          199MiB |
|    0   N/A  N/A      2653      G   /usr/bin/gnome-shell                         80MiB |
|    0   N/A  N/A      3130    C+G   ...5020629,13399252839037686317,262144      180MiB |
+---------------------------------------------------------------------------------------+

jonatanklosko commented 11 months ago

Build cuda_12.3.r12.3

Can you try downgrading to CUDA 12.2?
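The suggestion rests on two numbers in the output above: `nvcc` reports toolkit release 12.3, while the `nvidia-smi` header reports CUDA Version 12.2, which is the highest CUDA version the installed driver supports. A rough sketch of extracting and comparing the two, using lines copied from the thread as sample input (the variable names are illustrative; on a real machine the lines would come from `nvcc --version` and `nvidia-smi`):

```shell
#!/bin/sh
# Sample lines copied from the outputs posted above.
nvcc_line='Cuda compilation tools, release 12.3, V12.3.52'
smi_line='| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |'

# Pull the major.minor version out of each line.
toolkit=$(printf '%s\n' "$nvcc_line" |
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p')
driver=$(printf '%s\n' "$smi_line" |
  sed -n 's/.*CUDA Version: \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p')

if [ "$toolkit" = "$driver" ]; then
  verdict="match"
else
  verdict="mismatch"
fi
echo "toolkit=$toolkit driver-supported=$driver ($verdict)"
```

A mismatch here is only a hint worth ruling out. Nothing in this thread establishes that the 12.3 versus 12.2 difference caused the error.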

gordoneliel commented 11 months ago

@jonatanklosko You won't believe this, but I installed libcudnn9 via apt (thought I already had it?) and it started working!

jonatanklosko commented 11 months ago

Perfect, for completeness you mean libcudnn8, right?

jonatanklosko commented 11 months ago

Oh, I somehow missed it:

libcudnn8:
  Installed: (none)
  Candidate: 8.9.5.29-1+cuda12.2

Note that it says "Installed: (none)": you had the repositories added, but the package itself was indeed not installed :)
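Since the "Installed" line is easy to overlook, the check can be scripted. A small sketch, assuming the `apt-cache policy` output format shown above; `is_installed` is a hypothetical helper name, not part of apt:

```shell
#!/bin/sh
# Reads `apt-cache policy <package>` output on stdin.
# Exits 0 if an installed version is reported, 1 if the output says
# "Installed: (none)" or contains no "Installed:" line at all.
is_installed() {
  awk '$1 == "Installed:" { found = 1; if ($2 != "(none)") ok = 1 }
       END { exit !(found && ok) }'
}

# Sample copied from the output posted in this thread.
sample='libcudnn8:
  Installed: (none)
  Candidate: 8.9.5.29-1+cuda12.2'

if printf '%s\n' "$sample" | is_installed; then
  status="installed"
else
  status="missing"
fi
echo "libcudnn8: $status"
```

On a real system the sample would be replaced by a pipe, for example `apt-cache policy libcudnn8 | is_installed`.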

gordoneliel commented 11 months ago

Ah, missed that too! Thanks for the command, really helpful!