elixir-nx / xla

Pre-compiled XLA extension
Apache License 2.0

`CUDNN_STATUS_NOT_INITIALIZED` with XLA `0.5.0` and CUDA `11.8` #53

Closed nmdaniil closed 9 months ago

nmdaniil commented 10 months ago

Hi!

The `exla` 0.6.0 library pulls in `xla` 0.5.0 as a dependency, and now the code below raises an error.

With the previous version (`xla` 0.4.4) everything worked fine.

Example

Nx.with_default_backend({EXLA.Backend, client: :cuda}, fn ->
  Nx.iota({10, 10})
  |> Nx.add(10)
end)

Error logs (CUDNN_STATUS_NOT_INITIALIZED)

16:33:45.833 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

16:33:45.833 [info] XLA service 0x7f343c699050 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

16:33:45.833 [info] StreamExecutor device (0): NVIDIA RTX A4000, Compute Capability 8.6

16:33:45.833 [info] Using BFC allocator.

16:33:45.833 [info] XLA backend allocating 15210204364 bytes on device 0 for BFCAllocator.

16:33:45.836 [error] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED

16:33:45.836 [error] Memory usage: 16730750976 bytes free, 16900227072 bytes total.

16:33:45.836 [error] Possibly insufficient driver version: 520.61.5

mix.exs

Mix.install(
  [
    {:nx, "0.6.0"},
    {:exla, "0.6.0"}
  ],
  config: [
    nx: [
      default_backend: EXLA.Backend,
      default_defn_options: [compiler: EXLA]
    ],
    exla: [
      default_client: :host
    ]
  ],
  system_env: [
    XLA_TARGET: "cuda118"
  ]
)

If I switch back to these versions, it works ✅

{:nx, "0.5.0"},
{:exla, "0.5.0"}
jonatanklosko commented 10 months ago

Hey @nmdaniil, what is your CUDA and cuDNN version?

nmdaniil commented 10 months ago

> Hey @nmdaniil, what is your CUDA and cuDNN version?

CUDA 11.8.89 cuDNN 8.9.4

I'm running inside a Docker container:

FROM tensorflow/tensorflow:2.13.0-gpu as tensorflow
jonatanklosko commented 10 months ago

This sounds good. You can try this config to see if it's related to preallocation:

exla: [
  clients: [
    host: [platform: :host],
    cuda: [platform: :cuda, preallocate: false]
  ]
]
nmdaniil commented 10 months ago

Nothing changed; I get exactly the same error.

Updated mix.exs

Mix.install(
  [
    {:nx, "0.6.0"},
    {:exla, "0.6.0"}
  ],
  config: [
    nx: [
      default_backend: EXLA.Backend,
      default_defn_options: [compiler: EXLA]
    ],
    exla: [
      clients: [
        host: [platform: :host],
        cuda: [platform: :cuda, preallocate: false]
      ]
    ]
  ],
  system_env: [
    XLA_TARGET: "cuda118"
  ]
)
jonatanklosko commented 10 months ago

> 16:33:45.836 [error] Possibly insufficient driver version: 520.61.5

Perhaps you could update the drivers?

nmdaniil commented 9 months ago

> 16:33:45.836 [error] Possibly insufficient driver version: 520.61.5
>
> Perhaps you could update the drivers?

Before this I had the 470 driver. I upgraded to 520, but that didn't fix the problem.

Why 520? Because that driver series matches CUDA 11.8. Driver 535 ships with CUDA 12.2, and I need 11.8, since TensorFlow supports CUDA 11.*.

But still I upgraded to 535 with cuda 12.2, and now a new error:

10:22:46.063 [info] XLA backend will use up to 15202123776 bytes on device 0 for BFCAllocator.
Could not load library libcublasLt.so.12. Error: libcublasLt.so.12: cannot open shared object file: No such file or directory

Again, if I pin the lower versions, everything works ✅

    {:nx, "0.5.0"},
    {:exla, "0.5.0"}
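The driver-to-CUDA pairing discussed above can be sanity-checked mechanically. A minimal sketch using `sort -V`, with sample version strings taken from this thread; the minimum Linux driver for CUDA 12.x (525.60.13) is an assumption from NVIDIA's release notes, and in practice the driver version would come from `nvidia-smi`:

```shell
# Compare the installed NVIDIA driver against the minimum required for a
# CUDA release. Sample values come from this thread; normally you would
# read the driver version from `nvidia-smi`.
driver="520.61.05"
min_for_cuda12="525.60.13"  # assumed Linux minimum for CUDA 12.x

# sort -V orders version strings numerically; the older one sorts first.
oldest=$(printf '%s\n%s\n' "$driver" "$min_for_cuda12" | sort -V | head -n 1)

if [ "$oldest" = "$driver" ] && [ "$driver" != "$min_for_cuda12" ]; then
  echo "driver $driver is too old for CUDA 12"
else
  echo "driver $driver satisfies CUDA 12"
fi
```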
jonatanklosko commented 9 months ago

EXLA 0.6.0 supports cuda 12 (for that you need XLA_TARGET=cuda120), but 0.5.0 does not.

I don't really know what else could cause the original issue. @seanmor5 have you run into something like this before?

nmdaniil commented 9 months ago

I set `XLA_TARGET=cuda120` and got another error:

11:05:45.737 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

11:05:45.737 [info] XLA service 0x7f2ed8010430 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

11:05:45.737 [info]   StreamExecutor device (0): NVIDIA RTX A4000, Compute Capability 8.6

11:05:45.737 [info] Using BFC allocator.

11:05:45.737 [info] XLA backend will use up to 15202123776 bytes on device 0 for BFCAllocator.

11:05:45.742 [error] There was an error before creating cudnn handle (302): cudaGetErrorName symbol not found. : cudaGetErrorString symbol not found.
jonatanklosko commented 9 months ago

For reference, please paste the output of these commands (wherever you are running the Elixir code):

  1. nvcc --version
  2. apt-cache policy libcudnn8 | head -n 3
  3. nvidia-smi
nmdaniil commented 9 months ago

With NVIDIA driver 520:

1.

tf-docker /app > nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

2.

tf-docker /app > apt-cache policy libcudnn8 | head -n 3
libcudnn8:
  Installed: 8.9.4.25-1+cuda12.2
  Candidate: 8.9.4.25-1+cuda12.2

3.

tf-docker /app > nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    On   | 00000000:2D:00.0 Off |                  Off |
| 41%   48C    P8     8W / 140W |      1MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

With NVIDIA driver 525, the output of commands 1 and 2 is exactly the same; only nvidia-smi changes:

tf-docker /app > nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    On   | 00000000:2D:00.0 Off |                  Off |
| 41%   47C    P8    15W / 140W |   1905MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
jonatanklosko commented 9 months ago

> libcudnn8:
>   Installed: 8.9.4.25-1+cuda12.2

The cuDNN package is installed for CUDA 12, but your CUDA is 11.8. Try `apt-get install libcudnn8=8.9.3.28-1+cuda11.8`.
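The mismatch here is visible in the package version suffix itself. A minimal sketch of that check, using sample strings from this thread; in practice the values would be captured from `apt-cache policy libcudnn8` and `nvcc --version`:

```shell
# Check that the "+cudaX.Y" suffix of the installed libcudnn8 package
# matches the CUDA toolkit release. Sample strings come from this
# thread; normally you would capture them from `apt-cache policy
# libcudnn8` and `nvcc --version`.
cudnn_pkg="8.9.4.25-1+cuda12.2"
cuda_release="11.8"

# Strip everything up to and including the "+cuda" marker.
cudnn_cuda="${cudnn_pkg##*+cuda}"

if [ "$cudnn_cuda" = "$cuda_release" ]; then
  echo "OK: cuDNN package matches CUDA $cuda_release"
else
  echo "MISMATCH: cuDNN built for CUDA $cudnn_cuda, toolkit is $cuda_release"
fi
```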

nmdaniil commented 9 months ago

Thanks, that helped! 👍

I added this installation step to the Docker image (tensorflow/tensorflow:2.13.0-gpu) and everything works!

FROM tensorflow/tensorflow:2.13.0-gpu as tensorflow
RUN apt-get update && apt-get install --no-install-recommends --allow-downgrades -y libcudnn8=8.9.3.28-1+cuda11.8
jonatanklosko commented 9 months ago

Awesome! :D