elixir-nx / nx

Multi-dimensional arrays (tensors) and numerical definitions for Elixir

Error loading CUDA libraries #1560

Closed andyl closed 2 days ago

andyl commented 2 days ago

For the past year I've been running NX successfully on Ubuntu 22.04 with an RTX 3060 GPU.

A couple weeks ago I had to re-install the OS. The new OS is also Ubuntu 22.04.

Since the re-install, all my GPU apps (ollama, fabric, aider, nvidia-smi, nvtop) work, but NX is broken.

Here's my NX test script:

#!/usr/bin/env elixir
IO.puts "--- START ---"

Mix.install(
  [
    {:exla, "~> 0.9"} 
  ],
  config: [
    nx: [
      default_backend: EXLA.Backend,
      default_defn_options: [compiler: EXLA]
    ],
    exla: [
      default_client: :cuda,
      clients: [
        host: [platform: :host],
        cuda: [platform: :cuda]
      ]
    ]
  ],
  system_env: [
    XLA_TARGET: "cuda12"
  ]
)

IO.puts("AAA")
a = Nx.tensor([1, 2, 3])
IO.puts "BBB"
b = Nx.tensor([4, 5, 6])
IO.puts "CCC"
Nx.add(a, b) |> IO.inspect(label: "DDD")

IO.puts "--- END ---"

Here's the script output:

--- START ---
2024-11-13 09:27:00.593435: I xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
AAA
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1731518820.724893   22888 cuda_executor.cc:1040] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1731518820.726270   22857 service.cc:146] XLA service 0x7236dc427830 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1731518820.726300   22857 service.cc:154]   StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6
I0000 00:00:1731518820.726656   22857 se_gpu_pjrt_client.cc:889] Using BFC allocator.
I0000 00:00:1731518820.726711   22857 gpu_helpers.cc:114] XLA backend allocating 11264222822 bytes on device 0 for BFCAllocator.
I0000 00:00:1731518820.726730   22857 gpu_helpers.cc:154] XLA backend will use up to 1251580313 bytes on device 0 for CollectiveBFCAllocator.
I0000 00:00:1731518820.726834   22857 cuda_executor.cc:1040] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
BBB
CCC

09:27:00.854 [error] There was an error before creating cudnn handle (302): Error loading CUDA libraries. GPU will not be used. : Error loading CUDA libraries. GPU will not be used.

09:27:00.856 [error] There was an error before creating cudnn handle (302): Error loading CUDA libraries. GPU will not be used. : Error loading CUDA libraries. GPU will not be used.
** (RuntimeError) DNN library initialization failed. Look at the errors above for more details.
    (exla 0.9.1) lib/exla/mlir/module.ex:147: EXLA.MLIR.Module.unwrap!/1
    (exla 0.9.1) lib/exla/mlir/module.ex:124: EXLA.MLIR.Module.compile/5
    (stdlib 6.1.1) timer.erl:590: :timer.tc/2
    (exla 0.9.1) lib/exla/defn.ex:432: anonymous fn/14 in EXLA.Defn.compile/8
    (exla 0.9.1) lib/exla/mlir/context_pool.ex:10: anonymous fn/3 in EXLA.MLIR.ContextPool.checkout/1
    (nimble_pool 1.1.0) lib/nimble_pool.ex:462: NimblePool.checkout!/4
    (exla 0.9.1) lib/exla/defn/locked_cache.ex:36: EXLA.Defn.LockedCache.run/2
    (stdlib 6.1.1) timer.erl:590: :timer.tc/2

Diagnostic output:

> nvidia-smi
Wed Nov 13 09:29:38 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        On  |   00000000:04:00.0 Off |                  N/A |
|  0%   21C    P8              6W /  170W |       4MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

Here's how I install the Nvidia dependencies:

#!/usr/bin/env bash 

vsn="565"  # 550, 565

# Add Nvidia package repository 

# sudo apt update
# sudo apt install -y software-properties-common
# sudo add-apt-repository ppa:graphics-drivers/ppa -y
# sudo apt update

# Install the kernelspace driver 

sudo apt install -yq nvidia-driver-$vsn

# Install userspace CUDA and related packages

sudo apt install -yq \
    nvidia-utils-$vsn \
    nvidia-compute-utils-$vsn \
    nvidia-cuda-toolkit \
    nvidia-cuda-dev \
    nvidia-gds

# Install userspace cuDNN 

sudo apt install -yq \
    libcudnn9-cuda-12 \
    libcudnn9-dev-cuda-12 \
    libcudnn9-samples

# Monitoring tools
sudo snap install nvtop
sudo apt install -y pciutils usbutils
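
Before running an install script like this, it may help to check which CUDA version the distribution's nvidia-cuda-toolkit package would provide, since the nvcc output above reports release 11.5 while the cuDNN packages target CUDA 12. A minimal sketch, assuming `apt-cache policy` prints a `Candidate:` line in its usual format; `candidate_major` is a hypothetical helper name, not part of any tool:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `apt-cache policy <pkg>` output on stdin and
# print the major version from the "Candidate:" line (e.g. 11 for 11.5.1-...).
candidate_major() {
  sed -n 's/^ *Candidate: *\([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

if command -v apt-cache >/dev/null 2>&1; then
  major="$(apt-cache policy nvidia-cuda-toolkit | candidate_major)"
  echo "nvidia-cuda-toolkit candidate major version: ${major:-unknown}"
else
  echo "apt-cache not found; skipping check"
fi
```

If the reported major version is not 12, the toolkit from that package would probably not match an XLA_TARGET of "cuda12".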

I'm using Elixir 1.17.3-otp-27, and the OS kernel is 6.8.0-48-generic. I get the same error when running inside an Elixir project with a mix.exs file. The shared libraries /usr/lib/x86_64-linux-gnu/libcuda.so and ~/.cache/mix/.../libexla.so both exist. Hmm...
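
Knowing that libcuda.so exists does not say which CUDA runtime versions the dynamic linker can see. One way to list them is to filter the `ldconfig -p` cache for versioned library names. This is only a sketch: the choice of libcudart, libcublas and libcudnn as the libraries worth checking is an assumption, and `cuda_lib_sonames` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ldconfig -p` output on stdin and print the
# distinct major-versioned names of a few CUDA libraries, one per line.
cuda_lib_sonames() {
  grep -oE 'lib(cudart|cublas|cudnn)\.so\.[0-9]+' | sort -u
}

if command -v ldconfig >/dev/null 2>&1; then
  # Prints nothing if none of these libraries are in the linker cache.
  ldconfig -p | cuda_lib_sonames || true
else
  echo "ldconfig not found on PATH"
fi
```

A result such as libcudart.so.11 next to libcudnn.so.9 would point to a CUDA 11 runtime sitting alongside a cuDNN build meant for CUDA 12.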

I'm out of ideas on how to fix! Any help appreciated!!!

polvalente commented 2 days ago

nvcc seems to be pointing out that you're using CUDA 11.5 (release 11.5, V11.5.119), while your script sets XLA_TARGET to "cuda12".

Please ensure the installed CUDA toolkit is 12.x.
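
The comparison can be scripted as a quick sanity check. The following is a rough sketch, assuming XLA_TARGET has the form "cuda" followed by the major version (as in "cuda12") and that `nvcc --version` prints a "release X.Y" line like the one shown above; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `nvcc --version` output on stdin and print the
# CUDA major version taken from the "release X.Y" line.
cuda_major_from_nvcc() {
  sed -n 's/.*release \([0-9][0-9]*\)\.[0-9][0-9]*.*/\1/p' | head -n 1
}

# Hypothetical helper: compare an XLA_TARGET such as "cuda12" ($1) with a
# CUDA major version ($2). Returns non-zero on a mismatch.
check_xla_target() {
  target="$1"
  major="$2"
  wanted="${target#cuda}"   # strip the "cuda" prefix, leaving e.g. "12"
  if [ "$major" = "$wanted" ]; then
    echo "ok: CUDA $major matches XLA_TARGET=$target"
  else
    echo "mismatch: nvcc reports CUDA ${major:-unknown}, but XLA_TARGET=$target"
    return 1
  fi
}

if command -v nvcc >/dev/null 2>&1; then
  check_xla_target "${XLA_TARGET:-cuda12}" "$(nvcc --version | cuda_major_from_nvcc)" || true
else
  echo "nvcc not found on PATH"
fi
```

With the nvcc output from this issue, the check would report a mismatch between CUDA 11 and cuda12.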

andyl commented 2 days ago

@polvalente - THANKS you found the problem - things are working now.