Closed — nmdaniil closed this issue 9 months ago
Hey @nmdaniil, what is your CUDA and cuDNN version?
CUDA 11.8.89
cuDNN 8.9.4
I'm running inside a docker container
FROM tensorflow/tensorflow:2.13.0-gpu as tensorflow
This sounds good. You can try this config to see if it's related to preallocation:
exla: [
  clients: [
    host: [platform: :host],
    cuda: [platform: :cuda, preallocate: false]
  ]
]
Nothing has changed, exactly the same error
Mix.install(
  [
    {:nx, "0.6.0"},
    {:exla, "0.6.0"}
  ],
  config: [
    nx: [
      default_backend: EXLA.Backend,
      default_defn_options: [compiler: EXLA]
    ],
    exla: [
      clients: [
        host: [platform: :host],
        cuda: [platform: :cuda, preallocate: false]
      ]
    ]
  ],
  system_env: [
    XLA_TARGET: "cuda118"
  ]
)
16:33:45.836 [error] Possibly insufficient driver version: 520.61.5
Perhaps you could update the drivers?
Before that I had the 470 driver. I upgraded to 520, but that didn't fix the problem.
Why did I upgrade to 520? Because that driver version matches CUDA 11.8. The 535 driver ships with CUDA 12.2, and I need 11.8, since TensorFlow only supports CUDA 11.*.
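As a side note, this kind of minimum-driver check can be sketched with a plain version compare. The 520.61.05 minimum for CUDA 11.8 here is an assumption taken from the error log above, and the `installed` value is a placeholder, not output from this machine:

```shell
# Compare an installed NVIDIA driver version against a required minimum
# using sort -V (version-aware sort). The smaller of the two versions
# sorts first, so if the minimum sorts first, the driver is new enough.
required="520.61.05"   # assumed minimum driver for CUDA 11.8 (from the log above)
installed="470.199.02" # sample value; in practice: nvidia-smi --query-gpu=driver_version --format=csv,noheader
if [ "$(printf '%s\n%s\n' "$required" "$installed" | sort -V | head -n1)" = "$required" ]; then
  echo "driver OK for CUDA 11.8"
else
  echo "driver too old for CUDA 11.8"
fi
```

With the sample 470-series value this prints that the driver is too old, which matches the "Possibly insufficient driver version" error above.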
Still, I upgraded to 535 with CUDA 12.2 anyway, and now there's a new error:
10:22:46.063 [info] XLA backend will use up to 15202123776 bytes on device 0 for BFCAllocator.
Could not load library libcublasLt.so.12. Error: libcublasLt.so.12: cannot open shared object file: No such file or directory
Although if I pin lower versions, everything works too ✅
{:nx, "0.5.0"},
{:exla, "0.5.0"}
EXLA 0.6.0 supports CUDA 12 (for that you need XLA_TARGET=cuda120), but 0.5.0 does not.
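For reference, the toolkit-to-XLA_TARGET mapping can be sketched like this. The cuda118 and cuda120 values are the two targets mentioned in this thread; the parsing of the `nvcc` release line is an illustration on a hard-coded sample, not a run against a real toolkit:

```shell
# Map the CUDA toolkit release (as printed by `nvcc --version`) to the
# matching XLA_TARGET value: cuda118 for an 11.x toolkit, cuda120 for 12.x.
release_line="Cuda compilation tools, release 11.8, V11.8.89"  # sample; in practice: nvcc --version | grep release
case "$release_line" in
  *"release 11."*) xla_target="cuda118" ;;
  *"release 12."*) xla_target="cuda120" ;;
  *)               xla_target="" ;;
esac
echo "XLA_TARGET=$xla_target"
```

With the sample 11.8 line this prints XLA_TARGET=cuda118, matching the system_env used in the Mix.install snippet above.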
I don't really know what else could cause the original issue. @seanmor5 have you run into something like this before?
I set XLA_TARGET=cuda120, and got another error:
11:05:45.737 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
11:05:45.737 [info] XLA service 0x7f2ed8010430 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
11:05:45.737 [info] StreamExecutor device (0): NVIDIA RTX A4000, Compute Capability 8.6
11:05:45.737 [info] Using BFC allocator.
11:05:45.737 [info] XLA backend will use up to 15202123776 bytes on device 0 for BFCAllocator.
11:05:45.742 [error] There was an error before creating cudnn handle (302): cudaGetErrorName symbol not found. : cudaGetErrorString symbol not found.
For the reference, please paste the output from these commands (wherever you are running the elixir code):
nvcc --version
apt-cache policy libcudnn8 | head -n 3
nvidia-smi
1.
tf-docker /app > nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
2.
tf-docker /app > apt-cache policy libcudnn8 | head -n 3
libcudnn8:
Installed: 8.9.4.25-1+cuda12.2
Candidate: 8.9.4.25-1+cuda12.2
3.
tf-docker /app > nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A4000 On | 00000000:2D:00.0 Off | Off |
| 41% 48C P8 8W / 140W | 1MiB / 16376MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
tf-docker /app > nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A4000 On | 00000000:2D:00.0 Off | Off |
| 41% 47C P8 15W / 140W | 1905MiB / 16376MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
libcudnn8:
Installed: 8.9.4.25-1+cuda12.2
The cuDNN package is installed for CUDA 12, but your CUDA is 11.8. Try apt-get install libcudnn8=8.9.3.28-1+cuda11.8.
Thanks, that helped! 👍
Added this installation to the docker image (tensorflow/tensorflow:2.13.0-gpu) and everything worked!
FROM tensorflow/tensorflow:2.13.0-gpu as tensorflow
RUN apt-get update && apt-get install --no-install-recommends --allow-downgrades -y libcudnn8=8.9.3.28-1+cuda11.8
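One possible follow-up (my own suggestion, not something verified in this thread): holding the downgraded package, so that a later apt-get upgrade during an image rebuild does not silently bump libcudnn8 back to the cuda12.2 build:

```dockerfile
FROM tensorflow/tensorflow:2.13.0-gpu as tensorflow
RUN apt-get update \
 && apt-get install --no-install-recommends --allow-downgrades -y libcudnn8=8.9.3.28-1+cuda11.8 \
 && apt-mark hold libcudnn8
```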
Awesome! :D
Hi!
The exla 0.6.0 library includes the xla 0.5.0 version in its dependencies, and now this code is giving an error, although with the previous version (xla 0.4.4) everything was fine.
Example
Error logs (CUDNN_STATUS_NOT_INITIALIZED)