Hi - I'm not 100% sure this is an EXLA error, but it's my best guess. I'm running into an issue when trying to run ops on tensors on CUDA (see below). Do you know what might be causing this? I've tried a few things (playing with :preallocate and :memory_fraction, reinstalling cuDNN, downgrading CUDA, etc.), but nothing has worked so far. I have verified that cuDNN was installed properly through here.
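For context, this is roughly how I set those client options in config.exs (the values shown are just examples; I tried several combinations):

```elixir
# config/config.exs
import Config

config :exla, :clients,
  cuda: [
    platform: :cuda,
    # Don't grab most of the GPU memory up front
    preallocate: false,
    # Alternatively, cap the fraction of GPU memory XLA may claim
    memory_fraction: 0.5
  ]

config :exla, :default_client, :cuda
```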
# running Nx -> 0.7.1, Exla -> 0.7.1, xla -> 0.6.0
iex(1)> t = Nx.tensor([1], backend: EXLA.Backend)
08:25:16.940 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
08:25:16.942 [info] XLA service <service> initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
08:25:16.942 [info] StreamExecutor device (0): NVIDIA RTX A6000, Compute Capability 8.6
08:25:16.942 [info] Using BFC allocator.
08:25:16.942 [info] XLA backend allocating 45932072140 bytes on device 0 for BFCAllocator.
#Nx.Tensor<
s64[1]
EXLA.Backend<cuda:0, 0.2762047049.2204500040.162304>
[1]
>
iex(2)> Nx.add(t, t)
08:23:37.926 [error] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
08:23:37.926 [error] Memory usage: 4734255104 bytes free, 51035635712 bytes total.
** (RuntimeError) DNN library initialization failed. Look at the errors above for more details.
(exla 0.7.1) lib/exla/mlir/module.ex:127: EXLA.MLIR.Module.unwrap!/1
(exla 0.7.1) lib/exla/mlir/module.ex:113: EXLA.MLIR.Module.compile/5
(stdlib 5.2.1) timer.erl:270: :timer.tc/2
(exla 0.7.1) lib/exla/defn.ex:599: anonymous fn/12 in EXLA.Defn.compile/8
(exla 0.7.1) lib/exla/mlir/context_pool.ex:10: anonymous fn/3 in EXLA.MLIR.ContextPool.checkout/1
(nimble_pool 1.0.0) lib/nimble_pool.ex:349: NimblePool.checkout!/4
(exla 0.7.1) lib/exla/defn/locked_cache.ex:36: EXLA.Defn.LockedCache.run/2
iex:1: (file)
Versions
OS: Ubuntu 22.04
Nvidia driver version: 545.29.06
CUDA version:
> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
cuDNN: installed via here and verified that it works (by following the verification section)