elixir-nx / xla

Pre-compiled XLA extension

"Could not create cudnn handle" error #79

Closed: gregszumel closed this issue 3 months ago

gregszumel commented 3 months ago

Hi - I'm not 100% sure this is an EXLA error, but it's my best guess. I'm running into an issue when trying to run ops on tensors on CUDA (see below). Do you know what might be causing this? I've tried a few things (playing around with :preallocate and :memory_fraction, reinstalling cuDNN, downgrading CUDA, etc.), but nothing has worked so far. I have verified that cuDNN was installed properly through here


# running Nx -> 0.7.1, EXLA -> 0.7.1, xla -> 0.6.0
iex(1)> t = Nx.tensor([1], backend: EXLA.Backend)

08:25:16.940 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

08:25:16.942 [info] XLA service <service> initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

08:25:16.942 [info]   StreamExecutor device (0): NVIDIA RTX A6000, Compute Capability 8.6

08:25:16.942 [info] Using BFC allocator.

08:25:16.942 [info] XLA backend allocating 45932072140 bytes on device 0 for BFCAllocator.
#Nx.Tensor<
  s64[1]
  EXLA.Backend<cuda:0, 0.2762047049.2204500040.162304>
  [1]
>

iex(2)> Nx.add(t, t)

08:23:37.926 [error] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

08:23:37.926 [error] Memory usage: 4734255104 bytes free, 51035635712 bytes total.
** (RuntimeError) DNN library initialization failed. Look at the errors above for more details.
    (exla 0.7.1) lib/exla/mlir/module.ex:127: EXLA.MLIR.Module.unwrap!/1
    (exla 0.7.1) lib/exla/mlir/module.ex:113: EXLA.MLIR.Module.compile/5
    (stdlib 5.2.1) timer.erl:270: :timer.tc/2
    (exla 0.7.1) lib/exla/defn.ex:599: anonymous fn/12 in EXLA.Defn.compile/8
    (exla 0.7.1) lib/exla/mlir/context_pool.ex:10: anonymous fn/3 in EXLA.MLIR.ContextPool.checkout/1
    (nimble_pool 1.0.0) lib/nimble_pool.ex:349: NimblePool.checkout!/4
    (exla 0.7.1) lib/exla/defn/locked_cache.ex:36: EXLA.Defn.LockedCache.run/2
    iex:1: (file)

Versions

polvalente commented 3 months ago

What's your cudnn version? IIRC we require cudnn8, not cudnn9

gregszumel commented 3 months ago

It is 9! I'll downgrade and report back

gregszumel commented 3 months ago

Fixed, thanks for the speedy reply!
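
For anyone landing here with the same error: the fix in this thread came down to the installed cuDNN major version (9 was installed, and per the reply above 8 was expected for this xla release). A minimal sketch for checking that from Elixir is below. It assumes cuDNN's `cudnn_version.h` header is installed and readable. The module name `CudnnVersion` and the default path are hypothetical and not part of Nx, EXLA, or xla; the header location varies by distribution and install method.

```elixir
defmodule CudnnVersion do
  @moduledoc """
  Hypothetical helper (not part of Nx, EXLA, or xla): reads the cuDNN major
  version from the `cudnn_version.h` header shipped with cuDNN 8 and later.
  """

  # Assumed default location; adjust for your distribution or CUDA install.
  @default_header "/usr/include/cudnn_version.h"

  @doc "Extracts the integer after `#define CUDNN_MAJOR` from header text."
  def parse_major(header_text) do
    case Regex.run(~r/^\s*#define\s+CUDNN_MAJOR\s+(\d+)/m, header_text) do
      [_, major] -> {:ok, String.to_integer(major)}
      nil -> :error
    end
  end

  @doc "Reads the header at `path`; returns `{:ok, major}` or `{:error, reason}`."
  def installed_major(path \\ @default_header) do
    with {:ok, text} <- File.read(path),
         {:ok, major} <- parse_major(text) do
      {:ok, major}
    else
      # File.read/1 failed (for example :enoent when the header is missing).
      {:error, reason} -> {:error, reason}
      # The file was read but contains no CUDNN_MAJOR definition.
      :error -> {:error, :cudnn_major_not_found}
    end
  end
end
```

Calling `CudnnVersion.installed_major()` in iex before loading EXLA returns `{:ok, major}` when the header is found. Going by this thread, a major version of 9 alongside xla 0.6.0 is the combination that produced the `CUDNN_STATUS_INTERNAL_ERROR` above, so it is worth comparing the result against the cuDNN requirement listed for your xla release.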