elixir-nx / xla

Pre-compiled XLA extension
Apache License 2.0

XLA unsupported on GFX1100 #63

Closed: clayscode closed this issue 7 months ago

clayscode commented 7 months ago
Mix.install(
  [
    {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
    {:exla, github: "elixir-nx/nx", sparse: "exla", override: true},
    {:kino, "~> 0.11.2"},
    {:bumblebee, "~> 0.4.2"},
    {:kino_bumblebee, "~> 0.4.0"}
  ],
  system_env: %{
    "XLA_ARCHIVE_URL" =>
      "https://static.jonatanklosko.com/builds/0.6.0/xla_extension-x86_64-linux-gnu-rocm.tar.gz",
    "ROCM_PATH" => "/opt/rocm-5.6.0"
  },
  config: [nx: [default_backend: {EXLA.Backend, client: :host}]]
)

Using the above setup, I'm unable to get nearly any of the "Neural Net Smartcell" tasks in Livebook working on my 7900 XTX.

With Stable Diffusion, I get the following errors:


18:10:28.124 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

18:10:28.124 [info] StreamExecutor ROCM device (0) is of unsupported AMDGPU version : gfx1100. The supported AMDGPU versions are gfx900, gfx906, gfx908, gfx90a, gfx1030.

18:10:28.127 [error] GenServer EXLA.Client terminating
** (RuntimeError) no supported devices found for platform ROCM
    (exla 0.7.0-dev) lib/exla/client.ex:196: EXLA.Client.unwrap!/1
    (exla 0.7.0-dev) lib/exla/client.ex:173: EXLA.Client.build_client/2
    (exla 0.7.0-dev) lib/exla/client.ex:136: EXLA.Client.handle_call/3
    (stdlib 5.1.1) gen_server.erl:1113: :gen_server.try_handle_call/4
    (stdlib 5.1.1) gen_server.erl:1142: :gen_server.handle_msg/6
    (stdlib 5.1.1) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.316.0>): {:client, :rocm, [platform: :rocm]}
State: :unused_state
Client #PID<0.316.0> is alive

    (stdlib 5.1.1) gen.erl:240: :gen.do_call/4
    (elixir 1.15.7) lib/gen_server.ex:1071: GenServer.call/3
    (exla 0.7.0-dev) lib/exla/defn.ex:268: EXLA.Defn.__compile__/4
    (nx 0.7.0-dev) lib/nx/defn.ex:305: Nx.Defn.compile/3
    (bumblebee 0.4.2) lib/bumblebee/diffusion/stable_diffusion.ex:243: Bumblebee.Diffusion.StableDiffusion.init/10
    (nx 0.7.0-dev) lib/nx/serving.ex:1803: anonymous fn/3 in Nx.Serving.Default.init/3
    (elixir 1.15.7) lib/enum.ex:4789: Enum.with_index_list/3
    (nx 0.7.0-dev) lib/nx/serving.ex:1799: Nx.Serving.Default.init/3

18:10:28.132 [error] Kino.listen with #Function<42.105768164/1 in :erl_eval.expr/6> failed with reason:

** (exit) exited in: GenServer.call(EXLA.Client, {:client, :rocm, [platform: :rocm]}, :infinity)
    ** (EXIT) an exception was raised:
        ** (RuntimeError) no supported devices found for platform ROCM
            (exla 0.7.0-dev) lib/exla/client.ex:196: EXLA.Client.unwrap!/1
            (exla 0.7.0-dev) lib/exla/client.ex:173: EXLA.Client.build_client/2
            (exla 0.7.0-dev) lib/exla/client.ex:136: EXLA.Client.handle_call/3
            (stdlib 5.1.1) gen_server.erl:1113: :gen_server.try_handle_call/4
            (stdlib 5.1.1) gen_server.erl:1142: :gen_server.handle_msg/6
            (stdlib 5.1.1) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
    (elixir 1.15.7) lib/gen_server.ex:1074: GenServer.call/3
    (exla 0.7.0-dev) lib/exla/defn.ex:268: EXLA.Defn.__compile__/4
    (nx 0.7.0-dev) lib/nx/defn.ex:305: Nx.Defn.compile/3
    (bumblebee 0.4.2) lib/bumblebee/diffusion/stable_diffusion.ex:243: Bumblebee.Diffusion.StableDiffusion.init/10
    (nx 0.7.0-dev) lib/nx/serving.ex:1803: anonymous fn/3 in Nx.Serving.Default.init/3
    (elixir 1.15.7) lib/enum.ex:4789: Enum.with_index_list/3
    (nx 0.7.0-dev) lib/nx/serving.ex:1799: Nx.Serving.Default.init/3

With Whisper I get

18:13:38.025 [info] StreamExecutor ROCM device (0) is of unsupported AMDGPU version : gfx1100. The supported AMDGPU versions are gfx900, gfx906, gfx908, gfx90a, gfx1030.

18:13:38.026 [error] GenServer EXLA.Client terminating
** (RuntimeError) no supported devices found for platform ROCM
    (exla 0.7.0-dev) lib/exla/client.ex:196: EXLA.Client.unwrap!/1
    (exla 0.7.0-dev) lib/exla/client.ex:173: EXLA.Client.build_client/2
    (exla 0.7.0-dev) lib/exla/client.ex:136: EXLA.Client.handle_call/3
    (stdlib 5.1.1) gen_server.erl:1113: :gen_server.try_handle_call/4
    (stdlib 5.1.1) gen_server.erl:1142: :gen_server.handle_msg/6
    (stdlib 5.1.1) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.329.0>): {:client, :rocm, [platform: :rocm]}
State: :unused_state
Client #PID<0.329.0> is alive

    (stdlib 5.1.1) gen.erl:240: :gen.do_call/4
    (elixir 1.15.7) lib/gen_server.ex:1071: GenServer.call/3
    (exla 0.7.0-dev) lib/exla/defn.ex:268: EXLA.Defn.__compile__/4
    (nx 0.7.0-dev) lib/nx/defn.ex:305: Nx.Defn.compile/3
    (bumblebee 0.4.2) lib/bumblebee/audio/speech_to_text_whisper.ex:63: anonymous fn/7 in Bumblebee.Audio.SpeechToTextWhisper.speech_to_text_whisper/5
    (nx 0.7.0-dev) lib/nx/serving.ex:1803: anonymous fn/3 in Nx.Serving.Default.init/3
    (elixir 1.15.7) lib/enum.ex:4789: Enum.with_index_list/3
    (nx 0.7.0-dev) lib/nx/serving.ex:1799: Nx.Serving.Default.init/3
18:13:38.028 [error] GenServer #PID<28339.147.0> terminating
** (stop) exited in: GenServer.call(EXLA.Client, {:client, :rocm, [platform: :rocm]}, :infinity)
    ** (EXIT) an exception was raised:
        ** (RuntimeError) no supported devices found for platform ROCM
            lib/exla/client.ex:196: EXLA.Client.unwrap!/1
            lib/exla/client.ex:173: EXLA.Client.build_client/2
            lib/exla/client.ex:136: EXLA.Client.handle_call/3
            (stdlib 5.1.1) gen_server.erl:1113: :gen_server.try_handle_call/4
            (stdlib 5.1.1) gen_server.erl:1142: :gen_server.handle_msg/6
            (stdlib 5.1.1) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
Last message: {:DOWN, #Reference<28339.528120633.3359375361.126349>, :process, #PID<28339.146.0>, {{%RuntimeError{message: "no supported devices found for platform ROCM"}, [{EXLA.Client, :unwrap!, 1, [file: ~c"lib/exla/client.ex", line: 196, error_info: %{module: Exception}]}, {EXLA.Client, :build_client, 2, [file: ~c"lib/exla/client.ex", line: 173]}, {EXLA.Client, :handle_call, 3, [file: ~c"lib/exla/client.ex", line: 136]}, {:gen_server, :try_handle_call, 4, [file: ~c"gen_server.erl", line: 1113]}, {:gen_server, :handle_msg, 6, [file: ~c"gen_server.erl", line: 1142]}, {:proc_lib, :init_p_do_apply, 3, [file: ~c"proc_lib.erl", line: 241]}]}, {GenServer, :call, [EXLA.Client, {:client, :rocm, [platform: :rocm]}, :infinity]}}}

I'm unsure if this is an upstream problem with XLA. I can't find any issues referencing GFX1100 in the XLA repo, but there is this pull request: https://github.com/openxla/xla/pull/2937

seanmor5 commented 7 months ago

There's not really much we can do here :(

We're at the mercy of XLA upstream. This may also be a ROCm thing. When I used ROCm years ago, there were a number of GPUs it did not support, IIRC.

jonatanklosko commented 7 months ago

You could try adding gfx1100 here:

https://github.com/elixir-nx/xla/blob/63f8d3a0056f0b9cf81b10f8dce5a3c82ef8bdd6/lib/xla.ex#L83

And then building from source by setting XLA_BUILD=1. If you run into environment issues you can alternatively use the Docker scripts (./build.sh rocm), then upload the binary somewhere and set XLA_ARCHIVE_URL accordingly.
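
For reference, a rough (untested) Mix.install sketch of that workflow; the local :xla path below is just a placeholder for a checkout with gfx1100 added to the supported targets:

Mix.install(
  [
    {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
    {:exla, github: "elixir-nx/nx", sparse: "exla", override: true},
    # placeholder path to a local elixir-nx/xla checkout with gfx1100 patched in
    {:xla, path: "/path/to/patched/xla", override: true}
  ],
  system_env: %{
    # build the extension from source instead of fetching a precompiled archive
    "XLA_BUILD" => "1",
    "XLA_TARGET" => "rocm",
    "ROCM_PATH" => "/opt/rocm-5.6.0"
  }
)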

clayscode commented 7 months ago

You could try adding gfx1100 here:

https://github.com/elixir-nx/xla/blob/63f8d3a0056f0b9cf81b10f8dce5a3c82ef8bdd6/lib/xla.ex#L83

And then building from source by setting XLA_BUILD=1. If you run into environment issues you can alternatively use the Docker scripts (./build.sh rocm), then upload the binary somewhere and set XLA_ARCHIVE_URL accordingly.

I made that change and also added gfx1100 to https://github.com/openxla/xla/blob/a01af1af923cf66271c0f03b2962dff58068af5e/xla/stream_executor/device_description.h#L174 and https://github.com/openxla/xla/blob/a01af1af923cf66271c0f03b2962dff58068af5e/xla/stream_executor/device_description.h#L207, but unfortunately I get this error:


08:20:12.567 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

08:20:12.567 [info] XLA service 0x7f437c6272e0 initialized for platform ROCM (this does not guarantee that XLA will be used). Devices:

08:20:12.567 [info]   StreamExecutor device (0): Radeon RX 7900 XTX, AMDGPU ISA version: gfx1100

08:20:12.567 [info] Using BFC allocator.

08:20:12.567 [info] XLA backend allocating 23177723904 bytes on device 0 for BFCAllocator.

08:20:12.567 [error] INTERNAL: RET_CHECK failure (xla/pjrt/gpu/se_gpu_pjrt_client.cc:960) options.num_nodes == 1 || kv_get != nullptr 
*** Begin stack trace ***
    tsl::CurrentStackTrace[abi:cxx11]()

    xla::status_macros::MakeErrorStream::Impl::GetStatus()
    xla::GetStreamExecutorGpuClient(xla::GpuClientOptions const&)
    xla::GetStreamExecutorGpuClient(bool, xla::GpuAllocatorConfig const&, int, int, std::optional<std::set<int, std::less<int>, std::allocator<int> > > const&, std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, bool, std::function<absl::lts_20230802::StatusOr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (std::basic_string_view<char, std::char_traits<char> >, absl::lts_20230802::Duration)>, std::function<absl::lts_20230802::Status (std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >)>, bool)
    exla::GetGpuClient(double, bool, xla::GpuAllocatorConfig::Kind)
    get_gpu_client(enif_environment_t*, int, unsigned long const*)
    beam_jit_call_nif(process*, void const*, unsigned long*, unsigned long (*)(enif_environment_t*, int, unsigned long*), erl_module_nif*)

*** End stack trace ***

08:20:12.570 [error] GenServer EXLA.Client terminating
** (RuntimeError) RET_CHECK failure (xla/pjrt/gpu/se_gpu_pjrt_client.cc:960) options.num_nodes == 1 || kv_get != nullptr 
    (exla 0.7.0-dev) lib/exla/client.ex:196: EXLA.Client.unwrap!/1
    (exla 0.7.0-dev) lib/exla/client.ex:173: EXLA.Client.build_client/2
    (exla 0.7.0-dev) lib/exla/client.ex:136: EXLA.Client.handle_call/3
    (stdlib 5.1.1) gen_server.erl:1113: :gen_server.try_handle_call/4
    (stdlib 5.1.1) gen_server.erl:1142: :gen_server.handle_msg/6
    (stdlib 5.1.1) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.313.0>): {:client, :rocm, [platform: :rocm]}
State: :unused_state
Client #PID<0.313.0> is alive

    (stdlib 5.1.1) gen.erl:240: :gen.do_call/4
    (elixir 1.15.7) lib/gen_server.ex:1071: GenServer.call/3
    (exla 0.7.0-dev) lib/exla/defn.ex:268: EXLA.Defn.__compile__/4
    (nx 0.7.0-dev) lib/nx/defn.ex:305: Nx.Defn.compile/3
    (bumblebee 0.4.2) lib/bumblebee/text/fill_mask.ex:65: anonymous fn/7 in Bumblebee.Text.FillMask.fill_mask/3
    (nx 0.7.0-dev) lib/nx/serving.ex:1810: anonymous fn/3 in Nx.Serving.Default.init/3
    (elixir 1.15.7) lib/enum.ex:1693: Enum."-map/2-lists^map/1-1-"/2
    (nx 0.7.0-dev) lib/nx/serving.ex:1808: anonymous fn/3 in Nx.Serving.Default.init/3

08:20:12.576 [error] Kino.listen with #Function<42.105768164/1 in :erl_eval.expr/6> failed with reason:

** (exit) exited in: GenServer.call(EXLA.Client, {:client, :rocm, [platform: :rocm]}, :infinity)
    ** (EXIT) an exception was raised:
        ** (RuntimeError) RET_CHECK failure (xla/pjrt/gpu/se_gpu_pjrt_client.cc:960) options.num_nodes == 1 || kv_get != nullptr 
            (exla 0.7.0-dev) lib/exla/client.ex:196: EXLA.Client.unwrap!/1
            (exla 0.7.0-dev) lib/exla/client.ex:173: EXLA.Client.build_client/2
            (exla 0.7.0-dev) lib/exla/client.ex:136: EXLA.Client.handle_call/3
            (stdlib 5.1.1) gen_server.erl:1113: :gen_server.try_handle_call/4
            (stdlib 5.1.1) gen_server.erl:1142: :gen_server.handle_msg/6
            (stdlib 5.1.1) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
    (elixir 1.15.7) lib/gen_server.ex:1074: GenServer.call/3
    (exla 0.7.0-dev) lib/exla/defn.ex:268: EXLA.Defn.__compile__/4
    (nx 0.7.0-dev) lib/nx/defn.ex:305: Nx.Defn.compile/3
    (bumblebee 0.4.2) lib/bumblebee/text/fill_mask.ex:65: anonymous fn/7 in Bumblebee.Text.FillMask.fill_mask/3
    (nx 0.7.0-dev) lib/nx/serving.ex:1810: anonymous fn/3 in Nx.Serving.Default.init/3
    (elixir 1.15.7) lib/enum.ex:1693: Enum."-map/2-lists^map/1-1-"/2
    (nx 0.7.0-dev) lib/nx/serving.ex:1808: anonymous fn/3 in Nx.Serving.Default.init/3

EDIT: Hmm, this is failing on checking how many devices should be used (which should just be one). Even when I set XLA_TARGET=rocm or cpu, it still fails.

Just to double check, I reverted to this commit to make sure there weren't any breaking changes in XLA recently and I still get the same error.

clayscode commented 7 months ago

Realized I was using a cached build that was still on HEAD. Deleted the cache, and now it works!

Sort of...

Screen Shot 2023-11-19 at 11 35 24 AM

The first test of mask labeling with BERT works okay, but image generation first maxes out my GPU and then crashes. Even after the Elixir process is stopped, my GPU is still maxed.

For some reason, XLA is allocating all available VRAM on startup:


11:42:51.320 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

11:42:51.320 [info] XLA service 0x7f3c34471930 initialized for platform ROCM (this does not guarantee that XLA will be used). Devices:

11:42:51.320 [info]   StreamExecutor device (0): Radeon RX 7900 XTX, AMDGPU ISA version: gfx1100

11:42:51.320 [info] Using BFC allocator.

11:42:51.321 [info] XLA backend allocating 23177723904 bytes on device 0 for BFCAllocator.

All text-based tasks run fine. Audio and image tasks always crash, it seems.

seanmor5 commented 7 months ago

EXLA preallocates ~90% of GPU memory as a bit of an optimization: https://jax.readthedocs.io/en/latest/gpu_memory_allocation.html

You can disable it in your client configuration by setting preallocate: false, or control the fraction of memory to allocate with memory_fraction (e.g. memory_fraction: 0.75 for 75% preallocation).
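
For example, in config/config.exs (or the :config option of Mix.install), something along these lines (client names are just an example):

import Config

config :exla,
  clients: [
    host: [platform: :host],
    # disable preallocation entirely...
    rocm: [platform: :rocm, preallocate: false]
    # ...or instead cap it, e.g.: rocm: [platform: :rocm, memory_fraction: 0.75]
  ],
  preferred_clients: [:rocm, :host]

config :nx, default_backend: {EXLA.Backend, client: :rocm}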

It's possible, in the case where your memory never frees up, that there is a beam.smp process running in the background that never got cleaned up. Once the VM process dies, the claimed memory should get cleaned up. Can you check whether the VM just did not shut down at all even though the program crashed?

clayscode commented 7 months ago

EXLA preallocates ~90% of GPU memory as a bit of an optimization: https://jax.readthedocs.io/en/latest/gpu_memory_allocation.html

You can disable it in your client configuration by setting preallocate: false, or control the fraction of memory to allocate with memory_fraction (e.g. memory_fraction: 0.75 for 75% preallocation).

It's possible, in the case where your memory never frees up, that there is a beam.smp process running in the background that never got cleaned up. Once the VM process dies, the claimed memory should get cleaned up. Can you check whether the VM just did not shut down at all even though the program crashed?

I'm not able to reproduce the GPU lockup issue. I'm not entirely sure whether the memory not getting cleared is an XLA or an Elixir issue, as I've occasionally run into the same thing with other applications. I think this is a ROCm driver thing.

The memory was released after a few minutes. Now, every time I try image generation, it initially fills up memory, crashes, and then memory/GPU utilization goes back to 0%.

For preallocate, where do I set that?

clayscode commented 7 months ago

https://github.com/elixir-nx/xla/assets/16871737/cf045c4d-13d3-4079-be91-1bae4018e145

Here's a screen recording of what happens

It seems there's a bug in EXLA when you change the default config options?

Mix.install(
  [
    {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
    {:exla, github: "elixir-nx/nx", sparse: "exla", override: true},
    {:kino, "~> 0.11.2"},
    {:bumblebee, "~> 0.4.2"},
    {:kino_bumblebee, "~> 0.4.0"}
  ],
  config: [
    nx: [default_backend: {EXLA.Backend, client: :rocm}],
    exla: [
      clients: [
        host: [platform: :host],
        rocm: [platform: :rocm, preallocate: false]
      ]
    ]
  ],
  system_env: %{
    "XLA_ARCHIVE_URL" =>
      "file:///home/clay/rocm_builds/xla_extension-x86_64-linux-gnu-rocm2.tar.gz",
    "XLA_TARGET" => "rocm"
  },
  force: true
)

Trying to turn off preallocate, I get


12:46:48.388 [error] Kino.listen with #Function<42.105768164/1 in :erl_eval.expr/6> failed with reason:

** (RuntimeError) unknown client :cuda given as :preferred_clients. If you plan to use :cuda or :rocm, make sure the XLA_TARGET environment variable is appropriately set. Currently it is set to "rocm"
    (exla 0.7.0-dev) lib/exla/client.ex:34: anonymous fn/3 in EXLA.Client.default_name/0
    (elixir 1.15.7) lib/enum.ex:4279: Enum.find_list/3
    (exla 0.7.0-dev) lib/exla/client.ex:31: EXLA.Client.default_name/0
    (elixir 1.15.7) lib/keyword.ex:1383: Keyword.pop_lazy/3
    (exla 0.7.0-dev) lib/exla/defn.ex:266: EXLA.Defn.__compile__/4
    (nx 0.7.0-dev) lib/nx/defn.ex:305: Nx.Defn.compile/3
    (bumblebee 0.4.2) lib/bumblebee/text/fill_mask.ex:65: anonymous fn/7 in Bumblebee.Text.FillMask.fill_mask/3

Mix.install(
  [
    {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
    {:exla, github: "elixir-nx/nx", sparse: "exla", override: true},
    {:kino, "~> 0.11.2"},
    {:bumblebee, "~> 0.4.2"},
    {:kino_bumblebee, "~> 0.4.0"}
  ],
  config: [
    nx: [default_backend: {EXLA.Backend, client: :rocm}],
    exla: [
      clients: [
        rocm: [platform: :rocm, preallocate: false]
      ],
      preferred_clients: [:rocm]
    ]
  ],
  system_env: %{
    "XLA_ARCHIVE_URL" =>
      "file:///home/clay/rocm_builds/xla_extension-x86_64-linux-gnu-rocm2.tar.gz",
    "XLA_TARGET" => "rocm"
  },
  force: true
)

This seems to work to disable preallocation. However, GPU utilization is pinned at 100% and it still crashes.

clayscode commented 7 months ago

https://github.com/elixir-nx/xla/issues/63#issuecomment-1817870171

So in theory GFX1100 should be supported now as of https://github.com/openxla/xla/pull/7197; however, the latest upstream code fails with the error above.

jonatanklosko commented 7 months ago

How much VRAM do you have? Do other models work? For image generation try num_images_per_prompt: 1, if you have more currently.
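
For context, here's a sketch along the lines of the Bumblebee Stable Diffusion example (exact loading options can differ between Bumblebee versions), with a single image per prompt:

repository_id = "CompVis/stable-diffusion-v1-4"

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-large-patch14"})
{:ok, clip} = Bumblebee.load_model({:hf, repository_id, subdir: "text_encoder"})

{:ok, unet} =
  Bumblebee.load_model({:hf, repository_id, subdir: "unet"},
    params_filename: "diffusion_pytorch_model.bin"
  )

{:ok, vae} =
  Bumblebee.load_model({:hf, repository_id, subdir: "vae"},
    architecture: :decoder,
    params_filename: "diffusion_pytorch_model.bin"
  )

{:ok, scheduler} = Bumblebee.load_scheduler({:hf, repository_id, subdir: "scheduler"})

serving =
  Bumblebee.Diffusion.StableDiffusion.text_to_image(clip, unet, vae, tokenizer, scheduler,
    num_steps: 20,
    # keep this at 1 to reduce memory usage
    num_images_per_prompt: 1,
    defn_options: [compiler: EXLA]
  )

Nx.Serving.run(serving, "a serene mountain lake at sunrise, detailed, high quality")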

clayscode commented 7 months ago

How much VRAM do you have? Do other models work? For image generation try num_images_per_prompt: 1, if you have more currently.

24 GB. It seems something has changed between XLA versions. On the latest version I have to change https://github.com/elixir-nx/nx/blob/d15acedb63ec083736c40d1ac67f805a3c101b7c/exla/c_src/exla/exla_client.cc#L498 to xla::GetStreamExecutorGpuClient(false, allocator_config, 0, 1), and then it initializes successfully.

I'm able to run pretty much everything dealing with text. However, it crashes immediately when I try Whisper or Stable Diffusion. Setting num_images_per_prompt makes no difference.

In the logs I see this message on startup: 2023-11-23 08:33:13.541439: E xla/stream_executor/plugin_registry.cc:90] Invalid plugin kind specified: DNN. Other than that, nothing is logged before or after the crash.

2023-11-23 08:40:25.148429: E xla/stream_executor/plugin_registry.cc:90] Invalid plugin kind specified: DNN

08:40:42.339 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

08:40:42.339 [info] XLA service 0x7f89d4044d10 initialized for platform ROCM (this does not guarantee that XLA will be used). Devices:

08:40:42.339 [info]   StreamExecutor device (0): Radeon RX 7900 XTX, AMDGPU ISA version: gfx1100

08:40:42.339 [info] Using BFC allocator.

08:40:42.339 [info] XLA backend will use up to 23177723904 bytes on device 0 for BFCAllocator.

08:40:42.339 [info] number of nodes: 1

08:40:42.339 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

jonatanklosko commented 7 months ago

Ah, if it SEGFAULTs on Whisper and Stable Diffusion then it matches https://github.com/elixir-nx/xla/issues/58#issuecomment-1802226098. That's an upstream issue though. You can try putting it in an Elixir script and getting a core dump (some tips: https://github.com/elixir-nx/xla/issues/58#issuecomment-1801718310).

It seems something has changed between XLA versions. On the latest version I have to set

That's usually the case, they change things around quite a lot :)

clayscode commented 7 months ago

Ah, if it SEGFAULTs on Whisper and Stable Diffusion then it matches #58 (comment). That's an upstream issue though. You can try putting it in an Elixir script and getting a core dump (some tips: #58 (comment)).

It seems something has changed between XLA versions. On the latest version I have to set

That's usually the case, they change things around quite a lot :)


09:18:44.335 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

09:18:44.336 [info] XLA service 0x7fba4c007220 initialized for platform ROCM (this does not guarantee that XLA will be used). Devices:

09:18:44.336 [info] StreamExecutor device (0): Radeon RX 7900 XTX, AMDGPU ISA version: gfx1100

09:18:44.336 [info] Using BFC allocator.

09:18:44.336 [info] XLA backend will use up to 23177723904 bytes on device 0 for BFCAllocator.

09:18:44.336 [info] number of nodes: 1

09:18:44.336 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

Segmentation fault (core dumped)


Ah, seems to be the case! 

So it looks like the Whisper and Stable Diffusion segfaults are unresolved? That's unfortunate, as these models run fine with PyTorch. I tried using the Torch backend with Nx without much luck.

Anyways, thanks for your help! 

clayscode commented 7 months ago

Core was generated by `/usr/lib/erlang/erts-14.1.1/bin/beam.smp -- -root /usr/lib/erlang -bindir /usr/'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f1a2454f3a8 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, unsigned long, std::allocator<char> const&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
[Current thread is 1 (Thread 0x7f19b64fa640 (LWP 517642))]
(gdb) bt
#0  0x00007f1a2454f3a8 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, unsigned long, std::allocator<char> const&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00007f1850a5c97d in miopen::kernels[abi:cxx11]() () from /opt/rocm-5.6.0/lib/libMIOpen.so.1
#2  0x00007f1850b0eb78 in miopen::GetKernelSrc(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) ()
   from /opt/rocm-5.6.0/lib/libMIOpen.so.1
#3  0x00007f18516d3315 in miopen::HIPOCProgramImpl::BuildCodeObject(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/rocm-5.6.0/lib/libMIOpen.so.1
#4  0x00007f18516d2d27 in miopen::HIPOCProgramImpl::HIPOCProgramImpl(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, miopen::TargetProperties const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/rocm-5.6.0/lib/libMIOpen.so.1
#5  0x00007f18516d387b in miopen::HIPOCProgram::HIPOCProgram(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, miopen::TargetProperties const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/rocm-5.6.0/lib/libMIOpen.so.1
#6  0x00007f18516ceea5 in miopen::Handle::LoadProgram(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const ()
   from /opt/rocm-5.6.0/lib/libMIOpen.so.1
#7  0x00007f1850d28225 in ?? () from /opt/rocm-5.6.0/lib/libMIOpen.so.1
#8  0x00007f1850d26beb in miopen::solver::PrecompileKernels(miopen::Handle const&, std::vector<miopen::solver::KernelInfo, std::allocator<miopen::solver::KernelInfo> > const&) () from /opt/rocm-5.6.0/lib/libMIOpen.so.1
#9  0x00007f1850d26f39 in miopen::solver::PrecompileSolutions(miopen::Handle const&, std::vector<miopen::solver::ConvSolution const*, std::allocator<miopen::solver::ConvSolution const*> > const&) () from /opt/rocm-5.6.0/lib/libMIOpen.so.1
#10 0x00007f18515a2603 in ?? () from /opt/rocm-5.6.0/lib/libMIOpen.so.1
#11 0x00007f18515b1945 in ?? () from /opt/rocm-5.6.0/lib/libMIOpen.so.1
#12 0x00007f185158b4d8 in miopen::ConvolutionDescriptor::FindConvFwdAlgorithm(miopen::Handle&, miopen::TensorDescriptor const&, void const*, miopen::TensorDescriptor const&, void const*, miopen::TensorDescriptor const&, void*, int, int*, miopenConvAlgoPerf_t*, void*, unsigned long, bool) const ()
   from /opt/rocm-5.6.0/lib/libMIOpen.so.1
#13 0x00007f1850bd297c in miopenFindConvolutionForwardAlgorithm () from /opt/rocm-5.6.0/lib/libMIOpen.so.1
#14 0x00007f18c209c9d3 in stream_executor::gpu::MIOpenSupport::GetMIOpenConvolveAlgorithmsFindMode(stream_executor::dnn::ConvolutionKind, stream_executor::dnn::DataType, stream_executor::Stream*, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemoryBase, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemoryBase, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemoryBase, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::ScratchAllocator*, std::vector<stream_executor::dnn::ProfileResult, std::allocator<stream_executor::dnn::ProfileResult> >*) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#15 0x00007f18c20a0016 in stream_executor::gpu::MIOpenSupport::GetConvolveRunners(bool, stream_executor::dnn::ConvolutionKind, stream_executor::dnn::DataType, stream_executor::dnn::DataType, stream_executor::Stream*, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemoryBase, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemoryBase, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemoryBase, stream_executor::dnn::ConvolutionDescriptor const&, bool, stream_executor::ScratchAllocator*, stream_executor::NumericOptions const&, std::vector<std::unique_ptr<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const, std::default_delete<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const> >, std::allocator<std::unique_ptr<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const, std::default_delete<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const> > > >*) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#16 0x00007f18c21324a3 in stream_executor::StreamExecutor::GetConvolveRunners(bool, stream_executor::dnn::ConvolutionKind, stream_executor::dnn::DataType, stream_executor::dnn::DataType, stream_executor::Stream*, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemoryBase, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemoryBase, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemoryBase, stream_executor::dnn::ConvolutionDescriptor const&, bool, stream_executor::ScratchAllocator*, stream_executor::NumericOptions const&, std::vector<std::unique_ptr<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const, std::default_delete<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const> >, std::allocator<std::unique_ptr<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const, std::default_delete<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const> > > >*) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#17 0x00007f18be23be43 in xla::gpu::(anonymous namespace)::GetMIOpenAlgorithms(xla::HloCustomCallInstruction const*, absl::lts_20230802::Span<stream_executor::DeviceMemoryBase>, absl::lts_20230802::Span<stream_executor::DeviceMemoryBase>, stream_executor::StreamExecutor*, xla::gpu::(anonymous namespace)::ScratchAllocator*, stream_executor::Stream*, stream_executor::NumericOptions const&) [clone .constprop.0] ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#18 0x00007f18be23d43a in xla::gpu::GpuConvAlgorithmPicker::PickBestAlgorithmNoCacheRocm(xla::HloCustomCallInstruction const*, stream_executor::DeviceMemoryAllocator*, stream_executor::Stream*) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#19 0x00007f18be23f4f0 in xla::gpu::GpuConvAlgorithmPicker::PickBestAlgorithmNoCache(xla::HloCustomCallInstruction const*) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#20 0x00007f18be23f8f0 in std::_Function_handler<absl::lts_20230802::StatusOr<xla::AutotuneResult> (), xla::gpu::GpuConvAlgorithmPicker::PickBestAlgorithm(xla::HloCustomCallInstruction const*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#21 0x00007f18be244c28 in xla::gpu::AutotunerUtil::Autotune(xla::HloInstruction const*, xla::gpu::AutotuneConfig const&, std::function<absl::lts_20230802::StatusOr<xla::AutotuneResult> ()> const&) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#22 0x00007f18be23ae41 in xla::gpu::GpuConvAlgorithmPicker::PickBestAlgorithm(xla::HloCustomCallInstruction const*) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#23 0x00007f18be23fcb2 in xla::gpu::GpuConvAlgorithmPicker::RunOnInstruction(xla::HloInstruction*) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#24 0x00007f18be241777 in xla::gpu::GpuConvAlgorithmPicker::RunOnComputation(xla::HloComputation*) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#25 0x00007f18be241a5b in xla::gpu::GpuConvAlgorithmPicker::Run(xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char> >, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char> > > > const&) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#26 0x00007f18bef84747 in absl::lts_20230802::StatusOr<bool> xla::HloPassPipeline::RunPassesInternal<xla::HloModule>(xla::HloModule*, xla::DebugOptions const&, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char> >, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char> > > > const&) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#27 0x00007f18bef8553b in xla::HloPassPipeline::Run(xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char> >, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char> > > > const&) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#28 0x00007f18bb8e7ac5 in xla::HloPassInterface::Run(xla::HloModule*) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#29 0x00007f18bb8f05f5 in xla::gpu::GpuCompiler::OptimizeHloPostLayoutAssignment(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&, tsl::thread::ThreadPool*) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#30 0x00007f18bb8e14bf in xla::gpu::AMDGPUCompiler::OptimizeHloPostLayoutAssignment(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&, tsl::thread::ThreadPool*) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#31 0x00007f18bb8fbf9c in xla::gpu::GpuCompiler::OptimizeHloModule(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#32 0x00007f18bb8ff827 in xla::gpu::GpuCompiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule> >, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#33 0x00007f18bb8d0271 in xla::Service::BuildExecutable(xla::HloModuleProto const&, std::unique_ptr<xla::HloModuleConfig, std::default_delete<xla::HloModuleConfig> >, xla::Backend*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, bool) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#34 0x00007f18bb50c3ad in xla::LocalService::CompileExecutables(xla::XlaComputation const&, absl::lts_20230802::Span<xla::Shape const* const>, xla::ExecutableBuildOptions const&) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#35 0x00007f18bb4fcbc0 in xla::LocalClient::Compile(xla::XlaComputation const&, absl::lts_20230802::Span<xla::Shape const* const>, xla::ExecutableBuildOptions const&) () from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#36 0x00007f18bb4b78bc in xla::PjRtStreamExecutorClient::Compile(xla::XlaComputation const&, xla::CompileOptions) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#37 0x00007f18bb48c303 in xla::StreamExecutorGpuClient::Compile(xla::XlaComputation const&, xla::CompileOptions) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so
#38 0x00007f19b532ec6e in exla::ExlaClient::Compile(xla::XlaComputation const&, std::vector<xla::Shape*, std::allocator<xla::Shape*> >, xla::ExecutableBuildOptions&, bool) () from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/libexla.so
#39 0x00007f19b52ed784 in compile(enif_environment_t*, int, unsigned long const*) ()
   from /home/clay/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/e9b05ec726fde46f3adbe51a48ca79ef/_build/dev/lib/exla/priv/libexla.so
#40 0x0000564912175a78 in erts_call_dirty_nif ()
#41 0x00005649120358b6 in erts_dirty_process_main ()
#42 0x0000564911f908a6 in ?? ()
#43 0x0000564912239f03 in ?? ()
#44 0x00007f1a24094ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#45 0x00007f1a24126a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

@jonatanklosko following the instructions from the other issue, here's the backtrace. I can provide the core dump as well, although it's quite large (1.9 GB).

jonatanklosko commented 7 months ago

Awesome, perhaps we can minimise the example; here's the one from the other thread:

left = Nx.reshape(Nx.iota({9}), {1, 1, 3, 3})
right = Nx.reshape(Nx.iota({4}), {4, 1, 1, 1})
Nx.Defn.jit_apply(&Nx.conv/3, [left, right, []])

@seanmor5 is that stacktrace enough information for an upstream issue?

seanmor5 commented 7 months ago

I think so, yeah. It may be worth trying to increase the dirty NIF stack size first, though, unless you've tried that already. We have to do this for all convolutions on NVIDIA GPUs.

jonatanklosko commented 7 months ago

(ELIXIR_ERL_OPTIONS="+sssdio 128 +sssdcpu 128")

clayscode commented 7 months ago

(ELIXIR_ERL_OPTIONS="+sssdio 128 +sssdcpu 128")

This works!! Specifically +sssdcpu 128.

Screen Shot 2023-11-23 at 10 20 47 AM

Granted it takes like 5x longer than it should to produce an image, but it works now.

jonatanklosko commented 7 months ago

Perfect!

Granted it takes like 5x longer than it should to produce an image, but it works now.

Yeah, there are many optimisations we are yet to do for Stable Diffusion, tracked by https://github.com/elixir-nx/bumblebee/issues/147.

gallexme commented 6 months ago

May I ask which XLA version/rev works now with gfx1100 and Elixir xla/exla? I get ** (RuntimeError) RET_CHECK failure (xla/pjrt/gpu/se_gpu_pjrt_client.cc:960) options.num_nodes == 1 || kv_get != nullptr

on 0eace6346026b51f8e069a0d670c49b3d4d23a79

clayscode commented 6 months ago

May I ask which XLA version/rev works now with gfx1100 and Elixir xla/exla? I get ** (RuntimeError) RET_CHECK failure (xla/pjrt/gpu/se_gpu_pjrt_client.cc:960) options.num_nodes == 1 || kv_get != nullptr

on 0eace6346026b51f8e069a0d670c49b3d4d23a79

I had to patch EXLA to make it work. Specifically, I had to change https://github.com/elixir-nx/nx/blob/d94aa08c8cc5c926a8c08ef2832f8526fbf80cb2/exla/c_src/exla/exla_client.cc#L498 to

  EXLA_ASSIGN_OR_RETURN(std::unique_ptr<xla::PjRtClient> client,
    xla::GetStreamExecutorGpuClient(false, allocator_config, 0, 1));

Looks like I'm on revision b0ec7bafd525948f34804d5ad1d9c5939d1b0562

I think that's all I did to make it work.

josevalim commented 6 months ago

Thanks @clayscode, I sent a PR for this here: https://github.com/elixir-nx/nx/pull/1407

jalberto commented 1 month ago

@clayscode sorry for bothering you, could you provide the binary you built?

I am unable to reproduce the build

clayscode commented 1 month ago

@clayscode sorry for bothering you, could you provide the binary you built?

I am unable to reproduce the build

I believe I built using Docker and ROCm 5.6. Unsure if 6.0 will work. I'll try to dig up the files at some point this weekend.