elixir-nx / xla

Pre-compiled XLA extension
Apache License 2.0

Support for ROCM 6 #82

Open jalberto opened 1 month ago

jalberto commented 1 month ago

ROCm 5.6 seems to kind of work, but getting everything running takes too much back and forth. The new Fedora 40 brings official ROCm support, but only starting with ROCm 6.

I am using this config from https://github.com/elixir-nx/xla/issues/63:

Mix.install(
  [
    {:web_driver_client, "~> 0.2.0"},
    {:kino, "~> 0.12.3"},
    {:req, "~> 0.4.14"},
    {:erlexec, "~> 2.0"},
    {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
    {:exla, github: "elixir-nx/nx", sparse: "exla", override: true}
  ],
  system_env: %{
    "XLA_ARCHIVE_URL" =>
      "https://static.jonatanklosko.com/builds/0.6.0/xla_extension-x86_64-linux-gnu-rocm.tar.gz",
    "ROCM_PATH" => "/usr/lib64/rocm/"
  },
  config: [nx: [default_backend: {EXLA.Backend, client: :host}]]
)

I managed to find every package it was asking for (this took a while of back and forth) until I hit this:

18:36:37.767 [warning] The on_load function for module Elixir.EXLA.NIF returned:
{:error,
 {:load_failed,
  ~c"Failed to load NIF library /home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/f3927a87654a1bf097d7e31b6277a9f8/_build/dev/lib/exla/priv/libexla: 'librocblas.so.3: cannot open shared object file: No such file or directory'"}}

My guess is that xla_extension needs to be built for ROCm 6 (librocblas.so.4). I tried to build it myself, but the requirements are too far off from my current system (gcc versions and so on).
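
To confirm which librocblas versions are actually installed, a small check along these lines may help. This is only a sketch: the `/usr/lib64` directory is an assumption about where Fedora places the ROCm libraries, and the variable names are made up for illustration.

```elixir
# Hypothetical diagnostic, not part of the original report.
# Assumption: ROCm shared libraries live directly under /usr/lib64
# (adjust the directory for your distribution).
lib_dir = "/usr/lib64"

# Collect the file names of every installed librocblas variant.
installed =
  [lib_dir, "librocblas.so*"]
  |> Path.join()
  |> Path.wildcard()
  |> Enum.map(&Path.basename/1)

# The soname the precompiled archive asked for in the error above.
wanted = "librocblas.so.3"

if wanted in installed do
  IO.puts("#{wanted} is present in #{lib_dir}")
else
  IO.puts("#{wanted} not found; installed: #{inspect(installed)}")
end
```

If only `librocblas.so.4` shows up, the archive was built against an older ROCm than the one installed.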

It would be great if there were official XLA binaries for different ROCm versions, as there are for CUDA.

I understand ROCm support is low priority, but it is a really nice way to get started in AI, as it works nicely on Linux.

jalberto commented 1 month ago

I am also trying to reproduce the build by using the provided dockerfiles, but I always get errors:

[3,765 / 6,478] Compiling mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp; 22s local ... (16 actions, 15 running)
ERROR: /app/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/xla/service/gpu/BUILD:1158:23: Compiling xla/service/gpu/cub_sort_kernel.cu.cc failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command (from target //xla/service/gpu:cub_sort_kernel_u32) external/local_config_rocm/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer ... (remaining 100 arguments skipped)
clang++: warning: argument unused during compilation: '-fcuda-flush-denormals-to-zero' [-Wunused-command-line-argument]
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr41 = V_MOV_B32_dpp undef $vgpr41(tied-def 0), $vgpr4, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr4 = V_MOV_B32_dpp undef $vgpr4(tied-def 0), killed $vgpr3, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr3 = V_MOV_B32_dpp undef $vgpr3(tied-def 0), $vgpr2, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr101 = V_MOV_B32_dpp undef $vgpr101(tied-def 0), $vgpr99, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr98 = V_MOV_B32_dpp undef $vgpr98(tied-def 0), $vgpr96, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr101 = V_MOV_B32_dpp undef $vgpr101(tied-def 0), $vgpr99, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr98 = V_MOV_B32_dpp undef $vgpr98(tied-def 0), $vgpr96, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr42 = V_MOV_B32_dpp undef $vgpr42(tied-def 0), $vgpr8, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr101 = V_MOV_B32_dpp undef $vgpr101(tied-def 0), $vgpr99, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr98 = V_MOV_B32_dpp undef $vgpr98(tied-def 0), $vgpr96, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr101 = V_MOV_B32_dpp undef $vgpr101(tied-def 0), $vgpr99, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr98 = V_MOV_B32_dpp undef $vgpr98(tied-def 0), $vgpr96, 322, 15, 15, 0, implicit $exec
12 errors generated when compiling for gfx1036.
Target //xla/extension:xla_extension failed to build

jonatanklosko commented 1 month ago

Did you try building by setting the XLA revision as in https://github.com/elixir-nx/xla/issues/63#issuecomment-1844195261?

Setting up the right environment for building was an issue before, which is why we have the Dockerfile. I don't know about ROCm 6. My best guess is that updating to a newer XLA could fix the build, but that usually involves changes to EXLA too. I think it would be a good idea to update sometime soon anyway, but no guarantees.

You could perhaps use Docker with 5.6 for computations/experimentation altogether, though I get it's not very convenient.

jonatanklosko commented 1 month ago

@jalberto I updated to the latest XLA revision and EXLA main already uses that. I tried building with ROCm 5.7, but there were errors indicating that XLA already assumes 6.0 (using symbols defined in 6.0+). So I updated the Docker image and managed to successfully build with ROCm 6.0.

Please try XLA_ARCHIVE_URL=https://static.jonatanklosko.com/builds/0.7.0/xla_extension-x86_64-linux-gnu-rocm.tar.gz and nx/exla main. If it doesn't work, you can also try building locally.
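
For reference, the suggestion above could look roughly like this when adapted from the config in the original report. This is a sketch, not a verified setup: the `ROCM_PATH` value is carried over from that report and is assumed to still apply, and the unrelated dependencies are dropped for brevity.

```elixir
Mix.install(
  [
    # Git dependencies without a ref follow the default branch (main).
    {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
    {:exla, github: "elixir-nx/nx", sparse: "exla", override: true}
  ],
  system_env: %{
    # Archive built against ROCm 6.0, as linked in the comment above.
    "XLA_ARCHIVE_URL" =>
      "https://static.jonatanklosko.com/builds/0.7.0/xla_extension-x86_64-linux-gnu-rocm.tar.gz",
    # Assumption: same ROCm location as in the original report.
    "ROCM_PATH" => "/usr/lib64/rocm/"
  },
  config: [nx: [default_backend: {EXLA.Backend, client: :host}]]
)
```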

jalberto commented 1 month ago

Thanks @jonatanklosko, I will test and report back.

jalberto commented 3 weeks ago

@jonatanklosko sorry for the delay, now I have a different error:

: CommandLine Error: Option 'x86-disable-avoid-SFB' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options

jonatanklosko commented 3 weeks ago

@jalberto is it when loading the precompiled binary or during build?

jalberto commented 3 weeks ago

[image]

That is what happens when I try to rebuild without cache. The LLVM error shows up in the console when I start the Livebook server.

jalberto commented 3 weeks ago

@jonatanklosko in case it helps:

* Getting nx (https://github.com/elixir-nx/nx.git - origin/main)
remote: Enumerating objects: 22709, done.
remote: Counting objects: 100% (4025/4025), done.
remote: Compressing objects: 100% (780/780), done.
remote: Total 22709 (delta 3456), reused 3661 (delta 3202), pack-reused 18684
* Getting exla (https://github.com/elixir-nx/nx.git - origin/main)
remote: Enumerating objects: 22709, done.
remote: Counting objects: 100% (4047/4047), done.
remote: Compressing objects: 100% (776/776), done.
remote: Total 22709 (delta 3480), reused 3687 (delta 3228), pack-reused 18662
Resolving Hex dependencies...
Resolution completed in 0.126s
New:
  castore 1.0.7
  certifi 2.12.0
  complex 0.5.0
  elixir_make 0.8.4
  erlexec 2.0.6
  finch 0.18.0
  fss 0.1.1
  hackney 1.20.1
  hpax 0.2.0
  idna 6.1.1
  jason 1.4.1
  kino 0.12.3
  metrics 1.0.1
  mime 2.0.5
  mimerl 1.3.0
  mint 1.6.0
  nimble_options 1.1.1
  nimble_ownership 0.3.1
  nimble_pool 1.1.0
  parse_trans 3.4.1
  req 0.4.14
  ssl_verify_fun 1.1.7
  table 0.1.2
  telemetry 1.2.1
  tesla 1.9.0
  unicode_util_compat 0.7.0
  web_driver_client 0.2.0
  xla 0.7.0
* Getting web_driver_client (Hex package)
* Getting kino (Hex package)
* Getting req (Hex package)
* Getting erlexec (Hex package)
* Getting telemetry (Hex package)
* Getting xla (Hex package)
* Getting elixir_make (Hex package)
* Getting nimble_pool (Hex package)
* Getting complex (Hex package)
* Getting finch (Hex package)
* Getting jason (Hex package)
* Getting mime (Hex package)
* Getting nimble_ownership (Hex package)
* Getting castore (Hex package)
* Getting mint (Hex package)
* Getting nimble_options (Hex package)
* Getting hpax (Hex package)
* Getting fss (Hex package)
* Getting table (Hex package)
* Getting hackney (Hex package)
* Getting tesla (Hex package)
* Getting certifi (Hex package)
* Getting idna (Hex package)
* Getting metrics (Hex package)
* Getting mimerl (Hex package)
* Getting parse_trans (Hex package)
* Getting ssl_verify_fun (Hex package)
* Getting unicode_util_compat (Hex package)
==> table
Compiling 5 files (.ex)
Generated table app
==> mime
Compiling 1 file (.ex)
Generated mime app
==> nimble_options
Compiling 3 files (.ex)
Generated nimble_options app
===> Analyzing applications...
===> Compiling unicode_util_compat
===> Analyzing applications...
===> Compiling idna
===> Analyzing applications...
===> Compiling telemetry
==> jason
Compiling 10 files (.ex)
Generated jason app
==> hpax
Compiling 4 files (.ex)
Generated hpax app
===> Analyzing applications...
===> Compiling mimerl
==> ssl_verify_fun
Compiling 7 files (.erl)
Generated ssl_verify_fun app
==> fss
Compiling 4 files (.ex)
Generated fss app
==> complex
Compiling 2 files (.ex)
Generated complex app
==> nx
Compiling 35 files (.ex)
Generated nx app
==> kino
Compiling 47 files (.ex)
Generated kino app
===> Analyzing applications...
===> Compiling certifi
===> Analyzing applications...
===> Compiling parse_trans
==> nimble_pool
Compiling 2 files (.ex)
Generated nimble_pool app
===> Fetching rebar3_hex v7.0.7
===> Fetching hex_core v0.8.4
===> Fetching verl v1.1.1
===> Analyzing applications...
===> Compiling hex_core
===> Compiling verl
===> Compiling rebar3_hex
===> Fetching rebar3_ex_doc v0.2.22
===> Analyzing applications...
===> Compiling rebar3_ex_doc
make: Entering directory '/home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/946037843196e7227084dde47bdabba6/deps/erlexec/c_src'
g++ -g -std=c++11 -finline-functions -Wall -DHAVE_PTRACE -MMD -DUSE_POLL=1 -O3 -DNDEBUG -DHAVE_SETRESUID -DHAVE_PIPE2   -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -I/home/ja/.local/share/mise/installs/erlang/26.2.5/lib/erl_interface-5.5.1/include  -c -o ei++.o ei++.cpp
g++ -g -std=c++11 -finline-functions -Wall -DHAVE_PTRACE -MMD -DUSE_POLL=1 -O3 -DNDEBUG -DHAVE_SETRESUID -DHAVE_PIPE2   -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -I/home/ja/.local/share/mise/installs/erlang/26.2.5/lib/erl_interface-5.5.1/include  -c -o exec.o exec.cpp
g++ -g -std=c++11 -finline-functions -Wall -DHAVE_PTRACE -MMD -DUSE_POLL=1 -O3 -DNDEBUG -DHAVE_SETRESUID -DHAVE_PIPE2   -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -I/home/ja/.local/share/mise/installs/erlang/26.2.5/lib/erl_interface-5.5.1/include  -c -o exec_impl.o exec_impl.cpp
mkdir -p /home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/946037843196e7227084dde47bdabba6/deps/erlexec/priv/x86_64-redhat-linux/
mkdir -p "/home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/946037843196e7227084dde47bdabba6/deps/erlexec/priv/x86_64-redhat-linux/"
g++ ei++.o exec.o exec_impl.o -L/home/ja/.local/share/mise/installs/erlang/26.2.5/lib/erl_interface-5.5.1/lib -lei -o /home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/946037843196e7227084dde47bdabba6/deps/erlexec/priv/x86_64-redhat-linux/exec-port
make: Leaving directory '/home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/946037843196e7227084dde47bdabba6/deps/erlexec/c_src'
===> Analyzing applications...
===> Compiling erlexec
===> Analyzing applications...
===> Compiling metrics
===> Analyzing applications...
===> Compiling hackney
==> castore
Compiling 1 file (.ex)
Generated castore app
==> elixir_make
Compiling 8 files (.ex)
Generated elixir_make app
==> xla
Compiling 2 files (.ex)
Generated xla app
==> exla
Unpacking /home/ja/.cache/xla/0.7.0/cache/external/xla_extension-4j534fd5eueir3oelhrj2pvadm.tar.gz into /home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/946037843196e7227084dde47bdabba6/deps/exla/exla/cache
Using libexla.so from /home/ja/.cache/xla/exla/elixir-1.16.2-erts-14.2.5-xla-0.7.0-exla-0.7.1-4hm2i3sdtzvi2nwhnlfl4jx27u/libexla.so
g++ -fPIC -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -DLLVM_VERSION_STRING= -O3 -c c_src/exla/exla.cc -o cache/objs/exla.o
g++ -fPIC -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -DLLVM_VERSION_STRING= -O3 -c c_src/exla/exla_mlir.cc -o cache/objs/exla_mlir.o
g++ -fPIC -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -DLLVM_VERSION_STRING= -O3 -c c_src/exla/custom_calls.cc -o cache/objs/custom_calls.o
g++ -fPIC -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -DLLVM_VERSION_STRING= -O3 -c c_src/exla/exla_client.cc -o cache/objs/exla_client.o
g++ -fPIC -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -DLLVM_VERSION_STRING= -O3 -c c_src/exla/exla_cuda.cc -o cache/objs/exla_cuda.o
g++ -fPIC -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -DLLVM_VERSION_STRING= -O3 -c c_src/exla/exla_nif_util.cc -o cache/objs/exla_nif_util.o
g++ cache/objs/exla.o cache/objs/exla_mlir.o cache/objs/custom_calls.o cache/objs/exla_client.o cache/objs/exla_nif_util.o cache/objs/exla_cuda.o -o cache/libexla.so -Lcache/xla_extension/lib -lxla_extension -shared -Wl,-rpath,'$ORIGIN/xla_extension/lib'
Compiling 23 files (.ex)

jonatanklosko commented 3 weeks ago

As a sanity check, try without XLA_ARCHIVE_URL, which by default should download just the CPU-enabled binary. This way we will know if it is specific to the ROCm binary. Make sure to reinstall without cache.
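
A minimal sketch of that sanity check: the same dependencies with no `XLA_ARCHIVE_URL`, plus the `force: true` option of `Mix.install/2` so the install cache is skipped. It assumes `XLA_ARCHIVE_URL` is not also exported in the shell that starts Livebook.

```elixir
Mix.install(
  [
    {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
    {:exla, github: "elixir-nx/nx", sparse: "exla", override: true}
  ],
  # No system_env here, so the default CPU-only archive is used.
  config: [nx: [default_backend: {EXLA.Backend, client: :host}]],
  # Ignore the cached install and rebuild everything from scratch.
  force: true
)
```

If this loads cleanly, the problem is specific to the ROCm archive rather than to the nx/exla revision.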

jalberto commented 3 weeks ago

yes, that worked as expected, no issues

As a side note: I have the same issues building with the new Dockerfile.

jonatanklosko commented 3 weeks ago

I see. I have no idea where this LLVM error is coming from; I couldn't find x86-disable-avoid-SFB or X86AvoidStoreForwardingBlocks mentioned explicitly in the openxla/xla source. You can try building yourself with XLA_BUILD=1 just in case, but that's a long shot (and that's provided it builds without issues) :<