Closed Eiji7 closed 1 month ago
There are other issues about compiling ROCm which you can investigate. Unfortunately, those issues are really coming from Bazel, so there may not be much we can do from this project.
@josevalim For now I don't have any ideas, but I can work on my setup if you have some. I see that not many people use ROCm here, so I can do the testing if you can guide me on what to do next.
Yeah, it's really Bazel and XLA. ROCm is definitely not as prioritized and widely used, so there seem to be more issues with getting the build environment right. I would try building the binary within Docker, see https://github.com/elixir-nx/xla/issues/63#issuecomment-1817744344.
@jonatanklosko may I ask how you are able to build the binary in Docker?
I am trying to reproduce it on a Linux machine using the provided Dockerfile and I get a ton of errors. I am able to solve some, but I reach a point where it seems I need to start modifying code in the libraries, not only in the environment.
@jalberto interesting, the build itself doesn't require an actual GPU, so the Docker build should be reproducible. What kind of errors are you getting?
@jonatanklosko I tried in a clean env with a new clone of the repo. I also remove the `build` and `.cache` dirs before each run and use `build/build.sh rocm`:
1st error, easy to solve:
[2/2] STEP 19/21: COPY Makefile Makefile.win ./
Error: building at STEP "COPY Makefile Makefile.win ./": checking on sources under "/home/ja/Projects/Misc/tmp/xla": copier: stat: "/Makefile.win": no such file or directory
After that fix, we are on the right path: Successfully tagged localhost/xla-rocm:latest
After a while:
[1,954 / 6,477] Compiling xla/service/gpu/runtime3/custom_call_thunk.cc; 4s local ... (16 actions, 15 running)
ERROR: /root/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/xla/service/gpu/BUILD:1158:23: Compiling xla/service/gpu/cub_sort_kernel.cu.cc failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command (from target //xla/service/gpu:cub_sort_kernel_f64) external/local_config_rocm/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer ... (remaining 100 arguments skipped)
Warning: HIP_PLATFORM=hcc is deprecated. Please use HIP_PLATFORM=amd.
clang++: warning: argument unused during compilation: '-fcuda-flush-denormals-to-zero' [-Wunused-command-line-argument]
Warning: HIP_PLATFORM=hcc is deprecated. Please use HIP_PLATFORM=amd.
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr43 = V_MOV_B32_dpp undef $vgpr43(tied-def 0), $vgpr4, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr4 = V_MOV_B32_dpp undef $vgpr4(tied-def 0), killed $vgpr3, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr3 = V_MOV_B32_dpp undef $vgpr3(tied-def 0), $vgpr2, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr47 = V_MOV_B32_dpp undef $vgpr47(tied-def 0), $vgpr45, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr44 = V_MOV_B32_dpp undef $vgpr44(tied-def 0), $vgpr43, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr47 = V_MOV_B32_dpp undef $vgpr47(tied-def 0), $vgpr45, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr44 = V_MOV_B32_dpp undef $vgpr44(tied-def 0), $vgpr43, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr43 = V_MOV_B32_dpp undef $vgpr43(tied-def 0), $vgpr4, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr47 = V_MOV_B32_dpp undef $vgpr47(tied-def 0), $vgpr45, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr44 = V_MOV_B32_dpp undef $vgpr44(tied-def 0), $vgpr43, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr47 = V_MOV_B32_dpp undef $vgpr47(tied-def 0), $vgpr45, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr44 = V_MOV_B32_dpp undef $vgpr44(tied-def 0), $vgpr43, 322, 15, 15, 0, implicit $exec
12 errors generated when compiling for gfx1036.
Target //xla/extension:xla_extension failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 150.258s, Critical Path: 42.81s
INFO: 1972 processes: 451 internal, 1521 local.
FAILED: Build did NOT complete successfully
make: *** [Makefile:26: /build/0.6.0/cache/build/xla_extension-x86_64-linux-gnu-rocm.tar.gz] Error 1
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. If you are using
Ubuntu or any other Debian-based system, install the packages
"build-essential". Also install "erlang-dev" package if not
included in your Erlang/OTP version. If you're on Fedora, run
"dnf group install 'Development Tools'".
Then I changed `HIP_PLATFORM` as indicated in the warning, and the build progresses a bit further, until:
[3,920 / 6,478] Compiling xla/mlir_hlo/mhlo/transforms/legalize_to_linalg/legalize_to_linalg.cc; 17s local ... (16 actions, 15 running)
ERROR: /root/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/xla/service/gpu/BUILD:1158:23: Compiling xla/service/gpu/cub_sort_kernel.cu.cc failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command (from target //xla/service/gpu:cub_sort_kernel_f32) external/local_config_rocm/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer ... (remaining 100 arguments skipped)
clang++: warning: argument unused during compilation: '-fcuda-flush-denormals-to-zero' [-Wunused-command-line-argument]
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr42 = V_MOV_B32_dpp undef $vgpr42(tied-def 0), $vgpr4, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr4 = V_MOV_B32_dpp undef $vgpr4(tied-def 0), killed $vgpr3, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr3 = V_MOV_B32_dpp undef $vgpr3(tied-def 0), $vgpr2, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr105 = V_MOV_B32_dpp undef $vgpr105(tied-def 0), $vgpr103, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr102 = V_MOV_B32_dpp undef $vgpr102(tied-def 0), $vgpr100, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr105 = V_MOV_B32_dpp undef $vgpr105(tied-def 0), $vgpr103, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr102 = V_MOV_B32_dpp undef $vgpr102(tied-def 0), $vgpr100, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr42 = V_MOV_B32_dpp undef $vgpr42(tied-def 0), $vgpr4, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr105 = V_MOV_B32_dpp undef $vgpr105(tied-def 0), $vgpr103, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr102 = V_MOV_B32_dpp undef $vgpr102(tied-def 0), $vgpr100, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr105 = V_MOV_B32_dpp undef $vgpr105(tied-def 0), $vgpr103, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr102 = V_MOV_B32_dpp undef $vgpr102(tied-def 0), $vgpr100, 322, 15, 15, 0, implicit $exec
12 errors generated when compiling for gfx1036.
[3,925 / 6,478] Compiling xla/mlir_hlo/mhlo/transforms/legalize_to_linalg/legalize_to_linalg.cc; 18s local ... (15 actions, 14 running)
Target //xla/extension:xla_extension failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 597.807s, Critical Path: 85.17s
INFO: 3940 processes: 454 internal, 3486 local.
FAILED: Build did NOT complete successfully
make: *** [Makefile:26: /build/0.6.0/cache/build/xla_extension-x86_64-linux-gnu-rocm.tar.gz] Error 1
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. If you are using
Ubuntu or any other Debian-based system, install the packages
"build-essential". Also install "erlang-dev" package if not
included in your Erlang/OTP version. If you're on Fedora, run
"dnf group install 'Development Tools'".
@jalberto ah yeah, the first error is because I removed the file and forgot to update, I've just fixed it on main. The build error is very confusing. I suspected the base image may have changed, but it hasn't. I can't think of anything else that could've changed since I built using that image :<
@jalberto I've just run `build/build.sh rocm` on a fresh AWS amd64 instance with Ubuntu 20.04 and it ran without failure. I'm wondering if the issue could be that you are building on a machine with an actual GPU and the build somehow runs additional logic/checks because of that, but I'm really just guessing.
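One way to check this guess against a build log is to pull the compile targets out of the error summary lines and compare them with the hard-coded `TF_ROCM_AMDGPU_TARGETS` list quoted in the original report. The sketch below is only an illustration: the function names are hypothetical, the regular expression assumes the summary line format seen in the logs above, and a mismatch does not by itself explain why a given target was selected.

```python
import re

# Default list as quoted in the original report (TF_ROCM_AMDGPU_TARGETS).
DEFAULT_TARGETS = "gfx900,gfx906,gfx908,gfx90a,gfx1030".split(",")

# Summary line printed per target, e.g.
# "12 errors generated when compiling for gfx1036."
SUMMARY = re.compile(r"(\d+) errors? generated when compiling for (gfx\w+)\.")


def failing_targets(log_text):
    """Return {target: total_error_count} for every summary line in the log."""
    result = {}
    for match in SUMMARY.finditer(log_text):
        count, target = match.groups()
        result[target] = result.get(target, 0) + int(count)
    return result


def unexpected_targets(log_text, expected=DEFAULT_TARGETS):
    """Return failing targets that are not part of the expected list."""
    return sorted(t for t in failing_targets(log_text) if t not in expected)


sample = "12 errors generated when compiling for gfx1036.\n"
print(failing_targets(sample))
print(unexpected_targets(sample))
```

For the logs posted in this thread, the only target that shows up is `gfx1036`, which is not in the quoted default list.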
@jonatanklosko that could be, but I am not mounting any device, so the container has no access to `/dev/dri`.
I will keep trying; maybe it is my system, but the main reason to build in containers is to isolate the build from the host, so it is very odd.
@Eiji7 you can try the new release and use ROCm 6.0, see https://github.com/elixir-nx/xla/issues/82#issuecomment-2124230058.
@jonatanklosko Oh, that's definitely interesting. However, I would need to wait for the Gentoo maintainers first, since version 6 is masked because of runtime issues, see:
# Patrick Lauer patrick@gentoo.org (2023-12-23)
# ROCm-6 builds but has runtime issues for me
Source: gentoo/gentoo@563b5ab
Yeah, it looks like latest XLA requires 6.0+, so I think this ship has sailed on this side.
I don't think there's anything else we can do for 5.7, so I'm going to close this in favour of #82. Feel free to drop more comments if anything changes!
For what it's worth, IREE might be able to provide a way out. We're focusing on Metal support, but we just might get ROCm "for free"
Hi, I have Gentoo Linux with the latest updates. I was fighting with ROCm support and ended up with this package set:
with the following `USE` flags for `gcc`:
and these environment variables:
Regardless of what I should and can install, there are lots of weird problems:
- `TF_ROCM_AMDGPU_TARGETS` is set in code without a way to change it and is set to `"gfx900,gfx906,gfx908,gfx90a,gfx1030"`. Not only does this build support for many GPUs, which is rarely important, but I also need to edit the `xla` source code to support new cards (mine uses `gfx1100`).
- `rocm_configure.bzl` only in theory supports a `ROCM_PATH` other than `/opt/rocm` or `/opt/rocm-version`. In practice it forces some paths to be within the `hip` and `roctracer` sub-directories, which is not the case when ROCm packages are installed in `/usr`, for example `/usr/lib64/libamdhip64.so`. The file tries a few path variants, which is nice as long as it does not assume a sub-directory. I would not be surprised if every library used such a sub-directory, but this applies to only about 2 of 12 libs.
- `xla` does not specify a dependency list. Reading all of those error messages and not ending up with a working setup is truly exhausting :face_exhaling:
- The only known successful builds use old `gcc` versions, which is a serious problem on prod machines.

Meanwhile, the `emerge` command returns:
Of course nobody expects support for `14.0.1_pre*` releases of GCC, but requiring a release at least 5 major versions back, excluding even the latest updates of the `9.x` branch, is a critical issue for prod machines.
Anyway, I have tried GCC versions `8.5` and `13.2.1` with clang versions `16` and `17`, but none of them compiled successfully.
First, the logs before fixing `rocm_configure.bzl`:
After the mentioned fix:
Somehow it does not detect `gcc` properly. Surprisingly, by default its specific location is not in the `PATH` variable:
The final result is:
However, the header files already exist within the `gcc` installation: `/usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13`. What surprised me is the large number of `Loading:` lines without any other information. In the last build attempt the number of such lines decreased to just 2. Maybe there are still 2 things that I have not found or installed?
So far I have been unmasking unsupported packages, compiling a few configurations of `gcc` and `clang`, and even editing source files. I'm a bit tired today and it would be a big relief if somebody could help me with this environment setup. Have I missed something? Are new AMD GPUs even supported? Or maybe there are other problems in the source files? Should I maybe try some unreleased branches?
Here is some information about my setup: