elixir-nx / xla

Pre-compiled XLA extension
Apache License 2.0

Cannot build XLA with ROCM in the latest versions #58

Closed by costaraphael 7 months ago

costaraphael commented 8 months ago

Hey folks!

First of all thanks for the great work supporting Elixir's ML ecosystem! <3

I'm trying to set up a demo of GPU distribution at my company, but whenever I try to compile xla 0.5.1 with XLA_BUILD=true and XLA_TARGET=rocm I get the following error:

==> xla
Compiling 2 files (.ex)
Generated xla app
rm -f /root/.cache/xla_extension/xla-b938cfdf2d4e9a5f69c494a316e92638c1a119ef/xla/extension && \
        ln -s "/livebook/test_app/deps/xla/extension" /root/.cache/xla_extension/xla-b938cfdf2d4e9a5f69c494a316e92638c1a119ef/xla/extension && \
        cd /root/.cache/xla_extension/xla-b938cfdf2d4e9a5f69c494a316e92638c1a119ef && \
        bazel build --define "framework_shared_object=false" -c opt   --config=rocm --action_env=HIP_PLATFORM=hcc //xla/extension:xla_extension && \
        mkdir -p /root/.cache/xla/0.5.1/cache/build/ && \
        cp -f /root/.cache/xla_extension/xla-b938cfdf2d4e9a5f69c494a316e92638c1a119ef/bazel-bin/xla/extension/xla_extension.tar.gz /root/.cache/xla/0.5.1/cache/build/xla_extension-x86_64-linux-gnu-rocm.tar.gz
ERROR: Config value 'rocm' is not defined in any .rc file
make: *** [Makefile:26: /root/.cache/xla/0.5.1/cache/build/xla_extension-x86_64-linux-gnu-rocm.tar.gz] Error 2
could not compile dependency :xla, "mix compile" failed. Errors may have been logged above. You can recompile this dependency with "mix deps.compile xla --force", update it with "mix deps.update xla" or clean it with "mix deps.clean xla"
==> test_app
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. If you are using
Ubuntu or any other Debian-based system, install the packages
"build-essential". Also install "erlang-dev" package if not
included in your Erlang/OTP version. If you're on Fedora, run
"dnf group install 'Development Tools'".

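I did some digging, and I think it's because the openxla revision the Makefile points to (b938cfdf2d4e9a5f69c494a316e92638c1a119ef, going by the cache path above) doesn't define a rocm config in its .bazelrc, which would explain the "Config value 'rocm' is not defined in any .rc file" error.

The rocm config was apparently added back in this commit: https://github.com/openxla/xla/commit/98b61978674cb548b1af95caaf35620554b161c5.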
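A quick way to confirm this before starting a long build is to look for a `build:rocm` line in the `.bazelrc` of the checked-out revision. The snippet below is only a sketch: `has_bazel_config` is a hypothetical helper name, it only looks at `build:` lines, and the path in the example comment is the cache directory taken from the log above.

```shell
# Hypothetical helper: succeeds if the given bazelrc file contains a line that
# starts with "build:<name>" followed by whitespace, fails otherwise.
has_bazel_config() {
  rc_file=$1
  config=$2
  grep -Eq "^build:${config}[[:space:]]" "$rc_file"
}

# Example (path copied from the build log above):
#   has_bazel_config /root/.cache/xla_extension/xla-b938cfdf2d4e9a5f69c494a316e92638c1a119ef/.bazelrc rocm \
#     || echo "this revision has no build:rocm config"
```

If the check fails for the pinned revision but passes for a newer one, that would be consistent with the config having been introduced somewhere between the two.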

I tried forcing XLA to download the latest passing build of openxla (https://github.com/openxla/xla/commit/dba73eb7c7c6dbc589f3fe3334cabcbdebd53e57) by editing the Makefile locally, but it failed with the following log:

Starting local Bazel server and connecting to it...
INFO: Reading 'startup' options from /root/.cache/xla_extension/xla-dba73eb7c7c6dbc589f3fe3334cabcbdebd53e57/.bazelrc: --windows_enable_symlinks
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'build' from /root/.cache/xla_extension/xla-dba73eb7c7c6dbc589f3fe3334cabcbdebd53e57/.bazelrc:
  Inherited 'common' options: --experimental_repo_remote_exec
INFO: Reading rc options for 'build' from /root/.cache/xla_extension/xla-dba73eb7c7c6dbc589f3fe3334cabcbdebd53e57/.bazelrc:
  'build' options: --define framework_shared_object=true --define tsl_protobuf_header_only=true --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --announce_rc --define=grpc_no_ares=true --noincompatible_remove_legacy_whole_archive --features=-force_no_whole_archive --enable_platform_specific_config --define=with_xla_support=true --config=short_logs --config=v2 --define=no_aws_support=true --define=no_hdfs_support=true --experimental_cc_shared_library --experimental_link_static_libraries_once=false --incompatible_enforce_config_setting_visibility
INFO: Found applicable config definition build:short_logs in file /root/.cache/xla_extension/xla-dba73eb7c7c6dbc589f3fe3334cabcbdebd53e57/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
INFO: Found applicable config definition build:v2 in file /root/.cache/xla_extension/xla-dba73eb7c7c6dbc589f3fe3334cabcbdebd53e57/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition build:rocm in file /root/.cache/xla_extension/xla-dba73eb7c7c6dbc589f3fe3334cabcbdebd53e57/.bazelrc: --crosstool_top=@local_config_rocm//crosstool:toolchain --define=using_rocm_hipcc=true --define=tensorflow_mkldnn_contraction_kernel=0 --repo_env TF_NEED_ROCM=1 --config=no_tfrt
INFO: Found applicable config definition build:no_tfrt in file /root/.cache/xla_extension/xla-dba73eb7c7c6dbc589f3fe3334cabcbdebd53e57/.bazelrc: --deleted_packages=tensorflow/compiler/mlir/tfrt,tensorflow/compiler/mlir/tfrt/benchmarks,tensorflow/compiler/mlir/tfrt/ir,tensorflow/compiler/mlir/tfrt/ir/mlrt,tensorflow/compiler/mlir/tfrt/jit/python_binding,tensorflow/compiler/mlir/tfrt/jit/transforms,tensorflow/compiler/mlir/tfrt/python_tests,tensorflow/compiler/mlir/tfrt/tests,tensorflow/compiler/mlir/tfrt/tests/mlrt,tensorflow/compiler/mlir/tfrt/tests/ir,tensorflow/compiler/mlir/tfrt/tests/analysis,tensorflow/compiler/mlir/tfrt/tests/jit,tensorflow/compiler/mlir/tfrt/tests/lhlo_to_tfrt,tensorflow/compiler/mlir/tfrt/tests/lhlo_to_jitrt,tensorflow/compiler/mlir/tfrt/tests/tf_to_corert,tensorflow/compiler/mlir/tfrt/tests/tf_to_tfrt_data,tensorflow/compiler/mlir/tfrt/tests/saved_model,tensorflow/compiler/mlir/tfrt/transforms/lhlo_gpu_to_tfrt_gpu,tensorflow/compiler/mlir/tfrt/transforms/mlrt,tensorflow/core/runtime_fallback,tensorflow/core/runtime_fallback/conversion,tensorflow/core/runtime_fallback/kernel,tensorflow/core/runtime_fallback/opdefs,tensorflow/core/runtime_fallback/runtime,tensorflow/core/runtime_fallback/util,tensorflow/core/runtime_fallback/test,tensorflow/core/runtime_fallback/test/gpu,tensorflow/core/runtime_fallback/test/saved_model,tensorflow/core/runtime_fallback/test/testdata,tensorflow/core/tfrt/stubs,tensorflow/core/tfrt/tfrt_session,tensorflow/core/tfrt/mlrt,tensorflow/core/tfrt/mlrt/attribute,tensorflow/core/tfrt/mlrt/kernel,tensorflow/core/tfrt/mlrt/bytecode,tensorflow/core/tfrt/mlrt/interpreter,tensorflow/compiler/mlir/tfrt/translate/mlrt,tensorflow/compiler/mlir/tfrt/translate/mlrt/testdata,tensorflow/core/tfrt/gpu,tensorflow/core/tfrt/run_handler_thread_pool,tensorflow/core/tfrt/runtime,tensorflow/core/tfrt/saved_model,tensorflow/core/tfrt/graph_executor,tensorflow/core/tfrt/saved_model/tests,tensorflow/core/tfrt/tpu,tensorflow/core/tfrt/utils,
tensorflow/core/tfrt/utils/debug,tensorflow/core/tfrt/saved_model/python,tensorflow/core/tfrt/graph_executor/python,tensorflow/core/tfrt/saved_model/utils
INFO: Found applicable config definition build:linux in file /root/.cache/xla_extension/xla-dba73eb7c7c6dbc589f3fe3334cabcbdebd53e57/.bazelrc: --host_copt=-w --copt=-Wno-all --copt=-Wno-extra --copt=-Wno-deprecated --copt=-Wno-deprecated-declarations --copt=-Wno-ignored-attributes --copt=-Wno-array-bounds --copt=-Wunused-result --copt=-Werror=unused-result --copt=-Wswitch --copt=-Werror=switch --copt=-Wno-error=unused-but-set-variable --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --config=dynamic_kernels --experimental_guard_against_concurrent_changes
INFO: Found applicable config definition build:dynamic_kernels in file /root/.cache/xla_extension/xla-dba73eb7c7c6dbc589f3fe3334cabcbdebd53e57/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
Loading: 
DEBUG: /root/.cache/xla_extension/xla-dba73eb7c7c6dbc589f3fe3334cabcbdebd53e57/third_party/repo.bzl:132:14: 
Warning: skipping import of repository 'tf_runtime' because it already exists.
DEBUG: /root/.cache/xla_extension/xla-dba73eb7c7c6dbc589f3fe3334cabcbdebd53e57/third_party/repo.bzl:132:14: 
Warning: skipping import of repository 'llvm-raw' because it already exists.
Loading: 
Loading: 0 packages loaded
Loading: 0 packages loaded
    currently loading: xla/extension
INFO: Repository local_config_rocm instantiated at:
  /root/.cache/xla_extension/xla-dba73eb7c7c6dbc589f3fe3334cabcbdebd53e57/WORKSPACE:19:15: in <toplevel>
  /root/.cache/xla_extension/xla-dba73eb7c7c6dbc589f3fe3334cabcbdebd53e57/workspace2.bzl:90:19: in workspace
  /root/.cache/bazel/_bazel_root/d366f579e16ea48eba76171a9c9ace02/external/tsl/workspace2.bzl:624:19: in workspace
  /root/.cache/bazel/_bazel_root/d366f579e16ea48eba76171a9c9ace02/external/tsl/workspace2.bzl:78:19: in _tf_toolchains
Repository rule rocm_configure defined at:
  /root/.cache/bazel/_bazel_root/d366f579e16ea48eba76171a9c9ace02/external/tsl/third_party/gpus/rocm_configure.bzl:832:33: in <toplevel>
Loading: 0 packages loaded
    currently loading: xla/extension
ERROR: An error occurred during the fetch of repository 'local_config_rocm':
   Traceback (most recent call last):
        File "/root/.cache/bazel/_bazel_root/d366f579e16ea48eba76171a9c9ace02/external/tsl/third_party/gpus/rocm_configure.bzl", line 810, column 38, in _rocm_autoconf_impl
                _create_local_rocm_repository(repository_ctx)
        File "/root/.cache/bazel/_bazel_root/d366f579e16ea48eba76171a9c9ace02/external/tsl/third_party/gpus/rocm_configure.bzl", line 546, column 35, in _create_local_rocm_repository
                rocm_config = _get_rocm_config(repository_ctx, bash_bin)
        File "/root/.cache/bazel/_bazel_root/d366f579e16ea48eba76171a9c9ace02/external/tsl/third_party/gpus/rocm_configure.bzl", line 393, column 30, in _get_rocm_config
                config = find_rocm_config(repository_ctx)
        File "/root/.cache/bazel/_bazel_root/d366f579e16ea48eba76171a9c9ace02/external/tsl/third_party/gpus/rocm_configure.bzl", line 371, column 26, in find_rocm_config
                exec_result = execute(repository_ctx, [python_bin, repository_ctx.attr._find_rocm_config])
        File "/root/.cache/bazel/_bazel_root/d366f579e16ea48eba76171a9c9ace02/external/tsl/third_party/remote_config/common.bzl", line 230, column 13, in execute
                fail(
Error in fail: Repository command failed
ERROR: MIOpen version file "None" not found
ERROR: /root/.cache/xla_extension/xla-dba73eb7c7c6dbc589f3fe3334cabcbdebd53e57/WORKSPACE:19:15: fetching rocm_configure rule //external:local_config_rocm: Traceback (most recent call last):
        File "/root/.cache/bazel/_bazel_root/d366f579e16ea48eba76171a9c9ace02/external/tsl/third_party/gpus/rocm_configure.bzl", line 810, column 38, in _rocm_autoconf_impl
                _create_local_rocm_repository(repository_ctx)
        File "/root/.cache/bazel/_bazel_root/d366f579e16ea48eba76171a9c9ace02/external/tsl/third_party/gpus/rocm_configure.bzl", line 546, column 35, in _create_local_rocm_repository
                rocm_config = _get_rocm_config(repository_ctx, bash_bin)
        File "/root/.cache/bazel/_bazel_root/d366f579e16ea48eba76171a9c9ace02/external/tsl/third_party/gpus/rocm_configure.bzl", line 393, column 30, in _get_rocm_config
                config = find_rocm_config(repository_ctx)
        File "/root/.cache/bazel/_bazel_root/d366f579e16ea48eba76171a9c9ace02/external/tsl/third_party/gpus/rocm_configure.bzl", line 371, column 26, in find_rocm_config
                exec_result = execute(repository_ctx, [python_bin, repository_ctx.attr._find_rocm_config])
        File "/root/.cache/bazel/_bazel_root/d366f579e16ea48eba76171a9c9ace02/external/tsl/third_party/remote_config/common.bzl", line 230, column 13, in execute
                fail(
Error in fail: Repository command failed
ERROR: MIOpen version file "None" not found
ERROR: Skipping '//xla/extension:xla_extension': no such package '@local_config_rocm//rocm': Repository command failed
ERROR: MIOpen version file "None" not found
WARNING: Target pattern parsing failed.
ERROR: no such package '@local_config_rocm//rocm': Repository command failed
ERROR: MIOpen version file "None" not found
INFO: Elapsed time: 39.473s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
make: *** [Makefile:26: /root/.cache/xla/0.5.1/cache/build/xla_extension-x86_64-linux-gnu-rocm.tar.gz] Error 1

So my guess is that it's not going to be as simple as just pointing to the new version 😅
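The repeated `MIOpen version file "None" not found` line suggests the ROCm auto-configuration step could not locate a MIOpen version header on the build machine. A rough sketch for checking that by hand follows. It assumes ROCm is installed under `/opt/rocm` or wherever `ROCM_PATH` points, and that the header is named `miopen/version.h`; neither is confirmed by the log, and `find_miopen_version_header` is a hypothetical helper name.

```shell
# Hypothetical helper: print the first miopen/version.h found under the given
# root (following symlinks), or fail if there is none.
find_miopen_version_header() {
  root=$1
  header=$(find -L "$root" -type f -path '*/miopen/version.h' 2>/dev/null | head -n 1)
  [ -n "$header" ] && printf '%s\n' "$header"
}

# Example (the default root is an assumption):
#   find_miopen_version_header "${ROCM_PATH:-/opt/rocm}" \
#     || echo "MIOpen headers not found; is MIOpen installed in this image?"
```

If nothing is found, the build environment is probably missing the MIOpen development files rather than the openxla revision being at fault.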

I'd love to help get this sorted, but I'll need some pointers on where to start looking.

jonatanklosko commented 7 months ago

It probably is not applicable after boot, but warning on specific env vars is probably too specific. Another approach in Livebook is to use System.put_env (for env vars that don't affect deps installation) and that wouldn't work either.