elixir-nx / xla

Pre-compiled XLA extension
Apache License 2.0

Build error with cuda11.8 #24

Closed · masahiro-999 closed 1 year ago

masahiro-999 commented 1 year ago

I tried to build using the method described in https://github.com/elixir-nx/xla/tree/main/.github/builds, but got a build error.

$ diff -u Dockerfile.cuda111 Dockerfile.cuda118
--- Dockerfile.cuda111  2022-09-24 15:33:43.138428052 +0900
+++ Dockerfile.cuda118  2022-10-16 15:32:44.698932675 +0900
@@ -1,4 +1,4 @@
-FROM nvidia/cuda:11.1.1-cudnn8-devel-ubuntu18.04
+FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu18.04

 # Set the missing utf-8 locales, otherwise Elixir warns
 ENV LANG en_US.UTF-8
$ docker build -t xla-cuda118 -f Dockerfile.cuda118 .
$ docker run -it -v $(pwd)/build:/build xla-cuda118
Cloning into 'xla'...
remote: Enumerating objects: 29, done.
remote: Counting objects: 100% (29/29), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 29 (delta 1), reused 15 (delta 0), pack-reused 0
Receiving objects: 100% (29/29), 19.79 KiB | 19.79 MiB/s, done.
Resolving deltas: 100% (1/1), done.
Resolving Hex dependencies...
Dependency resolution completed:
Unchanged:
  earmark_parser 1.4.15
  elixir_make 0.6.2
  ex_doc 0.25.2
  makeup 1.0.5
  makeup_elixir 0.15.1
  makeup_erlang 0.1.1
  nimble_parsec 1.1.0
* Getting elixir_make (Hex package)
* Getting ex_doc (Hex package)
* Getting earmark_parser (Hex package)
* Getting makeup_elixir (Hex package)
* Getting makeup_erlang (Hex package)
* Getting makeup (Hex package)
* Getting nimble_parsec (Hex package)
==> earmark_parser
Compiling 1 file (.yrl)
Compiling 2 files (.xrl)
Compiling 3 files (.erl)
Compiling 32 files (.ex)
Generated earmark_parser app
==> nimble_parsec
Compiling 4 files (.ex)
Generated nimble_parsec app
==> makeup
Compiling 44 files (.ex)
Generated makeup app
==> makeup_elixir
Compiling 6 files (.ex)
Generated makeup_elixir app
==> makeup_erlang
Compiling 3 files (.ex)
Generated makeup_erlang app
==> ex_doc
Compiling 26 files (.ex)
Generated ex_doc app
==> elixir_make
Compiling 1 file (.ex)
Generated elixir_make app
==> xla
Compiling 2 files (.ex)
Generated xla app
mkdir -p /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e && \
        cd /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e && \
        git init && \
        git remote add origin https://github.com/tensorflow/tensorflow.git && \
        git fetch --depth 1 origin 3f878cff5b698b82eea85db2b60d65a2e320850e && \
        git checkout FETCH_HEAD && \
        rm /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/.bazelversion
hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: of your new repositories, which will suppress this warning, call:
hint:
hint:   git config --global init.defaultBranch <name>
hint:
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint:
hint:   git branch -m <name>
Initialized empty Git repository in /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/.git/
From https://github.com/tensorflow/tensorflow
 * branch              3f878cff5b698b82eea85db2b60d65a2e320850e -> FETCH_HEAD
Note: switching to 'FETCH_HEAD'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 3f878cff Merge pull request #54226 from tensorflow-jenkins/version-numbers-2.8.0-22199
rm -f /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/tensorflow/compiler/xla/extension && \
        ln -s "/xla/extension" /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/tensorflow/compiler/xla/extension && \
        cd /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e && \
        bazel build --define "framework_shared_object=false" -c opt   --config=cuda //tensorflow/compiler/xla/extension:xla_extension && \
        mkdir -p /build/0.3.0/cache/build/ && \
        cp -f /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/bazel-bin/tensorflow/compiler/xla/extension/xla_extension.tar.gz /build/0.3.0/cache/build/xla_extension-x86_64-linux-cuda111.tar.gz
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'build' from /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/.bazelrc:
  Inherited 'common' options: --experimental_repo_remote_exec
INFO: Reading rc options for 'build' from /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/.bazelrc:
  'build' options: --define framework_shared_object=true --java_toolchain=@tf_toolchains//toolchains/java:tf_java_toolchain --host_java_toolchain=@tf_toolchains//toolchains/java:tf_java_toolchain --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --announce_rc --define=grpc_no_ares=true --noincompatible_remove_legacy_whole_archive --enable_platform_specific_config --define=with_xla_support=true --config=short_logs --config=v2 --define=no_aws_support=true --define=no_hdfs_support=true --experimental_cc_shared_library --deleted_packages=tensorflow/compiler/mlir/tfrt,tensorflow/compiler/mlir/tfrt/benchmarks,tensorflow/compiler/mlir/tfrt/jit/python_binding,tensorflow/compiler/mlir/tfrt/jit/transforms,tensorflow/compiler/mlir/tfrt/python_tests,tensorflow/compiler/mlir/tfrt/tests,tensorflow/compiler/mlir/tfrt/tests/analysis,tensorflow/compiler/mlir/tfrt/tests/jit,tensorflow/compiler/mlir/tfrt/tests/lhlo_to_tfrt,tensorflow/compiler/mlir/tfrt/tests/tf_to_corert,tensorflow/compiler/mlir/tfrt/tests/tf_to_tfrt_data,tensorflow/compiler/mlir/tfrt/tests/saved_model,tensorflow/compiler/mlir/tfrt/transforms/lhlo_gpu_to_tfrt_gpu,tensorflow/core/runtime_fallback,tensorflow/core/runtime_fallback/conversion,tensorflow/core/runtime_fallback/kernel,tensorflow/core/runtime_fallback/opdefs,tensorflow/core/runtime_fallback/runtime,tensorflow/core/runtime_fallback/util,tensorflow/core/tfrt/common,tensorflow/core/tfrt/eager,tensorflow/core/tfrt/eager/backends/cpu,tensorflow/core/tfrt/eager/backends/gpu,tensorflow/core/tfrt/eager/core_runtime,tensorflow/core/tfrt/eager/cpp_tests/core_runtime,tensorflow/core/tfrt/fallback,tensorflow/core/tfrt/gpu,tensorflow/core/tfrt/run_handler_thread_pool,tensorflow/core/tfrt/runtime,tensorflow/core/tfrt/saved_model,tensorflow/core/tfrt/saved_model/tests,tensorflow/core/tfrt/tpu,tensorflow/core/tfrt/utils
INFO: Found applicable config definition build:short_logs in file /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
INFO: Found applicable config definition build:v2 in file /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition build:cuda in file /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/.bazelrc: --repo_env TF_NEED_CUDA=1 --crosstool_top=@local_config_cuda//crosstool:toolchain --@local_config_cuda//:enable_cuda
INFO: Found applicable config definition build:linux in file /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/.bazelrc: --copt=-w --host_copt=-w --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++14 --host_cxxopt=-std=c++14 --config=dynamic_kernels --distinct_host_configuration=false --experimental_guard_against_concurrent_changes
INFO: Found applicable config definition build:dynamic_kernels in file /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/tensorflow/runtime/archive/c3e082762b7664bbc7ffd2c39e86464928e27c0c.tar.gz failed: class com.google.devtools.build.lib.bazel.repository.downloader.UnrecoverableHttpException GET returned 404 Not Found
INFO: Repository local_config_cuda instantiated at:
  /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/WORKSPACE:15:14: in <toplevel>
  /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/tensorflow/workspace2.bzl:878:19: in workspace
  /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/tensorflow/workspace2.bzl:96:19: in _tf_toolchains
Repository rule cuda_configure defined at:
  /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/third_party/gpus/cuda_configure.bzl:1448:33: in <toplevel>
ERROR: An error occurred during the fetch of repository 'local_config_cuda':
   Traceback (most recent call last):
        File "/root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/third_party/gpus/cuda_configure.bzl", line 1401, column 38, in _cuda_autoconf_impl
                _create_local_cuda_repository(repository_ctx)
        File "/root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/third_party/gpus/cuda_configure.bzl", line 1076, column 27, in _create_local_cuda_repository
                cuda_libs = _find_libs(repository_ctx, check_cuda_libs_script, cuda_config)
        File "/root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/third_party/gpus/cuda_configure.bzl", line 606, column 21, in _find_libs
                _check_cuda_libs(repository_ctx, check_cuda_libs_script, check_cuda_libs_params.values())
        File "/root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/third_party/gpus/cuda_configure.bzl", line 501, column 28, in _check_cuda_libs
                checked_paths = execute(repository_ctx, [python_bin, "-c", cmd]).stdout.splitlines()
        File "/root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/third_party/remote_config/common.bzl", line 230, column 13, in execute
                fail(
Error in fail: Repository command failed
Expected even number of arguments
ERROR: Error fetching repository: Traceback (most recent call last):
        File "/root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/third_party/gpus/cuda_configure.bzl", line 1401, column 38, in _cuda_autoconf_impl
                _create_local_cuda_repository(repository_ctx)
        File "/root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/third_party/gpus/cuda_configure.bzl", line 1076, column 27, in _create_local_cuda_repository
                cuda_libs = _find_libs(repository_ctx, check_cuda_libs_script, cuda_config)
        File "/root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/third_party/gpus/cuda_configure.bzl", line 606, column 21, in _find_libs
                _check_cuda_libs(repository_ctx, check_cuda_libs_script, check_cuda_libs_params.values())
        File "/root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/third_party/gpus/cuda_configure.bzl", line 501, column 28, in _check_cuda_libs
                checked_paths = execute(repository_ctx, [python_bin, "-c", cmd]).stdout.splitlines()
        File "/root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/third_party/remote_config/common.bzl", line 230, column 13, in execute
                fail(
Error in fail: Repository command failed
Expected even number of arguments
INFO: Found applicable config definition build:cuda in file /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/.bazelrc: --repo_env TF_NEED_CUDA=1 --crosstool_top=@local_config_cuda//crosstool:toolchain --@local_config_cuda//:enable_cuda
ERROR: @local_config_cuda//:enable_cuda :: Error loading option @local_config_cuda//:enable_cuda: Repository command failed
Expected even number of arguments

Makefile:27: recipe for target '/build/0.3.0/cache/build/xla_extension-x86_64-linux-cuda111.tar.gz' failed
make: *** [/build/0.3.0/cache/build/xla_extension-x86_64-linux-cuda111.tar.gz] Error 2
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. If you are using
Ubuntu or any other Debian-based system, install the packages
"build-essential". Also install "erlang-dev" package if not
included in your Erlang/OTP version. If you're on Fedora, run
"dnf group install 'Development Tools'".
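The failing session above can be condensed into a short script. This is a sketch of the reproduction steps, assuming `Dockerfile.cuda111` from the repo's `.github/builds` directory is in the current directory; only the base image tag differs between the CUDA 11.1 and 11.8 builds:

```shell
#!/usr/bin/env sh
# Reproduce the CUDA 11.8 build attempt (sketch of the session logged above).

# Derive Dockerfile.cuda118 from the working CUDA 11.1 variant by swapping
# the base image tag; this matches the diff at the top of the report.
if [ -f Dockerfile.cuda111 ]; then
  sed 's|nvidia/cuda:11.1.1-cudnn8-devel-ubuntu18.04|nvidia/cuda:11.8.0-cudnn8-devel-ubuntu18.04|' \
    Dockerfile.cuda111 > Dockerfile.cuda118
fi

# Build the image and run the container with the build output directory
# mounted. The `bazel build ... --config=cuda` step inside the container
# then fails while configuring @local_config_cuda, as shown in the log.
if command -v docker >/dev/null 2>&1 && [ -f Dockerfile.cuda118 ]; then
  docker build -t xla-cuda118 -f Dockerfile.cuda118 .
  docker run -it -v "$(pwd)/build:/build" xla-cuda118
fi
```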
jonatanklosko commented 1 year ago

Hey @masahiro-999! As mentioned in #23, we intentionally build on top of CUDA 11.1; we haven't tested the build with other base Docker images, so unless we need to, I don't think this is an issue. We just need to address #23.
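For anyone landing here: if the goal is only to run on a CUDA 11.8 machine, the precompiled CUDA 11.1 archive may work as-is, since CUDA 11.x toolkits have minor-version compatibility. The snippet below assumes the `XLA_TARGET` environment variable described in the elixir-nx/xla README (check the README for the exact supported values):

```shell
# Select the precompiled CUDA build instead of compiling from source.
# XLA_TARGET is read by the xla package at compile time; "cuda111" is the
# precompiled CUDA 11.1 target (value taken from the project README and
# worth verifying there).
export XLA_TARGET=cuda111

# Recompile the dependency so the new target takes effect (assumes an
# Elixir project with :xla as a dependency).
if command -v mix >/dev/null 2>&1; then
  mix deps.get
  mix compile
fi
```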