elixir-nx / xla

Pre-compiled XLA extension
Apache License 2.0

xla_extension failed encountered when trying to use exla in a Docker container #90

Open jeryldev opened 1 month ago

jeryldev commented 1 month ago

I get an xla_extension build failure when exla is compiled while building a Docker container. Here are some snippets from my Dockerfile:

ARG BUILDER_IMAGE="hexpm/elixir:1.14.0-erlang-24.0.1-debian-bullseye-20210902-slim"
ARG RUNNER_IMAGE="debian:bullseye-20210902-slim"

FROM ${BUILDER_IMAGE}

...

# install build dependencies
# https://github.com/elixir-nx/xla?tab=readme-ov-file#building-from-source
RUN apt-get update -y && apt-get install -y build-essential git apt-transport-https curl gnupg python3-pip gcc-9 g++-9 \
    && apt-get clean && rm -f /var/lib/apt/lists/*_*

RUN export CC=/usr/bin/gcc-9

# https://bazel.build/install/ubuntu#install-on-ubuntu
RUN curl -fsSL https://bazel.build/bazel-release.pub.gpg | gpg --dearmor >bazel-archive-keyring.gpg
RUN mv bazel-archive-keyring.gpg /usr/share/keyrings
RUN echo "deb [arch=amd64 signed-by=/usr/share/keyrings/bazel-archive-keyring.gpg] https://storage.googleapis.com/bazel-apt stable jdk1.8" | tee /etc/apt/sources.list.d/bazel.list
RUN apt-get update -y && apt-get install -y bazel-6.5.0
RUN ln -s /usr/bin/bazel-6.5.0 /usr/bin/bazel

RUN pip install numpy

...
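
A side note on the `RUN export CC=/usr/bin/gcc-9` line in the snippet above: each `RUN` instruction executes in its own shell, so an `export` made in one step is not visible in later steps. A Dockerfile `ENV CC=/usr/bin/gcc-9` instruction would persist instead. The shell sketch below imitates this with separate `sh -c` invocations standing in for separate `RUN` steps. It is only an illustration and is not part of the original Dockerfile.

```shell
#!/bin/sh
# Start from a clean slate so the demonstration does not depend on the caller's environment.
unset CC

# "Step 1": export CC inside a child shell, like `RUN export CC=/usr/bin/gcc-9`.
sh -c 'export CC=/usr/bin/gcc-9'

# "Step 2": a new child shell does not see the export made in step 1.
without_env=$(sh -c 'echo "CC=${CC:-unset}"')
echo "$without_env"

# Rough equivalent of `ENV CC=/usr/bin/gcc-9`: the variable is part of the
# environment that the later step inherits.
with_env=$(CC=/usr/bin/gcc-9 sh -c 'echo "CC=${CC:-unset}"')
echo "$with_env"
```

The first `echo` reports CC as unset and the second reports the gcc-9 path.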

I get this error when I build from the Dockerfile:

[4,467 / 5,843] Compiling mlir/lib/Dialect/LLVMIR/IR/LLVMDialect.cpp; 134s local ... (16 actions, 15 running)
[4,468 / 5,843] Compiling mlir/lib/Dialect/LLVMIR/IR/LLVMDialect.cpp; 136s local ... (16 actions running)
[4,469 / 5,843] Compiling mlir/lib/Dialect/LLVMIR/IR/LLVMDialect.cpp; 137s local ... (16 actions running)
[4,470 / 5,843] Compiling mlir/lib/Dialect/LLVMIR/IR/LLVMDialect.cpp; 139s local ... (16 actions running)
[4,470 / 5,843] Compiling mlir/lib/Dialect/LLVMIR/IR/LLVMDialect.cpp; 210s local ... (16 actions running)
ERROR: /home/user/.cache/bazel/_bazel_user/ee4c0f1833dfaa435cb867c88f5a190e/external/llvm-project/mlir/BUILD.bazel:4925:11: Compiling mlir/lib/Dialect/LLVMIR/IR/LLVMDialect.cpp failed: (Exit 1): gcc failed: error executing command (from target @llvm-project//mlir:LLVMDialect) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 85 arguments skipped)
gcc: fatal error: Killed signal terminated program cc1plus
compilation terminated.
Target //xla/extension:xla_extension failed to build
Use --verbose_failures to see the command lines of failed build steps.
[4,487 / 5,843] checking cached actions
INFO: Elapsed time: 1131.980s, Critical Path: 278.37s
INFO: 4487 processes: 343 internal, 4144 local.
FAILED: Build did NOT complete successfully
make: *** [Makefile:26: /home/user/.cache/xla/0.6.0/cache/build/xla_extension-x86_64-linux-gnu-cpu.tar.gz] Error 1
could not compile dependency :xla, "mix compile" failed. Errors may have been logged above. You can recompile this dependency with "mix deps.compile xla", update it with "mix deps.update xla" or clean it with "mix deps.clean xla"
==> lai
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. If you are using
Ubuntu or any other Debian-based system, install the packages
"build-essential". Also install "erlang-dev" package if not
included in your Erlang/OTP version. If you're on Fedora, run
"dnf group install 'Development Tools'".

I only encounter this issue when trying to build a docker container. I do not encounter any issues when I run mix phx.server. Do we have an official Dockerfile sample for cases where docker container setup is required?

jonatanklosko commented 1 month ago

Is there a reason you are trying to build XLA from source, rather than use the precompiled binaries?

We use these dockerfiles for precompilation, so those instructions should work.

jeryldev commented 1 month ago

Ideally, we would prefer not to build the extension from source. I noticed that xla gets built from source when we add exla to our dependencies. Here are the dependencies we've added along with exla:

      {:bumblebee, "~> 0.5.3"},
      {:nx, "~> 0.7.3"},
      {:exla, "~> 0.7.3"},
      {:explorer, "~> 0.9.0"}

We did not add xla to our list of dependencies, but somehow it gets pulled in (maybe because it's part of Nx). Do you have a sample Dockerfile that we could use as a basis when using Bumblebee, Nx, and Exla, without triggering a build of XLA from source? Our main goal for now is to be able to run Nx and Exla in a Docker container. 👍

josevalim commented 1 month ago

By default it will download a precompiled version. Does it print anything saying it can't use a precompiled binary and therefore must compile from source?

jeryldev commented 1 month ago

I think it did. Here are some screenshots from today after removing the precompile steps in my Dockerfile

image

image

josevalim commented 1 month ago

Do you have XLA_BUILD set by any chance?

jeryldev commented 1 month ago

I did not set it anywhere (.bash_profile, Dockerfile, etc.). Based on the README.md, it is set to false by default.

jonatanklosko commented 1 month ago

The build should trigger only when XLA_BUILD is set, otherwise it either downloads a precompiled binary or, if not available, raises an error.

One way to check would be to add RUN [ -z "$XLA_BUILD" ] || exit 1 before the compilation step and see if it goes on.
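
For reference, `[ -z "$XLA_BUILD" ]` succeeds only when the variable is unset or empty, so any non-empty value (including `false`) would make the suggested step exit with status 1. A minimal shell sketch of the same test follows. The function name `xla_build_unset` is made up for illustration. Keep in mind that a `RUN` step only sees the environment of the image build, not variables injected when a container is started.

```shell
#!/bin/sh
# Succeeds (exit status 0) only when XLA_BUILD is unset or empty,
# mirroring the `[ -z "$XLA_BUILD" ]` test suggested above.
xla_build_unset() {
  [ -z "${XLA_BUILD:-}" ]
}

if xla_build_unset; then
  echo "XLA_BUILD is not set"
else
  echo "XLA_BUILD is set to: $XLA_BUILD"
fi
```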

polvalente commented 1 month ago

I did notice the image uses a rather outdated combo of Elixir and OTP, as well as an older Debian. If possible, I'd update to rule out the compilation being triggered by a missing precompiled archive for that version/platform.

jeryldev commented 1 month ago

It still went through 🥲

[+] Building 2.7s (14/14) FINISHED                                                                                      docker:default
 => [api internal] load build definition from Dockerfile                                                                          0.0s
 => => transferring dockerfile: 4.75kB                                                                                            0.0s
 => [api internal] load metadata for docker.io/hexpm/elixir:1.14.0-erlang-24.0.1-debian-bullseye-20210902-slim                    2.0s
 => [api auth] hexpm/elixir:pull token for registry-1.docker.io                                                                   0.0s
 => [api internal] load .dockerignore                                                                                             0.0s
 => => transferring context: 1.31kB                                                                                               0.0s
 => [api 1/8] FROM docker.io/hexpm/elixir:1.14.0-erlang-24.0.1-debian-bullseye-20210902-slim@sha256:02ed2d3f2e0360821017751464a6  0.0s
 => CACHED [api 2/8] RUN addgroup --gid 1000 user &&     adduser --disabled-password --ingroup user --uid 1000 user               0.0s
 => CACHED [api 3/8] RUN apt-get update -y && apt-get install -y build-essential git curl     && apt-get clean && rm -f /var/lib  0.0s
 => CACHED [api 4/8] RUN mkdir -p /home/user/app &&     sh -c "git config --global url."https://${GITHUB_API_TOKEN}@github.com/"  0.0s
 => CACHED [api 5/8] WORKDIR /home/user/app                                                                                       0.0s
 => CACHED [api 6/8] RUN mix local.hex --force &&     mix local.rebar --force                                                     0.0s
 => CACHED [api 7/8] RUN mix do local.hex --force, local.rebar --force                                                            0.0s
 => [api 8/8] RUN [ -z "$XLA_BUILD" ] || exit 1                                                                                   0.4s
 => [api] exporting to image                                                                                                      0.1s
 => => exporting layers                                                                                                           0.1s
 => => writing image sha256:4af8189528e21cd493cfe8a2b41e0303905e614e6fe1526f3ceab03627094dab                                      0.0s
 => => naming to docker.io/library/lai-service-api                                                                                0.0s
 => [api] resolving provenance for metadata file                                                                                  0.0s
WARN[0000] /home/jde/code/la/lai-service/docker-compose.yaml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion 
[+] Creating 2/2
 ✔ Network lai-service_default  Created                                                                                           0.1s 
 ✔ Container lai-service-db-1   Created                                                                                           0.2s 
[+] Running 1/1
 ✔ Container lai-service-db-1  Started                                                                                            0.5s 
Resolving Hex dependencies...
Resolution completed in 0.753s
Unchanged:
  aws_rds_castore 1.2.0
  aws_signature 0.3.2
  axon 0.6.1
  bumblebee 0.5.3

.....

===> Analyzing applications...
===> Compiling telemetry
===> Analyzing applications...
===> Compiling telemetry_poller
===> Analyzing applications...
===> Compiling certifi
===> Analyzing applications...
===> Compiling hackney
==> xla
Compiling 2 files (.ex)
Generated xla app
mkdir -p /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb && \
        cd /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb && \
        git init && \
        git remote add origin https://github.com/openxla/xla.git && \
        git fetch --depth 1 origin 771e38178340cbaaef8ff20f44da5407c15092cb && \
        git checkout FETCH_HEAD && \
        rm /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelversion
hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: of your new repositories, which will suppress this warning, call:
hint: 
hint:   git config --global init.defaultBranch <name>
hint: 
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint: 
hint:   git branch -m <name>
Initialized empty Git repository in /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.git/
warning: redirecting to https://github.com/openxla/xla.git/
From https://github.com/openxla/xla
 * branch            771e38178340cbaaef8ff20f44da5407c15092cb -> FETCH_HEAD
Note: switching to 'FETCH_HEAD'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 771e381 [XLA:GPU] Check tensor_float_32_execution_enabled() in Triton codegen too
rm -f /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/xla/extension && \
        ln -s "/home/user/app/deps/xla/extension" /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/xla/extension && \
        cd /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb && \
        bazel build --define "framework_shared_object=false" -c opt    //xla/extension:xla_extension && \
        mkdir -p /home/user/.cache/xla/0.6.0/cache/build/ && \
        cp -f /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/bazel-bin/xla/extension/xla_extension.tar.gz /home/user/.cache/xla/0.6.0/cache/build/xla_extension-x86_64-linux-gnu-cpu.tar.gz
/bin/sh: 4: bazel: not found
make: *** [Makefile:26: /home/user/.cache/xla/0.6.0/cache/build/xla_extension-x86_64-linux-gnu-cpu.tar.gz] Error 127
could not compile dependency :xla, "mix compile" failed. Errors may have been logged above. You can recompile this dependency with "mix deps.compile xla", update it with "mix deps.update xla" or clean it with "mix deps.clean xla"
==> lai
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. If you are using
Ubuntu or any other Debian-based system, install the packages
"build-essential". Also install "erlang-dev" package if not
included in your Erlang/OTP version. If you're on Fedora, run
"dnf group install 'Development Tools'".

jonatanklosko commented 1 month ago

Interesting, I don't have any idea at the moment. It would be helpful if you could minimize it into a reproducible repo, like an empty mix project with the deps and the Dockerfile :)
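
One detail in the last output may be worth checking while preparing a reproduction. The `RUN [ -z "$XLA_BUILD" ]` step ran during the image build, whereas `Resolving Hex dependencies...` only appears after the containers were created. This suggests that the dependencies are compiled when the container starts, though that is an inference from the log and not confirmed in the thread. Variables supplied at that point, for example through a compose file, would not be seen by a build-time check. Below is a small shell sketch for listing any `XLA_`-prefixed variables in the environment where `mix` actually runs. The function name `list_xla_env` is hypothetical.

```shell
#!/bin/sh
# Print every environment variable whose name starts with XLA_,
# or a short notice when there is none.
list_xla_env() {
  env | grep '^XLA_' || echo "no XLA_* variables set"
}

list_xla_env
```

Running this in the same shell session, just before the `mix` command, shows whether `XLA_BUILD` or a related variable reaches the compilation step.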