elixir-nx / xla

Pre-compiled XLA extension
Apache License 2.0

Unable to run inside Docker on Macbook M1 #59

Closed Valian closed 8 months ago

Valian commented 8 months ago

First of all, thanks for the amazing work! Running my ML workload in Elixir is amazingly simple.

For development I'm using a MacBook M1 + Docker; Docker is quite useful for sharing a development environment. I'm hitting a pretty weird error when installing EXLA in this setup. It was already reported in a separate thread (https://github.com/elixir-nx/bumblebee/issues/237), but I think it fits better here.

Installing XLA fails. I managed to overcome the compilation error, but then I hit another one causing a SEGFAULT. I've already spent quite some time trying to solve it using various approaches from the internet, but haven't found a solution yet.

Reproduction

docker run --rm -it elixir:1.15 bash

iex

Mix.install(
  [{:bumblebee, "~> 0.4.2"}, {:exla, "~> 0.6.1"}],
  config: [nx: [default_backend: EXLA.Backend]]
)

output

(screenshot of the installation error)

Checking supported versions, indeed it's not there:

strings /usr/lib/aarch64-linux-gnu/libstdc++.so.6 | grep 'CXXABI'
CXXABI_1.3
CXXABI_1.3.1
CXXABI_1.3.2
CXXABI_1.3.3
CXXABI_1.3.4
CXXABI_1.3.5
CXXABI_1.3.6
CXXABI_1.3.7
CXXABI_1.3.8
CXXABI_1.3.9
CXXABI_1.3.10
CXXABI_1.3.11
CXXABI_1.3.12
CXXABI_TM_1

Updating gcc fixes the compilation error, but then another one happens:

echo "deb http://deb.debian.org/debian testing main" > /etc/apt/sources.list.d/testing.list \
    && apt-get update \
    && apt-get install -y libclang-13-dev clang-13 g++-13 gcc-13

iex

# now works 
Mix.install(
  [{:bumblebee, "~> 0.4.2"}, {:exla, "~> 0.6.1"}],
  config: [nx: [default_backend: EXLA.Backend]]
)

# but this fails
model_name = "thenlper/gte-small"
{:ok, model_info} = Bumblebee.load_model({:hf, model_name})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, model_name})
serving = Bumblebee.Text.TextEmbedding.text_embedding(model_info, tokenizer)

text = "Cats are cute."
Nx.Serving.run(serving, text)

error:

[info] TfrtCpuClient created.
'+fp-armv8' is not a recognized feature for this target (ignoring feature)
'+lse' is not a recognized feature for this target (ignoring feature)
'+neon' is not a recognized feature for this target (ignoring feature)
'+crc' is not a recognized feature for this target (ignoring feature)
'+crypto' is not a recognized feature for this target (ignoring feature)
(the five lines above repeat three more times)

Any solution / workaround? I tried compiling XLA from source, but it was not clear how to do it inside Docker (I had trouble installing Bazel).
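For the Bazel part, one approach that may help inside a Debian-based image is bazelisk, which downloads a matching Bazel release on first run. The release asset name below follows bazelisk's published naming scheme for linux/arm64 and is an assumption on my part, so this is just a sketch:

```shell
# Install bazelisk as `bazel`; it fetches the Bazel version the project asks for
curl -fsSL -o /usr/local/bin/bazel \
  https://github.com/bazelbuild/bazelisk/releases/latest/download/bazelisk-linux-arm64
chmod +x /usr/local/bin/bazel
bazel --version
```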

gfdfranco commented 8 months ago

Unfortunately, I'm encountering the same issue. It works correctly on the MacBook Air M2, but when I run the application built in Docker, it crashes with the same error message :disappointed:

josevalim commented 8 months ago

I am wondering whether this is an emulation issue, because it works on my old Mac but not on the new one. Maybe we could use an arm image? Is that possible?

Valian commented 8 months ago

I think it might be related to how XLA is compiled for the arm platform. Docker on M1/M2 uses the linux/arm64 platform, if I'm not mistaken.

I tried emulating linux/amd64 (it's possible through a Docker CLI flag, it just runs much slower), but hit some other weird errors...
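For reference, the CLI flag in question is `--platform`; a sketch of both variants against the image from the reproduction above:

```shell
# Run natively on Apple Silicon (Docker's default there)
docker run --rm -it --platform linux/arm64 elixir:1.15 bash

# Force amd64 emulation (QEMU/Rosetta) -- works, but much slower
docker run --rm -it --platform linux/amd64 elixir:1.15 bash
```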

What's crazy is that it works just fine on the MacBook without Docker, or on Linux with Docker. Never in my ~8 years with Docker have I had such problems.

jonatanklosko commented 8 months ago

It looks like the cross-compiled binary we build on CI is not really reliable, and the glibc version it requires is definitely too high.

I adjusted the Docker builds to work with arm and built the binary off-CI. I replaced the precompiled binary for 0.5.1, since it seemed unusable anyway. It works in Docker on M1 for me. @Valian, please try again and let me know if you can run it without issues now.
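For anyone still stuck on an affected release: the xla package documents an `XLA_BUILD` environment variable that forces compiling from source instead of downloading a precompiled archive (it requires Bazel on the PATH). A rough sketch of the steps, assuming a standard Mix project:

```shell
# Force the xla package to build from source on the next compile
export XLA_BUILD=true

# Drop any previously fetched artifacts, then rebuild (this is slow)
mix deps.clean xla exla
mix deps.get
mix compile
```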

Valian commented 8 months ago

@jonatanklosko Amazing, everything is working now! 👍 🥇 Thank you!

jonatanklosko commented 8 months ago

Awesome, thanks for confirming :)

gfdfranco commented 8 months ago

Wow, it's working 🤘 thank you @jonatanklosko!