elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)
Apache License 2.0

EXLA.NIF.start_log_sink/1 Issue - Works on Ubuntu but not on MacBook M2 #237

Closed gfdfranco closed 10 months ago

gfdfranco commented 10 months ago

I've encountered an issue with EXLA when trying to run it on my MacBook with the M2 chip. While my application works flawlessly on an Ubuntu computer, it runs into problems on my MacBook due to an issue related to EXLA.NIF.start_log_sink/1. The error suggests that the EXLA.NIF module hasn't been compiled properly or is unavailable.

Error: (Mix) Could not start application exla: EXLA.Application.start(:normal, []) returned an error: shutdown: failed to start child: EXLA.Logger
    (EXIT) an exception was raised:
        ** (UndefinedFunctionError) function EXLA.NIF.start_log_sink/1 is undefined (module EXLA.NIF is not available)
            (exla 0.6.0) EXLA.NIF.start_log_sink(#PID<0.430.0>)
            (exla 0.6.0) lib/exla/logger.ex:12: EXLA.Logger.init/1
            (stdlib 5.0.2) gen_server.erl:962: :gen_server.init_it/2
            (stdlib 5.0.2) gen_server.erl:917: :gen_server.init_it/6
            (stdlib 5.0.2) proc_lib.erl:241: :proc_lib.init_p_do_apply/3

I added config :nx, :default_backend, EXLA.Backend in my config.exs file.

[Screenshot: config.exs, 2023-09-07 15:31:50]

And added the packages in mix.exs:

[Screenshot: mix.exs dependencies, 2023-09-07 15:32:16]

I'm running this inside a Docker container based on the Elixir:latest image. Despite the environment being containerized, there seems to be a difference in behavior between the MacBook M2 and the Ubuntu machine. If needed, I can share the Dockerfile or any other relevant configuration.

josevalim commented 10 months ago

Can you show the whole compilation and boot logs? Either EXLA is not compiling or you are running EXLA on a different OS architecture than the one you compiled on. The message you see is the cascading failure from not being able to load EXLA.
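One way to check the second possibility is to compare the architecture the container reports with the architecture of the compiled NIF. The following is a minimal sketch, assuming a POSIX shell. The helper name `normalize_arch` is hypothetical and not part of EXLA or Mix.

```shell
# Hypothetical helper: map the different spellings of a CPU architecture
# to a single name so they can be compared.
normalize_arch() {
  case "$1" in
    aarch64|arm64)        echo arm64 ;;
    x86_64|x86-64|amd64)  echo x86_64 ;;
    *)                    echo "$1" ;;
  esac
}

# Architecture of the machine (or container) this shell is running in.
host_arch="$(normalize_arch "$(uname -m)")"
echo "running on: $host_arch"
```

If the `file` utility is installed, running it on the compiled EXLA library under `_build/dev/lib/exla/priv/` prints the architecture the library was built for, which can then be compared with the value above. A mismatch would suggest that `_build` was produced on a different machine, for example copied or mounted into the container from the host.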

gfdfranco commented 10 months ago

Apologies for the inconvenience earlier. The problem was linked to this specific error:

{:error, {:load_failed, ~c"Failed to load NIF library .../_build/dev/lib/exla/priv/libexla: '/usr/lib/aarch64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.13' not found (required by /.../_build/dev/lib/exla/priv/xla_extension/lib/libxla_extension.so)'"}}

I resolved it by updating gcc.

I've now encountered another error, but I believe we can consider this particular issue resolved and close it.

[Screenshot: the new error, 2023-09-07 19:36:08]

Valian commented 8 months ago

I'm hitting exactly the same problem. Using MacBook M1 Pro.

docker run --rm -it elixir:1.15 bash

iex

Mix.install(
  [
    {:bumblebee, "~> 0.4.2"},
    {:exla, "~> 0.6.1"}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)

Output:

[image not reproduced]

Checking supported versions, indeed it's not there:

strings /usr/lib/aarch64-linux-gnu/libstdc++.so.6 | grep 'CXXABI'
CXXABI_1.3
CXXABI_1.3.1
CXXABI_1.3.2
CXXABI_1.3.3
CXXABI_1.3.4
CXXABI_1.3.5
CXXABI_1.3.6
CXXABI_1.3.7
CXXABI_1.3.8
CXXABI_1.3.9
CXXABI_1.3.10
CXXABI_1.3.11
CXXABI_1.3.12
CXXABI_TM_1

@gfdfranco What were your steps to upgrade gcc? I tried apt-get update && apt-get upgrade gcc, but it didn't work.
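The manual check above can be turned into a small yes/no test, which is handy for confirming whether an upgrade actually brought in the missing version. This is a minimal sketch, assuming a POSIX shell. The helper name `has_abi_version` is hypothetical. It reads version names on stdin so it can be fed from `strings`.

```shell
# Hypothetical helper: read symbol-version names on stdin (one per line)
# and succeed only if the requested version is listed exactly.
# -x matches whole lines, so CXXABI_1.3.1 does not match CXXABI_1.3.13;
# -F treats the dots as literal characters rather than regex wildcards.
has_abi_version() {
  grep -qxF -- "$1"
}

# Example using the tail of the list printed above.
if printf '%s\n' CXXABI_1.3.11 CXXABI_1.3.12 CXXABI_TM_1 | has_abi_version CXXABI_1.3.13; then
  echo "CXXABI_1.3.13 available"
else
  echo "CXXABI_1.3.13 missing"
fi
```

Against the real library, the same helper would be used as `strings /usr/lib/aarch64-linux-gnu/libstdc++.so.6 | has_abi_version CXXABI_1.3.13`, assuming `strings` (from binutils) is installed in the image.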

gfdfranco commented 8 months ago

@Valian To solve "CXXABI_1.3.13 not found", I changed my Dockerfile to something like this:

FROM elixir:latest

........ .......

RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && \
    echo $TZ > /etc/timezone && \
    apt-get update -y && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y \
    curl wget build-essential postgresql-client \
    python3 python3-pip python3-dev libsm6 libxext6 git zip unzip nano inotify-tools \
    screen sl ffmpeg libstdc++6 && \
    echo "deb http://deb.debian.org/debian testing main" > /etc/apt/sources.list.d/testing.list && \
    apt-get update -y && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y \
    libclang-13-dev clang-13 g++-13 gcc-13 && \
    update-alternatives --install /usr/bin/cc cc /usr/bin/clang-13 100 && \
    update-alternatives --install /usr/bin/c++ c++ /usr/bin/clang++-13 100 && \
    update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-13 100 && \
    update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-13 100 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

....... .......

Valian commented 8 months ago

Thank you! It was really useful! 🥳

Minimal solution that I found:

echo "deb http://deb.debian.org/debian testing main" > /etc/apt/sources.list.d/testing.list \
    && apt-get update \
    && apt-get install -y gcc-13

Valian commented 8 months ago

Sadly it's not enough... What you said @gfdfranco helped with compilation (no errors anymore), but when trying to run any model it gives me a bunch of errors:

[info] TfrtCpuClient created.
'+fp-armv8' is not a recognized feature for this target (ignoring feature)
'+lse' is not a recognized feature for this target (ignoring feature)
'+neon' is not a recognized feature for this target (ignoring feature)
'+crc' is not a recognized feature for this target (ignoring feature)
'+crypto' is not a recognized feature for this target (ignoring feature)
'+fp-armv8' is not a recognized feature for this target (ignoring feature)
'+lse' is not a recognized feature for this target (ignoring feature)
'+neon' is not a recognized feature for this target (ignoring feature)
'+crc' is not a recognized feature for this target (ignoring feature)
'+crypto' is not a recognized feature for this target (ignoring feature)
'+fp-armv8' is not a recognized feature for this target (ignoring feature)
'+lse' is not a recognized feature for this target (ignoring feature)
'+neon' is not a recognized feature for this target (ignoring feature)
'+crc' is not a recognized feature for this target (ignoring feature)
'+crypto' is not a recognized feature for this target (ignoring feature)
'+fp-armv8' is not a recognized feature for this target (ignoring feature)
'+lse' is not a recognized feature for this target (ignoring feature)
'+neon' is not a recognized feature for this target (ignoring feature)
'+crc' is not a recognized feature for this target (ignoring feature)
'+crypto' is not a recognized feature for this target (ignoring feature)

even when using your snippet verbatim. I found your thread on the Elixir Forum, where you reported the same issue. Were you able to fix it?