elixir-nx / nx

Multi-dimensional arrays (tensors) and numerical definitions for Elixir

EXLA Failed to load NIF library #1490

Closed victor23k closed 2 months ago

victor23k commented 6 months ago

Description

Hi there! I'm having some trouble using the EXLA dependency. I'm trying to use {:exla, "~> 0.7"}. I have an application running inside a docker container, built from image ubuntu:22.04. When I restart the container, I get the following error:

Error message

2024-05-22 13:08:08.420 [warning] []   The on_load function for module Elixir.EXLA.NIF returned:
  {:error,
    {:load_failed,
      'Failed to load NIF library: \'/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32\' not found (required by /usr/src/app/_build/dev/lib/exla/priv/libexla.so)\''}}

2024-05-22 13:08:08.421 [notice] []    Application exla exited: EXLA.Application.start(:normal, []) returned an error: shutdown: failed to start child: EXLA.Logger
  ** (EXIT) an exception was raised:
      ** (UndefinedFunctionError) function EXLA.NIF.start_log_sink/1 is undefined (module EXLA.NIF is not available)
           (exla 0.7.2) EXLA.NIF.start_log_sink(#PID<0.423.0>)
           (exla 0.7.2) lib/exla/logger.ex:12: EXLA.Logger.init/1
           (stdlib 3.17) gen_server.erl:423: :gen_server.init_it/2
           (stdlib 3.17) gen_server.erl:390: :gen_server.init_it/6
           (stdlib 3.17) proc_lib.erl:226: :proc_lib.init_p_do_apply/3

This could make sense as running the following inside the container:

strings /lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX

does not show GLIBCXX_3.4.32
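
The same comparison can be scripted. What follows is a minimal sketch, not part of EXLA: `max_glibcxx` and `nif_loadable` are hypothetical helper names, and scanning a file for `GLIBCXX_x.y.z` strings is only a heuristic (`readelf -V` or `objdump -T` would report the exact version requirements). It assumes GNU `grep` and GNU `sort`.

```shell
# Hypothetical helpers (not part of EXLA). Assumes GNU grep and GNU sort.

# Print the highest GLIBCXX_x.y.z tag that appears in a binary (nothing if none).
max_glibcxx() {
  grep -aoE 'GLIBCXX_[0-9]+(\.[0-9]+)+' "$1" | sort -uV | tail -n 1
}

# Succeed when the newest tag referenced by the NIF ($1) is not newer than
# the newest tag provided by the C++ runtime ($2).
nif_loadable() {
  need=$(max_glibcxx "$1")
  have=$(max_glibcxx "$2")
  [ -z "$need" ] && return 0
  newest=$(printf '%s\n%s\n' "$need" "$have" | sort -V | tail -n 1)
  [ "$newest" = "$have" ]
}
```

With the paths from the error message, `nif_loadable /usr/src/app/_build/dev/lib/exla/priv/libexla.so /lib/x86_64-linux-gnu/libstdc++.so.6` would be expected to fail here, since the NIF references GLIBCXX_3.4.32 and the runtime does not list it.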


If I just change the version of the EXLA dependency in mix.exs to {:exla, "~> 0.6"}, I don't get the error and after compiling the library the application starts fine. But after restarting the container again I get the same error and it can only be fixed by changing EXLA back to {:exla, "~> 0.7"}. And then back to the loop.

I don't know if I'm missing something or this is a bug, some help would be much appreciated :smiley:

josevalim commented 6 months ago

Which GCC do you have installed on the machine?

victor23k commented 6 months ago

Version 11.4.0

$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

josevalim commented 6 months ago

Can you try running the .so file directly OR ldd --version?

victor23k commented 6 months ago

$ ldd --version
ldd (Ubuntu GLIBC 2.35-0ubuntu3.7) 2.35

More info on the issue that I just discovered

I was using iex to test out code changes, and after a recompile the console, the application and the machine all crash without leaving a trace. When I start the container back up, I get the error that I showed in the first comment.

josevalim commented 6 months ago

Thank you. I am trying to understand the problem as well, and it seems that gcc, g++, glibc, and libstdc++ all have their own versions. Which values do you get from strings /lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX? Do you get anything more recent than GLIBCXX_3.4.32? Could it be that your version is too old? Or is there a mismatch between the libstdc++ and g++ versions? https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html

victor23k commented 6 months ago

Thanks for your help José. The last version of GLIBCXX that I get from executing that command is GLIBCXX_3.4.30. So the version appears to be old.

After testing a few more things, I can confirm that:

If the problem is that the GLIBCXX version is old, shouldn't it also fail on first compilation of the application?

jonatanklosko commented 6 months ago

This is confusing, I cannot reproduce it using Docker.

$ docker run --rm -it --platform hexpm/elixir:1.16.3-erlang-26.2.5-ubuntu-jammy-20240427
$ apt update && apt install wget build-essential
$ iex
iex> Mix.install([:exla])

Works as expected, and the glibc versions match what @victor23k posted.

We intentionally precompile the binaries on older Ubuntu (20) to make sure we use an older glibc for broader compatibility.

davydog187 commented 5 months ago

I'm also hitting this adding :exla to an existing project, CI running on https://ubicloud.com on a ubicloud-standard-2 x86 machine.

I'm purging all caches and if that doesn't clear the problem I'll report back with additional details

Update: Purging the cache did not fix it

$ ldd --version
ldd (Ubuntu GLIBC 2.35-0ubuntu3.7) 2.35

$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

I have an (unfounded) theory that this could be an architecture issue. Going to try to run on ARM

davydog187 commented 5 months ago

I think I found the source of my issue. I cache and compile dependencies in one job, then restore the cache and run the tests in another.

It appears that exla is being stored in /home/runner/.cache/xla/exla/elixir-1.16.2-erts-14.2.4-xla-0.5.1-exla-0.6.4-nhpqdanj5ap2ccwbksuqlmr5zi/libexla.so

==> exla
Unpacking /home/runner/.cache/xla/0.5.1/cache/download/xla_extension-x86_64-linux-gnu-cpu.tar.gz into /home/runner/work/platform/platform/sidecar/deps/exla/cache
g++ -fPIC -I/home/runner/work/_temp/.setup-beam/otp/erts-14.2.4/include -Icache/xla_extension/include -O3 -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -shared -std=c++17 -w -DLLVM_VERSION_STRING= c_src/exla/exla.cc c_src/exla/exla_nif_util.cc c_src/exla/exla_client.cc -o cache/libexla.so -Lcache/xla_extension/lib -lxla_extension -Wl,-rpath,'$ORIGIN/xla_extension/lib'
Caching libexla.so at /home/runner/.cache/xla/exla/elixir-1.16.2-erts-14.2.4-xla-0.5.1-exla-0.6.4-nhpqdanj5ap2ccwbksuqlmr5zi/libexla.so

In the subsequent job that runs mix test, the built libexla.so is not found

Run mix test
Generated sidecar app
Warning: 01:24:37.667 [warning] The on_load function for module Elixir.EXLA.NIF returned:
{:error,
 {:load_failed,
  ~c"Failed to load NIF library: '/home/runner/work/platform/platform/sidecar/_build/test/lib/exla/priv/libexla.so: cannot open shared object file: No such file or directory'"}}

** (Mix) Could not start application exla: EXLA.Application.start(:normal, []) returned an error: shutdown: failed to start child: EXLA.Logger
    ** (EXIT) an exception was raised:
        ** (UndefinedFunctionError) function EXLA.NIF.start_log_sink/1 is undefined (module EXLA.NIF is not available)
            (exla 0.6.4) EXLA.NIF.start_log_sink(#PID<0.331.0>)
            (exla 0.6.4) lib/exla/logger.ex:12: EXLA.Logger.init/1
            (stdlib 5.2.2) gen_server.erl:980: :gen_server.init_it/2
            (stdlib 5.2.2) gen_server.erl:935: :gen_server.init_it/6
            (stdlib 5.2.2) proc_lib.erl:241: :proc_lib.init_p_do_apply/3

I'm not exactly sure what the EXLA build step is doing, if there's configuration there that needs to be changed, or if GHA cache is breaking this.
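
A CI job could guard against this state before running the tests. The sketch below is an assumption about how such a guard might look, not something taken from the EXLA docs: `nif_present` is a hypothetical helper, and its default path is the one from the error message above. `[ -e ]` follows symlinks, so it also fails for a link whose target was not restored.

```shell
# Hypothetical guard (not part of EXLA): succeed only if the NIF file exists.
# -e follows symlinks, so a dangling link counts as missing.
nif_present() {
  [ -e "${1:-_build/test/lib/exla/priv/libexla.so}" ]
}
```

In a workflow this could run as `nif_present || mix deps.compile exla --force` ahead of `mix test`, so that a missing library should trigger a rebuild instead of a failed application start.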

jonatanklosko commented 5 months ago

Ah, I was assuming that the glibc mismatch has to do with the XLA binary, but it may very well be the EXLA NIF, and that makes more sense.

When compiling EXLA NIF files, we cache the resulting libexla.so globally (the heavy part is already precompiled and linked separately, but we still want to reuse the EXLA compilation across projects and environments for faster build). So, the original issue could happen if EXLA was first compiled in a Docker image with more recent glibc, cached, and then reused when using an image with older glibc.

@victor23k in your case, do you reuse ~/.cache/xla? Perhaps you tried ubuntu:24.04 prior?

@davydog187 is the last error you posted the original one you ran into? It is actually different: the NIF fails to load because the .so is missing altogether, not because of a glibc mismatch. I am confused though why the project-local .so is gone; it is either copied from the global cache or compiled from scratch. Does anything unusual happen to the project-local cache/ before mix test?

polvalente commented 5 months ago

@jonatanklosko I took a look into @davydog187's issue yesterday, and in short the issue is due to multiple layers of "cache these results and copy over to the next step".

The root cause made me think a little bit, because the workflow assumed that deps was a stateless dir. And usually it is, but for EXLA it is not, because of the cache that holds libexla.so.

@josevalim do you also have any thoughts on this?

josevalim commented 5 months ago

@polvalente we need to write some Elixir code or makefile conditions that check for those constraints and/or dependencies.

Another option is: we added this so you don't have to compile EXLA for dev/test/prod. So what we could do is to remove deps/exla/cache and instead, when running in dev/test/prod, we check the other environments to see if they have already built it. This way everything is in _build but we still get to copy it. Thoughts?

jonatanklosko commented 5 months ago

@josevalim given that we cache globally, wouldn't it be enough to either copy from the global cache to _build or let it compile otherwise?

josevalim commented 5 months ago

We cache XLA globally but not EXLA, right?

josevalim commented 5 months ago

We could cache EXLA globally but then we would need the same cache expiration mechanism (or versioned paths that include glibc) I mentioned above. It ends up being the same to me.

jonatanklosko commented 5 months ago

@josevalim we cache EXLA globally, #1016 :p

josevalim commented 5 months ago

GAH. So yes, we need to include glibc or whatever as part of the path.
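
As a rough illustration of what a glibc-aware cache path could look like (a sketch only: `libc_tag` and `versioned_cache_dir` are hypothetical names, and EXLA computes its real cache key in its own build code), `getconf GNU_LIBC_VERSION` reports the C library version on glibc systems. Since the missing symbol in the original report came from libstdc++, a real key might need that version as well.

```shell
# Hypothetical sketch (not EXLA's implementation).

# Print a tag such as "glibc-2.35"; fall back to "unknown-libc" when
# getconf is unavailable or the system does not use glibc.
libc_tag() {
  v=$(getconf GNU_LIBC_VERSION 2>/dev/null) || v=unknown-libc
  printf '%s\n' "$v" | tr ' ' '-'
}

# Append the libc tag to an existing cache key ($1) under the XLA cache root.
versioned_cache_dir() {
  printf '%s/exla/%s-%s\n' "${XLA_CACHE_DIR:-$HOME/.cache/xla}" "$1" "$(libc_tag)"
}
```

Two images with different glibc versions would then resolve to different directories for the same key, so a library built in one would not be picked up in the other.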

davydog187 commented 5 months ago

I think I'm missing something here, but doesn't the global cache in ~/.cache still have a problem in CI workflows like mine? Unless you know that directory exists and explicitly cache it, it will get lost in between jobs.

josevalim commented 5 months ago

@davydog187 good call. I think that's somewhat a separate problem but we could make that path customizable if it isn't yet.

victor23k commented 5 months ago

I found the root of the problem in my case. I have a volume defined in the docker container to share the whole project directory. This includes the _build and deps directories, and whenever I changed the code from my editor, elixir-ls would recompile whatever was needed. Since the host OS and the container OS don't match, and neither do the libraries required to compile exla, compiling from the host would replace libexla.so with a binary built against a different glibc version.

Excluding the _build and deps directories from the docker volume solved the problem for me.
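
One common way to do this with a bind-mounted project is to mask the two directories with separate volumes. The sketch below is an assumption about the setup, not my exact configuration: `run_dev_container` is a hypothetical wrapper, the image name is supplied by the caller, and `/usr/src/app` is the application path from the error message above. Setting `DOCKER=echo` prints the arguments instead of starting a container.

```shell
# Hypothetical wrapper. Usage: run_dev_container IMAGE [COMMAND...]
# The bind mount shares the project; the two extra -v flags create
# container-only volumes that hide the host's _build and deps.
run_dev_container() {
  image=$1
  shift
  "${DOCKER:-docker}" run --rm -it \
    -v "$PWD":/usr/src/app \
    -v /usr/src/app/_build \
    -v /usr/src/app/deps \
    -w /usr/src/app \
    "$image" "$@"
}
```

For example, `run_dev_container my-image iex -S mix`, where `my-image` is a placeholder. Anonymous volumes are discarded together with a `--rm` container, so named volumes (for instance `-v app_build:/usr/src/app/_build`) may be preferable to avoid recompiling on every start.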

Thanks for the help :heart:

josevalim commented 2 months ago

Btw, the cache dir is configurable at XLA_CACHE_DIR, in case someone wants to use it for CI. But caching the deps may be enough.
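
For CI, that could look like the following sketch. The directory name `.xla_cache` is an arbitrary choice for illustration; the idea is to point the cache at a path that the CI system already saves and restores between jobs.

```shell
# Assumption: .xla_cache (an arbitrary name) is saved and restored by the CI cache.
# Keep any value that is already set; otherwise use a project-local directory.
export XLA_CACHE_DIR="${XLA_CACHE_DIR:-$PWD/.xla_cache}"
mkdir -p "$XLA_CACHE_DIR"
```

This would run before `mix deps.compile`, in every job, so that the build step and later jobs see the same value.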