Closed victor23k closed 2 months ago
Which GCC do you have installed on the machine?
Version 11.4.0
$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Can you try running the .so file directly, or ldd --version?
$ ldd --version
ldd (Ubuntu GLIBC 2.35-0ubuntu3.7) 2.35
I was using iex to test out code changes, and after calling recompile, the console, application, and machine crash without leaving a trace. When I start the container back up, I get the compilation error that I showed in the first comment.
Thank you. I am trying to understand the problem as well, and it seems that gcc, g++, glibc, and libstdc++ each have their own versions. Which values do you get from strings /lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX? Do you get anything more recent than GLIBCXX_3.4.32? Could it be that your version is too old? Or is there a mismatch between the libstdc++ and g++ versions? https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html
Thanks for your help José. The last version of GLIBCXX that I get from executing that command is GLIBCXX_3.4.30. So the version appears to be old.
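The comparison being made here can be scripted. What follows is a minimal sketch, assuming GNU strings, grep, and sort -V are available (they are on Ubuntu once build-essential is installed). The function names max_glibcxx and glibcxx_of and the example libexla.so path are illustrative and are not part of EXLA.

```shell
#!/bin/sh
# Print the highest GLIBCXX_x.y.z version tag found on stdin.
# The pattern requires a digit after the underscore, so unrelated
# strings such as GLIBCXX_DEBUG_MESSAGE_LENGTH are ignored.
max_glibcxx() {
  grep -o 'GLIBCXX_[0-9][0-9.]*' | sort -u -V | tail -n 1
}

# Highest GLIBCXX version tag mentioned anywhere in a binary.
glibcxx_of() {
  strings "$1" | max_glibcxx
}

# Usage (paths are examples):
#   glibcxx_of /lib/x86_64-linux-gnu/libstdc++.so.6   # newest version the system provides
#   glibcxx_of _build/dev/lib/exla/priv/libexla.so    # newest version the NIF refers to
```

If the second value is newer than the first, the NIF was most likely built against a newer libstdc++ than the one installed on the machine loading it.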
After testing a few more things, I can confirm that:
[...] exla version.
[...] Failed to load NIF library error.
If the problem is that the GLIBCXX version is old, shouldn't it also fail on first compilation of the application?
This is confusing, I cannot reproduce it using Docker.
$ docker run --rm -it --platform hexpm/elixir:1.16.3-erlang-26.2.5-ubuntu-jammy-20240427
$ apt update && apt install wget build-essential
$ iex
iex> Mix.install([:exla])
Works as expected, and the glibc versions match what @victor23k posted.
We intentionally precompile the binaries on older Ubuntu (20) to make sure we use an older glibc for broader compatibility.
I'm also hitting this after adding :exla to an existing project, with CI running on https://ubicloud.com on a ubicloud-standard-2 x86 machine.
I'm purging all caches, and if that doesn't clear the problem I'll report back with additional details.
Update: Purging the cache did not fix it.
$ ldd --version
ldd (Ubuntu GLIBC 2.35-0ubuntu3.7) 2.35
$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I have an (unfounded) theory that this could be an architecture issue. I'm going to try running on ARM.
I think I found the source of my issue. I cache and compile dependencies in one job, then restore the cache and run the tests in another.
It appears that exla is being stored in /home/runner/.cache/xla/exla/elixir-1.16.2-erts-14.2.4-xla-0.5.1-exla-0.6.4-nhpqdanj5ap2ccwbksuqlmr5zi/libexla.so:
==> exla
Unpacking /home/runner/.cache/xla/0.5.1/cache/download/xla_extension-x86_64-linux-gnu-cpu.tar.gz into /home/runner/work/platform/platform/sidecar/deps/exla/cache
g++ -fPIC -I/home/runner/work/_temp/.setup-beam/otp/erts-14.2.4/include -Icache/xla_extension/include -O3 -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -shared -std=c++17 -w -DLLVM_VERSION_STRING= c_src/exla/exla.cc c_src/exla/exla_nif_util.cc c_src/exla/exla_client.cc -o cache/libexla.so -Lcache/xla_extension/lib -lxla_extension -Wl,-rpath,'$ORIGIN/xla_extension/lib'
Caching libexla.so at /home/runner/.cache/xla/exla/elixir-1.16.2-erts-14.2.4-xla-0.5.1-exla-0.6.4-nhpqdanj5ap2ccwbksuqlmr5zi/libexla.so
In the subsequent job that runs mix test, the built libexla.so is not found:
Run mix test
Generated sidecar app
Warning: 01:24:37.667 [warning] The on_load function for module Elixir.EXLA.NIF returned:
{:error,
{:load_failed,
~c"Failed to load NIF library: '/home/runner/work/platform/platform/sidecar/_build/test/lib/exla/priv/libexla.so: cannot open shared object file: No such file or directory'"}}
** (Mix) Could not start application exla: EXLA.Application.start(:normal, []) returned an error: shutdown: failed to start child: EXLA.Logger
** (EXIT) an exception was raised:
** (UndefinedFunctionError) function EXLA.NIF.start_log_sink/1 is undefined (module EXLA.NIF is not available)
(exla 0.6.4) EXLA.NIF.start_log_sink(#PID<0.331.0>)
(exla 0.6.4) lib/exla/logger.ex:12: EXLA.Logger.init/1
(stdlib 5.2.2) gen_server.erl:980: :gen_server.init_it/2
(stdlib 5.2.2) gen_server.erl:935: :gen_server.init_it/6
(stdlib 5.2.2) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
I'm not exactly sure what the EXLA build step is doing, if there's configuration there that needs to be changed, or if GHA cache is breaking this.
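One way to narrow down a "cannot open shared object file: No such file or directory" error is to check whether the path is absent or is a symlink whose target was not restored. The sketch below is a generic diagnostic. The helper name check_so is made up for illustration, and nothing in this thread establishes that EXLA uses a symlink at this path.

```shell
#!/bin/sh
# Classify a shared-object path as ok, a dangling symlink, or missing.
check_so() {
  if [ -L "$1" ] && [ ! -e "$1" ]; then
    # The link itself exists, but the file it points to does not.
    echo "dangling symlink -> $(readlink "$1")"
  elif [ -e "$1" ]; then
    echo "ok"
  else
    echo "missing"
  fi
}

# Usage, run from the project root with the path from the error above:
#   check_so _build/test/lib/exla/priv/libexla.so
```

Running it in both the build job and the test job would show at which step the file disappears.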
Ah, I was assuming that the glibc mismatch has to do with the XLA binary, but it may very well be the EXLA NIF, and that makes more sense.
When compiling the EXLA NIF files, we cache the resulting libexla.so globally (the heavy part is already precompiled and linked separately, but we still want to reuse the EXLA compilation across projects and environments for faster builds). So the original issue could happen if EXLA was first compiled in a Docker image with a more recent glibc, cached, and then reused in an image with an older glibc.
@victor23k in your case, do you reuse ~/.cache/xla? Perhaps you tried ubuntu:24.04 prior?
@davydog187 is the last error you posted the original one you ran into? It is actually different: the NIF fails to load because the .so is missing altogether, not because of a glibc mismatch. I am confused, though, about why the project-local .so is gone, since it is either copied from the global cache or compiled from scratch. Does anything unusual happen to the project-local cache/ before mix test?
@jonatanklosko I took a look into @davydog187's issue yesterday, and in short the issue is due to multiple layers of "cache these results and copy over to the next step".
The root cause made me think a little bit, because the workflow assumed that deps was a stateless dir. Usually it is, but for EXLA it is not, because of the cache directory that holds libexla.so.
@josevalim do you also have any thoughts on this?
@polvalente we need to write some Elixir code or makefile conditions that check for those constraints and/or dependencies.
Another option: we added this cache so you don't have to compile EXLA separately for dev/test/prod. So what we could do is remove deps/exla/cache and instead, when running in dev/test/prod, check the other environments to see if they have already built it. This way everything is in _build but we still get to copy it. Thoughts?
@josevalim given that we cache globally, wouldn't it be enough to either copy from the global cache to _build or let it compile otherwise?
We cache XLA globally but not EXLA, right?
We could cache EXLA globally but then we would need the same cache expiration mechanism (or versioned paths that include glibc) I mentioned above. It ends up being the same to me.
@josevalim we cache EXLA globally, #1016 :p
GAH. So yes, we need to include glibc or whatever as part of the path.
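A minimal sketch of what "include glibc as part of the path" could look like. The key format and the function name exla_cache_key are hypothetical and are not EXLA's actual scheme. It assumes a glibc-based system where getconf GNU_LIBC_VERSION is available, and falls back to "unknown" elsewhere.

```shell
#!/bin/sh
# Build a cache directory name that also encodes the C library version.
# Arguments: <exla version> <libc description, e.g. "glibc 2.35">
exla_cache_key() {
  libc=$(printf '%s' "$2" | tr ' ' '-')
  printf 'exla-%s-%s\n' "$1" "$libc"
}

# On glibc systems getconf reports something like "glibc 2.35".
libc_version=$(getconf GNU_LIBC_VERSION 2>/dev/null || echo unknown)

# Print the key for this machine (0.6.4 is just the version seen in the logs above).
exla_cache_key 0.6.4 "$libc_version"
```

With a key like this, a libexla.so built on a newer image would land in a different directory than one built on an older image, so the stale binary would not be picked up.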
I think I'm missing something here, but doesn't the global cache in ~/.cache still have a problem in CI workflows like mine? Unless you know that directory exists and explicitly cache it, it will get lost between jobs.
@davydog187 good call. I think that's somewhat a separate problem but we could make that path customizable if it isn't yet.
I found the root of the problem in my case. I have a volume defined in the docker container to share the whole project directory. This includes the _build and deps directories, and whenever I changed the code from my editor, elixir-ls would recompile whatever was needed. Since the host OS and the container OS don't match, and neither do the libraries required to compile exla, compiling from the host would replace the libexla.so binary with one built against a different glibc version.
Excluding the _build and deps directories from the docker volume solved the problem for me.
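One common way to exclude subdirectories from a bind mount is to shadow them with anonymous volumes. This is a sketch, not necessarily the exact setup used here. The /app mount point, the run_dev_container name, and the bash entry command are assumptions.

```shell
#!/bin/sh
# Start a dev container with the project bind-mounted at /app, while
# anonymous volumes shadow /app/_build and /app/deps so that host-side
# compilation (for example by elixir-ls) never touches the container's
# build artifacts, and vice versa.
run_dev_container() {
  docker run --rm -it \
    -v "$1:/app" \
    -v /app/_build \
    -v /app/deps \
    -w /app \
    ubuntu:22.04 bash
}

# Usage: run_dev_container "$PWD"
```

The trade-off is that the container keeps its own copies of _build and deps, so dependencies are fetched and compiled separately on each side.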
Thanks for the help :heart:
Btw, the cache dir is configurable via XLA_CACHE_DIR, in case someone wants to use it for CI. But caching the deps may be enough.
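For CI, a minimal sketch of pointing XLA_CACHE_DIR at a directory inside the workspace, so that the CI cache step can save and restore it between jobs. The .cache/xla location is only an example, and how it gets wired into a given CI system's cache configuration is left out.

```shell
#!/bin/sh
# Keep the cache inside the workspace unless the variable is already set.
XLA_CACHE_DIR="${XLA_CACHE_DIR:-$PWD/.cache/xla}"
export XLA_CACHE_DIR
mkdir -p "$XLA_CACHE_DIR"

# Later build steps in the same shell inherit the variable, e.g.:
#   mix deps.get && mix compile
```

The same directory then needs to be listed in both the job that compiles and the job that runs the tests.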
Description
Hi there! I'm having some trouble using the EXLA dependency. I'm trying to use {:exla, "~> 0.7"}. I have an application running inside a docker container, built from the image ubuntu:22.04. When I restart the container, I get the following error:
Error message
This could make sense, as running the following inside the container:
does not show GLIBCXX_3.4.32.
If I just change the version of the EXLA dependency in mix.exs to {:exla, "~> 0.6"}, I don't get the error, and after compiling the library the application starts fine. But after restarting the container again I get the same error, and it can only be fixed by changing EXLA back to {:exla, "~> 0.7"}. And then back to the loop.
I don't know if I'm missing something or this is a bug, some help would be much appreciated :smiley: