elixir-explorer / adbc

Apache Arrow ADBC bindings for Elixir
https://arrow.apache.org/adbc/
Apache License 2.0

ADBC, Phoenix and Docker build a container that waits forever #99

Closed CallMeSH closed 2 months ago

CallMeSH commented 2 months ago

Hello everyone! I've recently encountered an issue when using ADBC inside a Linux amd64 Docker container.

While the application builds successfully, running the container results in an indefinite wait with no logs. I tried to get more logs out of the Phoenix project, without success. Interestingly, the container runs fine on an Apple computer with arm64 architecture. I've tried both building on an amd64 Linux machine in the cloud and cross-compiling from my Mac, but the issue persists. After exhausting various options, I'm turning to the issue tracker to ask for your help.

I've prepared a sample project that reproduces the problem: https://github.com/CallMeSH/ADBC-Docker-sample-project

Here is how to get the problem, starting from a new Phoenix project:

  1. Create a new Phoenix project with mix phx.new
  2. Setup the dockerfile/release with mix phx.gen.release --docker
  3. Add adbc as a dependency
  4. Build the container with docker build . --platform=linux/amd64
  5. The build will fail because cmake is missing, so I add it to the apt-get phase
  6. The build will then fail because QEMU and OTP 25+ have issues together, according to the Elixir forums. Adding ENV ERL_FLAGS="+JPperf true" to the Dockerfile fixes this.
  7. The build succeeds, but running the container now waits forever. Removing adbc from the dependencies fixes the hang.
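
Steps 5 and 6 can be sketched as a small shell helper that patches the generated Dockerfile. This is only an illustration: the function name patch_dockerfile is hypothetical, and it assumes the Dockerfile produced by mix phx.gen.release --docker contains an "apt-get install -y build-essential git" line and a "FROM ... as builder" stage line, which may differ between Phoenix versions. It is meant to be run once on a freshly generated file.

```shell
# Hypothetical helper: apply steps 5 and 6 to a generated Dockerfile.
# Assumes the file has a "build-essential git" install line and a
# "FROM ... as builder" stage line (check your own Dockerfile first).
patch_dockerfile() {
  # $1: path to the Dockerfile to patch in place
  awk '
    # Step 5: add cmake next to the existing build tools (skip if present).
    !/cmake/ { sub(/build-essential git/, "build-essential git cmake") }
    { print }
    # Step 6: set ERL_FLAGS right after the builder stage starts.
    /^FROM .* [aA][sS] builder/ { print "ENV ERL_FLAGS=\"+JPperf true\"" }
  ' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Usage (not run here):
#   patch_dockerfile Dockerfile
#   docker build . --platform=linux/amd64
```

The ERL_FLAGS line only matters when the build runs under QEMU emulation, so it can be left out when building natively on an amd64 host.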

If anyone has ideas on how to investigate further, I'd be glad to pursue them. I apologize if I've overlooked something obvious. Thank you

josevalim commented 2 months ago

I haven't seen this before. My first comment would be to mention +JPperf, but you already tackled that. You also said that this fails when building the Docker image directly on the Linux/amd64 machine, right?

So the only next step I can think of is to run the application and build a release on a Linux/amd64 machine without using Docker at all. If the issue persists, then it is an ADBC issue (either here or the parent one). If it works without Docker, then it is something Docker or QEMU related.

CallMeSH commented 2 months ago

Thanks for your reply @josevalim,

Following your message:

So it seems very QEMU or Docker related...

josevalim commented 2 months ago

If you are building for the same architecture you are on, it should not use emulation. Assuming you are invoking the same commands, I can’t really explain why you are seeing different behaviors.

CallMeSH commented 2 months ago

Hmm, so your comment made me think about this whole QEMU situation. Since I was on the cloud instance and not on my Mac, QEMU was not involved. To make sure it was not, I removed the ENV ERL_FLAGS="+JPperf true" line from the Dockerfile. It built without +JPperf, so it definitely didn't run in a QEMU context. For good measure I opted out of BuildKit as well, with the same result: it hangs.

Sooooo that must be something related to adbc in a docker context 🤔

EDIT: I noticed I can enter BREAK mode when I press CTRL-C while the container runs with docker's -it flags. So the BEAM is alive, I just don't know how I could gather useful info from there.

EDIT 2: Found this promising bit in the proc info

y(1)     [standard_io,[<<"Failed to load nif: {:load_failed, ~c\"Failed to load NIF library: '/usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /app/_build/prod/rel/adbc_docker_sample_project/lib/adbc-0.6.0/priv/adbc_nif.so)'\"}">>,10]]

CallMeSH commented 2 months ago

OK, so after much investigation, here is what I found: the precompiled archives produced by this repo's CI are built on GitHub's ubuntu:latest container, which is Ubuntu 24.04. Ubuntu comes bundled with GLIBCXX_3.4.29, but the Debian image used in the Dockerfile generated by Phoenix does not. By switching the base image from debian-bullseye to ubuntu-noble, I managed to run the project!
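
For anyone who wants to check a given image for this mismatch, here is a rough shell sketch that compares the GLIBCXX version tags referenced by the NIF with those present in the image's libstdc++. The function names glibcxx_versions and missing_glibcxx are made up for this example. It relies on GNU grep and GNU sort being available, and it simply scans the binaries for GLIBCXX_x.y.z strings, which is a heuristic and not a proper symbol-version lookup.

```shell
# List the GLIBCXX_x.y.z version tags found as strings in a binary ($1).
glibcxx_versions() {
  LC_ALL=C grep -ao 'GLIBCXX_[0-9][0-9.]*' "$1" | sort -Vu
}

# Print every tag referenced by the NIF ($1) that libstdc++ ($2) lacks.
missing_glibcxx() {
  provided=$(glibcxx_versions "$2")
  glibcxx_versions "$1" | while read -r tag; do
    printf '%s\n' "$provided" | grep -Fxq "$tag" || printf '%s\n' "$tag"
  done
}

# Usage inside the container, with the paths from the error above:
#   missing_glibcxx \
#     /app/_build/prod/rel/adbc_docker_sample_project/lib/adbc-0.6.0/priv/adbc_nif.so \
#     /usr/lib/x86_64-linux-gnu/libstdc++.so.6
```

Empty output suggests the image's libstdc++ covers everything the NIF asks for. Any printed tag, such as the GLIBCXX_3.4.29 from the error above, points to a base image whose libstdc++ is too old.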

Should we update the documentation of adbc to mention somewhere that a certain version of the distributions is required to boot the app? (The issue itself can be considered closed, but I'd like to help in making the information more available 👍 )

josevalim commented 2 months ago

Great finding. Can you please try using Debian "bookworm" and let us know if it works? Then please send a PR saying that it requires either ubuntu-noble or debian-bookworm. We will see if we can lower the requirement here.

We should also try to update it in Phoenix.

CallMeSH commented 2 months ago

I tried building on bookworm but the mix local.rebar --force step crashes on the image hexpm/elixir:1.17.2-erlang-26.0.2-debian-bookworm-20240812-slim

 > [builder  4/17] RUN mix local.hex --force &&   mix local.rebar --force:
0.496 =CRASH REPORT==== 19-Aug-2024::17:38:36.151062 ===
0.496   crasher:
0.496     initial call: application_master:init/4
0.496     pid: <0.86.0>
0.496     registered_name: []
0.496     exception exit: {bad_return,
0.496                      {{elixir,start,[normal,[]]},
0.496                       {'EXIT',
0.496                        {undef,
0.496                         [{erlang,nif_error,
0.496                           [undef],
0.496                           [{error_info,#{module => erl_erts_errors}}]},
0.496                          {prim_tty,isatty,1,
0.496                           [{file,"prim_tty.erl"},{line,1249}]},
0.496                          {elixir,start,2,[{file,"src/elixir.erl"},{line,63}]},
0.496                          {application_master,start_it_old,4,
0.496                           [{file,"application_master.erl"},{line,293}]}]}}}}
0.496       in function  application_master:init/4 (application_master.erl, line 142)
0.496     ancestors: [<0.85.0>]
0.496     message_queue_len: 1
0.496     messages: [{'EXIT',<0.87.0>,normal}]
0.496     links: [<0.85.0>,<0.44.0>]
0.496     dictionary: []
0.496     trap_exit: true
0.496     status: running
0.496     heap_size: 987
0.496     stack_size: 28
0.496     reductions: 220
0.496   neighbours:
0.496 
0.496 =INFO REPORT==== 19-Aug-2024::17:38:36.160739 ===
0.496     application: elixir
0.496     exited: {bad_return,
0.496                 {{elixir,start,[normal,[]]},
0.496                  {'EXIT',
0.496                      {undef,
0.496                          [{erlang,nif_error,
0.496                               [undef],
0.496                               [{error_info,#{module => erl_erts_errors}}]},
0.496                           {prim_tty,isatty,1,
0.496                               [{file,"prim_tty.erl"},{line,1249}]},
0.496                           {elixir,start,2,[{file,"src/elixir.erl"},{line,63}]},
0.496                           {application_master,start_it_old,4,
0.496                               [{file,"application_master.erl"},
0.496                                {line,293}]}]}}}}
0.496     type: temporary
0.496 
0.496 =INFO REPORT==== 19-Aug-2024::17:38:36.161959 ===
0.496     application: compiler
0.496     exited: stopped
0.496     type: temporary
0.496 
0.497 Runtime terminating during boot ({{badmatch,{error,{elixir,{_}}}},[{elixir,start_cli,0,[{_},{_}]},{init,start_em,1,[]},{init,do_boot,3,[]}]})
0.498 
0.498 Crash dump is being written to: erl_crash.dump...done
------
Dockerfile:31
--------------------
  30 |     # install hex + rebar
  31 | >>> RUN mix local.hex --force && \
  32 | >>>   mix local.rebar --force
  33 |     
--------------------

After looking into GitHub's runner-images, ubuntu:latest is in fact jammy. I just tried it and it works as well.

Here is the recap of what works or does not:

I think Alpine should be fine since it uses musl libc rather than glibc, but my apk-fu is not good enough to try it right now.

As I'm writing this, @cocoa-xu has already prepared a draft PR that I think will fix this issue: https://github.com/elixir-explorer/adbc/pull/100

josevalim commented 2 months ago

I will investigate why Elixir is not compiling on Debian but I think #100 will indeed address this. Thank you!