elixir-nx / hnswlib

Elixir binding for the hnswlib library.
Apache License 2.0
52 stars 9 forks source link

HNSWLib causes crash loop in fly.io app #10

Closed cbh123 closed 11 months ago

cbh123 commented 11 months ago

Sorry for the repeat issues here — I would just love to get this to work in production! As of the latest, I updated my dep to use {:hnswlib, "~> 0.1.2"} and it deploys. Thanks!

The issue is that the app then continually crashes in Fly. When I comment out {:hnswlib, "~> 0.1.2"} in my mix.exs, the app deploys + runs fine. When I add it back in, it works locally, but after being deployed the app continually crashes with a bunch of logs that look like this:

2023-09-14T19:25:10Z app[9080e710a35918] ewr [info]19:25:10.238 [notice] SIGTERM received - shutting down
2023-09-14T19:25:10Z app[9080e710a35918] ewr [info]19:25:10.239 [info] Shutting down 63 sockets in 1 rounds of 2000ms
error.message="instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)" 2023-09-14T19:25:10Z proxy[e148e02a535089] ewr [error]request.method="GET" request.id="01HAAJF1T07G0D67YXK22PQ3E6-lga" 
error.message="instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)" 2023-09-14T19:25:10Z proxy[9080e710a35918] ewr [error]request.method="GET" request.id="01HAAJF1T07G0D67YXK22PQ3E6-lga" 
2023-09-14T19:25:11Z app[9080e710a35918] ewr [info] WARN Reaped child process with pid: 447 and signal: SIGUSR1, core dumped? false
2023-09-14T19:25:12Z app[9080e710a35918] ewr [info] INFO Main child exited normally with code: 0
2023-09-14T19:25:12Z app[9080e710a35918] ewr [info] INFO Starting clean up.
2023-09-14T19:25:12Z app[9080e710a35918] ewr [info] WARN hallpass exited, pid: 374, status: signal: 15 (SIGTERM)
2023-09-14T19:25:12Z app[9080e710a35918] ewr [info]2023/09/14 19:25:12 listening on [fdaa:0:3978:a7b:15e:8103:b31b:2]:22 (DNS: [fdaa::3]:53)
2023-09-14T19:25:13Z app[9080e710a35918] ewr [info][  248.805348] reboot: Restarting system
2023-09-14T19:25:14Z app[9080e710a35918] ewr [info][    0.212381] PCI: Fatal: No config space access function found

I'm struggling to figure out why this is. The only difference I can tell is that when I deploy without hnswlib, I see this:

WARNING The app is not listening on the expected address and will not be reachable by fly-proxy.
You can fix this by configuring your app to listen on the following addresses:
  - 0.0.0.0:5432
Found these processes inside the machine with open listening sockets:
  PROCESS                                                                                                                                                                                                                                                                                                                                                                                                                                          | ADDRESSES                           
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------*-------------------------------------
  /app/erts-14.0.2/bin/beam.smp -- -root /app -bindir /app/erts-14.0.2/bin -progname erl -- -home /nonexistent -- -proto_dist inet6_tcp -- -noshell -s elixir start_cli -mode embedded -setcookie R3GUFIFBBQVUKLU2EVAI7IDRJGGQZCKTPUW6DRGNCEP6VE6VCW4A==== -name emoji-01HAAKJAEN3V62D0B5PJB9NKXM@fdaa:0:3978:a7b:ab2:bd88:9a:2 -config /app/releases/0.1.0/sys -boot /app/releases/0.1.0/start -boot_var RELEASE_LIB /app/lib -- -extra --no-halt | [::]:33955, [::]:8080               
  /.fly/hallpass                                                                                                                                                                                                                                                                                                                                                                                                                                   | [fdaa:0:3978:a7b:ab2:bd88:9a:2]:22  
  /app/erts-14.0.2/bin/epmd -daemon                                                                                                                                                                                                                                                                                                                                                                                                                | 0.0.0.0:4369, [::]:4369            

When I deploy WITH hnsw, the warning is slightly different:

 WARNING The app is not listening on the expected address and will not be reachable by fly-proxy.
You can fix this by configuring your app to listen on the following addresses:
  - 0.0.0.0:8080
  - 0.0.0.0:5432
Found these processes inside the machine with open listening sockets:
  PROCESS                           | ADDRESSES                           
------------------------------------*-------------------------------------
  /.fly/hallpass                    | [fdaa:0:3978:a7b:ab2:bd88:9a:2]:22  
  /app/erts-14.0.2/bin/epmd -daemon | 0.0.0.0:4369, [::]:4369          

In both cases, the app successfully deploys. But when {:hnswlib, "~> 0.1.2"} is present, the app continually crashes.

I tried bumping up the memory but that didn't seem to help either.

Originally posted by @cbh123 in https://github.com/elixir-nx/hnswlib/issues/8#issuecomment-1719882037

josevalim commented 11 months ago

I should clarify this library is still alpha software, so I wouldn't be surprised if there are big bugs lurking behind. However, it seems the application asked to shut down, because it says a SIGTERM was received. Was there anything logged before SIGTERM?

2023-09-14T19:25:10Z app[9080e710a35918] ewr [info]19:25:10.238 [notice] SIGTERM received - shutting down

Otherwise why would it get the sigterm?

You can also add config :logger, handle_sasl_reports: true, level: :info at the bottom of config/prod.exs to get very verbose logging and see if something else is triggering it.

cbh123 commented 11 months ago

Hmm, I don't seem to see anything more descriptive — I'll keep debugging and see how far I can get. Thanks for your help!

2023-09-14T21:01:15Z runner[9080e710a35918] ewr [info]Pulling container image registry.fly.io/emoji:deployment-01HAAQVEXPGRRCGXDSANF5SVVC
2023-09-14T21:01:16Z runner[9080e710a35918] ewr [info]Successfully prepared image registry.fly.io/emoji:deployment-01HAAQVEXPGRRCGXDSANF5SVVC (1.153981745s)
2023-09-14T21:01:16Z runner[9080e710a35918] ewr [info]Configuring firecracker
2023-09-14T21:01:16Z app[9080e710a35918] ewr [info] INFO Sending signal SIGTERM to main child process w/ PID 372
2023-09-14T21:01:16Z app[9080e710a35918] ewr [info]21:01:16.700 [notice] SIGTERM received - shutting down
2023-09-14T21:01:16Z app[9080e710a35918] ewr [info]21:01:16.701 [info] Shutting down 62 sockets in 1 rounds of 2000ms
2023-09-14T21:01:16Z proxy[9080e710a35918] ewr [error]instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
2023-09-14T21:01:16Z proxy[e148e02a535089] ewr [error]instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
2023-09-14T21:01:17Z app[9080e710a35918] ewr [info] WARN Reaped child process with pid: 446 and signal: SIGUSR1, core dumped? false
2023-09-14T21:01:17Z app[9080e710a35918] ewr [info] WARN Reaped child process with pid: 449 and signal: SIGUSR1, core dumped? false
2023-09-14T21:01:18Z app[9080e710a35918] ewr [info] INFO Main child exited normally with code: 0
2023-09-14T21:01:18Z app[9080e710a35918] ewr [info] INFO Starting clean up.
2023-09-14T21:01:18Z app[9080e710a35918] ewr [info] WARN hallpass exited, pid: 373, status: signal: 15 (SIGTERM)
2023-09-14T21:01:18Z app[9080e710a35918] ewr [info]2023/09/14 21:01:18 listening on [fdaa:0:3978:a7b:15e:8103:b31b:2]:22 (DNS: [fdaa::3]:53)
2023-09-14T21:01:19Z app[9080e710a35918] ewr [info][ 3865.987272] reboot: Restarting system
2023-09-14T21:01:19Z app[9080e710a35918] ewr [info][    0.141953] PCI: Fatal: No config space access function found
2023-09-14T21:01:20Z app[9080e710a35918] ewr [info] INFO Starting init (commit: 9fc6a62)...
2023-09-14T21:01:20Z app[9080e710a35918] ewr [info] INFO Preparing to run: `/app/bin/server` as nobody
2023-09-14T21:01:20Z app[9080e710a35918] ewr [info] INFO [fly api proxy] listening at /.fly/api
2023-09-14T21:01:20Z app[9080e710a35918] ewr [info]2023/09/14 21:01:20 listening on [fdaa:0:3978:a7b:15e:8103:b31b:2]:22 (DNS: [fdaa::3]:53)
2023-09-14T21:01:23Z proxy[e148e02a535089] ewr [error]instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
2023-09-14T21:01:26Z proxy[9080e710a35918] ewr [error]instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
josevalim commented 11 months ago

@cbh123 and you are just loading hnswlib, right? You are not even trying to call any of its functions, right?! If that's the case, could you also try depending only on {:dll_loader_helper_beam, "~> 1.0"} and see if the same issue happens?

cbh123 commented 11 months ago

Yes exactly! This is without calling any of its functions. Just tried depending only on {:dll_loader_helper_beam, "~> 1.0"} and that works fine.

cbh123 commented 11 months ago

I just tried again, and I'm noticing DB connection errors when I include {:hnswlib, "~> 0.1.2"}. So I'm wondering now if this is a memory issue? It seems kind of strange because I wouldn't think hnsw has anything to do with the DB?

2023-09-14T21:18:31Z app[9080e710a35918] ewr [info]Start Call: Phoenix.LiveView.Channel.start_link/?
2023-09-14T21:18:31Z app[9080e710a35918] ewr [info]21:18:31.662 [error] an exception was raised:
2023-09-14T21:18:31Z app[9080e710a35918] ewr [info]  1. Ensuring your database is available and that you can connect to it
2023-09-14T21:18:31Z app[9080e710a35918] ewr [info]        (telemetry 1.2.1) /app/deps/telemetry/src/telemetry.erl:321: :telemetry.span/3
2023-09-14T21:18:31Z app[9080e710a35918] ewr [info]21:18:31.662 [error] an exception was raised:
2023-09-14T21:18:31Z app[9080e710a35918] ewr [info]    ** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 492ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:
2023-09-14T21:18:31Z app[9080e710a35918] ewr [info]  1. Ensuring your database is available and that you can connect to it
2023-09-14T21:18:31Z app[9080e710a35918] ewr [info]  2. Tracking down slow queries and making sure they are running fast enough
2023-09-14T21:18:31Z app[9080e710a35918] ewr [info]  3. Increasing the pool_size (although this increases resource consumption)
2023-09-14T21:18:31Z app[9080e710a35918] ewr [info]  4. Allowing requests to wait longer by increasing :queue_target and :queue_interval
2023-09-14T21:18:31Z app[9080e710a35918] ewr [info]See DBConnection.start_link/2 for more information
2023-09-14T21:18:31Z app[9080e710a35918] ewr [info]        (ecto_sql 3.10.2) lib/ecto/adapters/sql.ex:945: Ecto.Adapters.SQL.execute/6
2023-09-14T21:18:31Z app[9080e710a35918] ewr [info]        (ecto 3.10.3) lib/ecto/repo/queryable.ex:229: Ecto.Repo.Queryable.execute/4
2023-09-14T21:18:31Z app[9080e710a35918] ewr [info]        (phoenix_live_view 0.18.18) lib/phoenix_live_view/channel.ex:1037: Phoenix.LiveView.Channel.verified_mount/8
2023-09-14T21:18:31Z app[9080e710a35918] ewr [info]21:18:31.662 [error] Child :undefined of Supervisor #PID<0.2188.0> (Supervisor.Default) terminated
2023-09-14T21:18:31Z app[9080e710a35918] ewr [info]** (exit) an exception was raised:

That issue then leads to the non-startable state error: 2023-09-14T21:18:45Z proxy[9080e710a35918] ewr [error]machine is in a non-startable state: created,

which leads to the loop I was seeing above: machines API returned an error: "rate limit exceeded"

jeregrine commented 11 months ago

It looks like you have a port configured in your fly.toml but aren’t attaching any sockets to it. Normally this would fail healthchecks and fly will assume the app isn't running. It’s right in your error messages in the logs. Does tour fly.toml have a http_services section? If so probably remove it and try again. https://fly.io/docs/reference/configuration/

If you do intend to deploy a phoenix app make sure your port is configured.

cbh123 commented 11 months ago

Hmm, I'm using the default fly.toml so I do have a port configured.

The weird part is that if I deploy without the {:hnswlib, "~> 0.1.2"}, it works totally fine. If I deploy with that dependency, I get the above errors.

jeregrine commented 11 months ago

I am on my phone so I can’t reply with specific quotes but If you look at the log messages in both cases they have the same error at the top:

instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)" 2023-09-14T19:25:10Z proxy[e148e02a535089] ewr [error]request.method="GET" request.id="01HAAJF1T07G0D67YXK2

this is the healthcheck failing. Fly launch by default assumes a phoenix app. And if your fly.toml has the ‘http_service’ section fly will assume the app has crashed if it doesn’t attach to the port. Try configuring phoenix or removing that section.

jeregrine commented 11 months ago

I will take a look tomorrow morning and see if I can debug/reproduce this if you still see it.

cbh123 commented 11 months ago

Thanks so much! In the meantime I'll try troubleshooting.

And no need for you to spend too much time of course -- I have some other workarounds I can make to get around this dependency. So only feel the need to dive in if it's a fun challenge :)

cbh123 commented 11 months ago

One update: I figured out why I was getting the healthcheck warning, even without hnswlib. A few days ago I was trying to add a proxy to access the DB, and added the following services. The healthcheck warning goes away after I remove these.

[[services]]
  internal_port = 5432 # Postgres instance
  protocol = "tcp"

[[services.ports]]
  handlers = ["pg_tls"]
  port = 5432

That said, when I add back the hnswlib dep, the app still crashes. I think it's a problem during the mix deps.compile step in production. I'm still digging into it.

This is the warning btw that I see when I add back the {:hnswlib, "~> 0.1.2"} dep:

  [1/2] Machine e148e02a535089 [app] update finished: success

WARNING The app is not listening on the expected address and will not be reachable by fly-proxy.
You can fix this by configuring your app to listen on the following addresses:
  - 0.0.0.0:8080
Found these processes inside the machine with open listening sockets:
  PROCESS                           | ADDRESSES                           
------------------------------------*-------------------------------------
  /.fly/hallpass                    | [fdaa:0:3978:a7b:ab2:bd88:9a:2]:22  
  /app/erts-14.0.2/bin/epmd -daemon | 0.0.0.0:4369, [::]:4369             
  [2/2] Machine 9080e710a35918 [app] update finished: success
  Finished deploying
cbh123 commented 11 months ago

Ah! I'm fairly certain I see the problem. I noticed this error during the mix deps.compile. It flashes by quickly, and I'm not sure where else I can see these logs — so it's quite hard to notice. But then I piped the deploy messages into a txt file, and I noticed this line:

#19 42.64 Failed to load nif: {:load_failed, ~c"Failed to load NIF library: '/usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /app/_build/prod/lib/hnswlib/priv/hnswlib_nif.so)'"}

Edit: I got it! GPT-4 helped me figure out what I needed to add to my Dockerfile to get this to work. Thanks for your help debugging! For a future person that comes across this error, here's how to fix it:

----------GPT-4 Chat ----------- The error message indicates that the hnswlib_nif.so library (which is a part of the hnswlib dependency) requires a version of libstdc++.so.6 that includes GLIBCXX_3.4.29, but it's not found in your Docker image.

This issue can be resolved by upgrading the libstdc++6 package in your Docker image to a version that includes GLIBCXX_3.4.29.

You can do this by adding the following lines to your Dockerfile:

# Add Debian testing repository
RUN echo 'deb http://ftp.debian.org/debian testing main' >> /etc/apt/sources.list

# Upgrade libstdc++6
RUN apt-get update -y && apt-get -t testing install -y libstdc++6

These lines should be added in both the builder and runner stages of your Dockerfile, right after the existing apt-get install commands.

Here's how your Dockerfile should look after the changes:

FROM ${BUILDER_IMAGE} as builder

# install build dependencies
RUN apt-get update -y && apt-get install -y build-essential libstdc++6 git curl erlang-dev cmake  \
    && apt-get clean && rm -f /var/lib/apt/lists/*_*

# Add Debian testing repository
RUN echo 'deb http://ftp.debian.org/debian testing main' >> /etc/apt/sources.list

# Upgrade libstdc++6
RUN apt-get update -y && apt-get -t testing install -y libstdc++6

And in the runner stage:

FROM ${RUNNER_IMAGE}

RUN apt-get update -y && apt-get install -y libstdc++6 openssl libncurses5 locales \
  && apt-get clean && rm -f /var/lib/apt/lists/*_*

# Add Debian testing repository
RUN echo 'deb http://ftp.debian.org/debian testing main' >> /etc/apt/sources.list

# Upgrade libstdc++6
RUN apt-get update -y && apt-get -t testing install -y libstdc++6

After making these changes, rebuild your Docker image and redeploy your application. This should resolve the issue with the missing GLIBCXX_3.4.29 version.