elixir-ecto / db_connection

Database connection behaviour
http://hexdocs.pm/db_connection/DBConnection.html
309 stars 112 forks source link

Elixir.ErlangErrorDBConnection.Poolboy.checkout/3 #107

Closed samsondav closed 6 years ago

samsondav commented 6 years ago

We've seen this error a couple of times in production. Stack trace:

Elixir.ErlangError: Erlang error: {:timeout, {:gen_server, :call, [#PID<0.2180.0>, {:checkout, #Reference<0.2496162780.1105199106.71425>, true, 15000}, 5000]}}
  File "lib/db_connection/poolboy.ex", line 112, in DBConnection.Poolboy.checkout/3
  File "lib/db_connection.ex", line 920, in DBConnection.checkout/2
  File "lib/db_connection.ex", line 742, in DBConnection.run/3
  File "lib/db_connection.ex", line 636, in DBConnection.execute/4
  File "lib/ecto/adapters/postgres/connection.ex", line 98, in Ecto.Adapters.Postgres.Connection.execute/4
  File "lib/ecto/adapters/sql.ex", line 256, in Ecto.Adapters.SQL.sql_call/6
  File "lib/ecto/adapters/sql.ex", line 436, in Ecto.Adapters.SQL.execute_or_reset/7
  File "lib/ecto/repo/queryable.ex", line 133, in Ecto.Repo.Queryable.execute/5

Any ideas why this might be happening and what we can do to resolve it?

fishcakez commented 6 years ago

This isn't an ErlangError so your logging tool isn't doing the correct normalization, this is an exit/1 call. The error, or rather exit, occurs because a connection waits 5_000 to checkout a socket from a connection process and doesn't get one. This could mean that the connection is blocking trying to connect, or blocking trying to handshake with the database, or a combination of the the two.

If the database went down or couldn't connect due to network issue would expect logged messages saying Postgrex.Protocol could not connect and other errors saying DBConnection.ConnectionError. Therefore my best guess is that it is due to slow handshake, do you happen to have lots of tables - I mean 100s, 1,000s, 10,000s, 100,000s? Its not possible to stop poolboy or the connection checkout completely as it may always happen.

samsondav commented 6 years ago

@mitchellhenke See above for incorrect ErlangError being reported to Sentry.

We do not have lots of tables - around 8-10 in total. Although we do have one table with ~400 columns.

Could it be a latency spike to the DB?

mitchellhenke commented 6 years ago

@samphilipd @fishcakez Thanks for the heads up 🙂

I put a PR together to improve this in Sentry: https://github.com/getsentry/sentry-elixir/pull/238

fishcakez commented 6 years ago

Could it be a latency spike to the DB?

It is hard to speculate why :(