elixir-ecto / db_connection

Database connection behaviour
http://hexdocs.pm/db_connection/DBConnection.html

Failed to connect (DBConnection.ConnectionError) tcp send: closed #199

Closed marcelolebre closed 5 years ago

marcelolebre commented 5 years ago

I'm currently running some stress tests for a production migration and have been tinkering with DB configs for throughput optimisation.

This is the result of a mix task in a Phoenix app.

With the current configs, several errors are displayed:

...

17:11:17.680 [error] Postgrex.Protocol (#PID<0.658.0>) failed to connect: ** (DBConnection.ConnectionError) tcp send: closed

17:11:17.680 [error] Postgrex.Protocol (#PID<0.654.0>) failed to connect: ** (DBConnection.ConnectionError) tcp send: closed

17:11:17.680 [error] Postgrex.Protocol (#PID<0.642.0>) failed to connect: ** (DBConnection.ConnectionError) tcp send: closed

17:11:17.680 [error] Postgrex.Protocol (#PID<0.532.0>) failed to connect: ** (DBConnection.ConnectionError) tcp send: closed

17:11:17.680 [error] Postgrex.Protocol (#PID<0.636.0>) failed to connect: ** (DBConnection.ConnectionError) tcp send: closed

17:11:17.680 [error] Postgrex.Protocol (#PID<0.675.0>) failed to connect: ** (DBConnection.ConnectionError) tcp send: closed

... 

A few seconds later the other connections start to pick up and work proceeds as expected, but for some reason several dozen of these error messages show up when the task is started.

The test runs for half an hour and these errors only show up at the beginning; no errors appear in-flight, and both the test and the migration complete successfully.

Here are my current configs:

elixir -v

Erlang/OTP 22 [erts-10.4.1] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:1] [hipe] [dtrace]

Elixir 1.8.2 (compiled with Erlang/OTP 21)

Dependencies

:db_connection, "2.0.6"
:ecto_sql, "3.0.5"
:ecto, "3.0.7"
:postgrex, "0.14.1"
:phoenix, "1.4.2"
:phoenix_ecto, "4.0.0"
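For reference, a sketch of how those versions would typically be expressed in mix.exs (the actual deps block is not shown in the report, so this is an assumption):

defp deps do
  [
    {:phoenix, "~> 1.4.2"},
    {:phoenix_ecto, "~> 4.0.0"},
    {:ecto_sql, "~> 3.0.5"},
    {:postgrex, "~> 0.14.1"}
    # :ecto 3.0.7 and :db_connection 2.0.6 resolve as transitive
    # dependencies of :ecto_sql and :postgrex
  ]
end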

Phoenix configs (I purposefully increased queue times):

I noticed that if I reduce the pool_size to 100, no errors are displayed.

# Configure your database
config :gaia, Gaia.Repo,
  username: "postgres",
  password: "postgres",
  database: "xxx",
  pool_size: 200,
  # checkout queue thresholds in ms (DBConnection defaults are 50 and 1_000);
  # purposefully raised to tolerate longer waits for a connection under load
  queue_target: 2_000,
  queue_interval: 30_000
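For comparison, a minimal sketch of the pool_size: 100 variant mentioned above, assuming it is applied as an override of the config shown here (later config entries for the same key are merged):

# Override only the pool size; the other Repo settings stay as above
config :gaia, Gaia.Repo,
  pool_size: 100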

Postgres configs

version: 11.3

max_connections = 1000
shared_buffers = 3072MB
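As a side note, a minimal sketch (assuming the app's Gaia.Repo is started, e.g. from iex -S mix) of how one could compare the number of open backends against max_connections while the task runs:

# Hypothetical diagnostic: ask Postgres how many backends are open,
# using the Ecto SQL adapter's raw query helper
{:ok, result} =
  Ecto.Adapters.SQL.query(Gaia.Repo, "SELECT count(*) FROM pg_stat_activity", [])

[[open_backends]] = result.rows
IO.puts("open backends: #{open_backends} / max_connections: 1000")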

Latest log entry in Postgres (no relevant insights found):

2019-06-21 17:11:08.421 WEST [61522] LOG:  database system is ready to accept connections

josevalim commented 5 years ago

My guess would be that one of the sides is not able to open that many connections that fast, or that something is blocking such a high number of connections over a short amount of time. We would need a way to reproduce this locally to be sure.

Also, keep in mind that having too many connections like that usually ends up with worse performance, because it means less caching and resource sharing while using more CPU and memory. Unless you are using a machine with a couple dozen cores, that amount of connections seems excessive.
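For illustration only (this is not a config from the thread), a pool sized relative to the core count rather than a fixed 200 could look like:

# Illustrative sketch: tie the pool to the number of BEAM schedulers
# (usually the core count); the multiplier is workload-dependent.
# Note: this is evaluated when the config file is loaded.
config :gaia, Gaia.Repo,
  pool_size: System.schedulers_online() * 2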

marcelolebre commented 5 years ago

3.6 GHz Intel Core i9, 8 cores, 32 GB RAM, macOS

That's why I increased shared_buffers, though I didn't change kernel max segment size.

I agree, it does feel more like a hardware-level issue. CPU and memory usage are OK during the test; no starvation there.

I also agree on the performance deterioration part, although it's a diminishing-returns kind of scenario: it will either show up as resource starvation or as the sheer concurrency overhead of establishing the initial connections - I'd bet this last part is the real issue here.

josevalim commented 5 years ago

@marcelolebre, I will go ahead and close this issue, as I don't believe there is anything we can act on here: the DB is the one closing connections in this case. If you know of a reason why it is actually us forcing the DB to close them, please let us know!