@mspanc are you starting your app in production and do you have something like start_permanent: Mix.env == :prod
in your mix.exs file?
@josevalim yep,
def project do
  [app: :agenda,
   version: hd(String.split(File.read!("VERSION"), "\n", parts: 2)),
   elixir: "~> 1.2.0",
   elixirc_paths: elixirc_paths(Mix.env),
   compilers: [:phoenix] ++ Mix.compilers,
   build_embedded: Mix.env == :prod or Mix.env == :stag,
   start_permanent: Mix.env == :prod or Mix.env == :stag,
   deps: deps,
   preferred_cli_env: [espec: :test]]
end
@mspanc checking the logs, it seems that at the end it could not connect to the database due to a connection limit. On later requests, did you still see that message happening over and over again? In such cases, I would expect one of two things: 1. the whole app to crash, or 2. for it to keep trying over and over again.
Meanwhile, one of the reasons you may not be getting new connections could be the pool size. Make sure you update your deploy to follow the latest Heroku instructions here: http://www.phoenixframework.org/docs/heroku (check for POOL_SIZE).
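For reference, the relevant part of that guide boils down to reading the pool size from an environment variable in the production repo configuration; a minimal sketch, assuming the repo module is Agenda.Repo:

# config/prod.exs -- sketch assuming the repo module is Agenda.Repo
config :agenda, Agenda.Repo,
  adapter: Ecto.Adapters.Postgres,
  url: System.get_env("DATABASE_URL"),
  pool_size: String.to_integer(System.get_env("POOL_SIZE") || "10"),
  ssl: true

The POOL_SIZE config var can then be set a few connections below the Heroku plan's limit, leaving headroom for one-off tasks such as migrations.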
@mspanc btw, I have just tried to reproduce this issue locally to see if master could still present such an issue and I could not reproduce it. :) Once I shut the database down, I get the connection errors over and over again.
Before I forget, thanks for the report and the logs!
@josevalim I know it hit the DB pool limit (which is, by the way, lower than what Heroku allows, and I have seen many times that Ecto eats more connections than the amount set in the config), but in such a case I would also expect it to retry; instead it just stopped.
You've mentioned future requests that may trigger this. Actually, the app is supposed to do a lot of stuff in the background in a worker process, so there is no request to trigger that. Any suggestion on how to do this on purpose?
Marcin, you don't need a request to trigger the pool. It should indeed just continue to try to connect over and over again.
Maybe the pool connected back but your workers never started again? Could it be that your own workers were not restarted after failing? How are you supervising them? What is the strategy?
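For example, with a plain one_for_one strategy a crashed worker is restarted on its own, but if it keeps crashing and exceeds the restart intensity the supervisor itself exits and takes the workers down with it. A minimal sketch using the Supervisor.Spec helpers from that Elixir version; Agenda.WorkerSupervisor and Agenda.Worker are placeholder names, not taken from the app:

defmodule Agenda.WorkerSupervisor do
  use Supervisor

  def start_link do
    Supervisor.start_link(__MODULE__, [], name: __MODULE__)
  end

  def init([]) do
    children = [
      # hypothetical background worker doing the periodic database work
      worker(Agenda.Worker, [])
    ]

    # :one_for_one restarts only the crashed child; if it crashes more than
    # max_restarts times within max_seconds, the supervisor itself gives up
    # and the failure propagates up the supervision tree
    supervise(children, strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
  end
end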
Also, you could add a health check endpoint that does a simple database query and that you can request in such cases.
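A minimal sketch of such an endpoint, assuming a Phoenix controller; Agenda.HealthController, Agenda.Repo and the route are assumptions, and depending on the Ecto version a checkout failure may raise rather than return an error tuple, which would surface as a 500 instead of a 503:

defmodule Agenda.HealthController do
  use Phoenix.Controller
  import Plug.Conn

  # Hypothetical /health action: a trivial query that only succeeds when a
  # connection can be checked out of the pool and the database answers.
  def index(conn, _params) do
    case Ecto.Adapters.SQL.query(Agenda.Repo, "SELECT 1", []) do
      {:ok, _result}   -> send_resp(conn, 200, "ok")
      {:error, _error} -> send_resp(conn, 503, "database unavailable")
    end
  end
end

Wiring it up would just need a route such as get "/health", Agenda.HealthController, :index; requesting it when the workers go quiet tells you whether the pool can still check out connections.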
I'll check that out.
Postgrex will do a naive health check every ~15 seconds in DBConnection 0.2.5. In latest I think it is every 1 second; perhaps that is a little too frequent. You can see these health checks fail a couple of times in the logs, as this is the only time we can get an async_recv error:
206 <190>1 2016-06-20T02:22:10.255375+00:00 app web.1 - - 02:22:10.255 [error] Postgrex.Protocol (#PID<0.389.0>) disconnected: ** (Postgrex.Error) ssl async_recv: timeout
A query isn't carried out, but Postgrex sends "sync" and PostgreSQL sends back "ready" along with its status.
DBConnection 0.2.5 will retry connecting at random intervals between 200ms and 15 seconds. The min and max are slightly higher in latest because those values resulted in too many failed attempts. Every disconnect and failed connect is logged, so if you aren't getting log messages then you should be connected to the database.
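If the defaults do not suit the app, the backoff behaviour can in principle be tuned through the repo configuration; a hypothetical sketch, assuming the :backoff_type, :backoff_min and :backoff_max options documented in later db_connection releases are honoured by the version in use:

# config/prod.exs -- hypothetical backoff tuning; these options are passed
# through the repo configuration down to the connection processes
config :agenda, Agenda.Repo,
  backoff_type: :rand_exp, # randomised exponential backoff between attempts
  backoff_min: 1_000,      # never retry faster than once per second
  backoff_max: 30_000      # cap the interval between retries at 30 seconds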
@fishcakez ok then I will investigate our strategies.
@mspanc please let us know if there is any other input you need from our end or if you find new information. :)
I am using db_connection 0.2.5 with postgrex 0.11.1 and elixir 1.2.5. I am aware that there's a newer version of db_connection available but the issue is so significant that I have decided to report it anyway.
Today our app hosted on Heroku silently stopped working. It apparently happened after PostgreSQL on Heroku was temporarily unavailable. After a certain number of retries it just reported "Postgrex.Protocol (#PID<0.428.0>) disconnected: ** (DBConnection.Error) client down".
It is a rather critical issue: because the app does not crash completely, Heroku does not restart it, so it continues to operate with no notification about the problem other than the logs.
The log file is attached. radiokit-agenda-prod_2016-06-20_022140_2016-06-20_022240.log.gz