elixir-ecto / db_connection

Database connection behaviour
http://hexdocs.pm/db_connection/DBConnection.html
306 stars 113 forks source link

Improve context for timeout errors logged #195

Closed tpitale closed 5 years ago

tpitale commented 5 years ago

We're seeing a randomly occurring error (does not correlate to slow queries or increased load on our database server):

Postgrex.Protocol (#PID<0.393.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.23831.49> timed out because it queued and checked out the connection for longer than 30000ms

Which I believe comes from ConnectionPool's timeout handling. Unfortunately, there is no record of this in our slow query logs.

I'm not sure which direction to take when debugging further. If possible, could more information be added that might help direct us to the cause of this timeout? Perhaps some sort of backtrace or more information about the processes with these PIDs?

Is it possible that this log is a mistake? I see there is a check for the current holder:

Check that timeout refers to current holder (and not previous)

josevalim commented 5 years ago

Is there a stacktrace for the error? Also, can you please include your DBConnection version?

tpitale commented 5 years ago

Sorry! db_connection version 2.0.5.

No stacktrace has been logged. I think it's coming from: https://github.com/elixir-ecto/db_connection/blob/master/lib/db_connection/connection_pool.ex#L90

tpitale commented 5 years ago

Which appears to maybe be logged here: https://github.com/elixir-ecto/db_connection/blob/master/lib/db_connection/connection.ex#L114

tpitale commented 5 years ago

I'll try to wait for the next error to happen and see if Process.info() gives me more info on the PID, if it's alive.

josevalim commented 5 years ago

@tpitale yup, adding a Process.info is exactly what I had in mind. But we would do it inside DBConnection. Something like this:

  message = "client #{inspect pid} timed out because it checked out " <>
        "the connection for longer than #{timeout}ms"

  case Process.info(pid, :current_stacktrace) do
    {:current_stacktrace, stacktrace} ->
      message <> "\n\n#{inspect pid} was at location:\n\n" <>
        Exception.format_stacktrace(stacktrace)
    _ ->
      message
  end

Can you please send a PR?

tpitale commented 5 years ago

@josevalim Would you like something similar in the timeout behavior in Connection, too? Or should I do that as its own PR?

josevalim commented 5 years ago

@tpitale I think doing so in the Connection module would only point to DBConnection internals? So I am not sure if that is useful. So let's go in steps. Does the one proposed above provide enough information to fix the issue? :)

josevalim commented 5 years ago

Oh, I misunderstood your question. Yes, both places in the PR look good! :+1: