elixir-ecto / postgrex

PostgreSQL driver for Elixir
http://hexdocs.pm/postgrex/

weird state in TypeManager #362

Closed sasa1977 closed 6 years ago

sasa1977 commented 6 years ago

Postgrex version 0.13.3

I've encountered a weird state in Postgrex.TypeManager which I can't currently explain. I'm going to dump what I've gathered so far, so maybe someone else has an idea.

The manifestation of the problem was that a process trying to establish a connection with sync_connect: true would hang forever. Basically, Postgrex.start_link would never return. We had some processes stuck in this state for a couple of days.
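For illustration, the hanging calls were of roughly this shape (the connection parameters here are made up):

# With sync_connect: true, start_link blocks until the handshake completes,
# so a call like this one never returned while the problem was active.
{:ok, conn} =
  Postgrex.start_link(
    hostname: "db.example.com",
    username: "user",
    password: "secret",
    database: "some_db",
    sync_connect: true
  )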

After a bit of tracing, I've discovered the following stack trace in the connection process:

[{:gen, :do_call, 4, [file: 'gen.erl', line: 169]},
 {GenServer, :call, 3, [file: 'lib/gen_server.ex', line: 771]},
 {Postgrex.Protocol, :bootstrap, 3,
  [file: 'lib/postgrex/protocol.ex', line: 621]},
 {Postgrex.Protocol, :handshake, 2,
  [file: 'lib/postgrex/protocol.ex', line: 475]},
 {DBConnection.Connection, :connect, 2,
  [file: 'lib/db_connection/connection.ex', line: 134]},
 {Connection, :init_it, 6, [file: 'lib/connection.ex', line: 424]},
 {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 247]}]

Poking around the code a bit, I concluded that the process ends up in an infinite loop due to this condition. I quickly traced the process and confirmed that this was indeed the case.

Digging deeper, I observed that Postgrex.TypeManager.get returned the pid of a dead process. It always returned the same pid.

Finally, I invoked :sys.get_state on the Postgrex.TypeManager process and saw that the pid of the dead process was indeed cached in the manager's state.

Looking at the code, I couldn't figure out how that happened. I tried reproducing it locally by inserting code that asynchronously kills the type server after a brief delay, but the type manager handled that case correctly.

The final thing I did was to manually send the type manager the :DOWN monitor message for the dead type server, to indicate that the process had indeed died. This was properly handled by the type manager, and at that point all the queries were unblocked.
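Below is a sketch of how such a :DOWN message can be hand-delivered (reconstructed for illustration, assuming the {keys, mons} state shape visible in the state dumps further down in this thread; the exact commands may have differed):

{keys, mons} = :sys.get_state(Postgrex.TypeManager)

# Pick a cached type server pid that is no longer alive.
{stale_key, stale_pid} = Enum.find(keys, fn {_key, pid} -> not Process.alive?(pid) end)

# Look up the monitor reference recorded for that key.
{stale_ref, _key} = Enum.find(mons, fn {_ref, key} -> key == stale_key end)

# Deliver the message the monitor would normally have delivered
# (the exit reason used here is arbitrary).
send(Postgrex.TypeManager, {:DOWN, stale_ref, :process, stale_pid, :killed})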

Looking at the code, I currently don't have any explanation for this. Unfortunately, I also can't reproduce it anymore. Once I manually simulated the monitor message, everything went back to normal.

Does anyone else have a clue here?

EDIT: I just tried the following on the same BEAM instance, and it appears that all cached type servers are dead, except for the one which I restarted by manually sending the monitor message:


:sys.get_state(Postgrex.TypeManager) |> elem(0) |> Enum.map(fn {_, pid} -> pid end) |> Enum.reject(&Process.alive?/1)

[#PID<0.7534.164>, #PID<0.18393.164>, #PID<0.6709.162>, #PID<0.6742.162>,
 #PID<0.7630.164>, #PID<0.8526.163>, #PID<0.29855.163>, #PID<0.8600.163>,
 #PID<0.12130.162>, #PID<0.6669.162>, #PID<0.12062.162>, #PID<0.29869.163>,
 #PID<0.29820.163>, #PID<0.18391.164>, #PID<0.32756.170>, #PID<0.6731.162>,
 #PID<0.30772.162>, #PID<0.30767.162>]
josevalim commented 6 years ago

@sasa1977 did this happen again by any chance? I have looked at the code for the last 10 minutes and I have no idea how this would even be possible. We monitor the PIDs, so we should definitely receive the monitor down message. Under no circumstances should we end up with dead PIDs around.

In case this happens again, could you also get the whole TypeManager state for us to look at it?

sasa1977 commented 6 years ago

This still happens occasionally. I've just checked our logs for the past week, and I see it happened a couple of times on April 5th, but it didn't happen on any other day.

We use the following hack to detect the situation:

{keys, _} = :sys.get_state(Postgrex.TypeManager)
type_managers = Enum.map(keys, fn {_, pid} -> pid end)
dead_type_managers = Enum.reject(type_managers, &Process.alive?/1)
not Enum.empty?(dead_type_managers) # if true, something is off

This check is repeated every 10 seconds. If six consecutive checks fail, we log the error and restart the type manager. That's quite ugly but it keeps us afloat.
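A minimal sketch of such a watchdog (hypothetical module name, not the exact production code) could look like this:

defmodule TypeManagerWatchdog do
  use GenServer

  @interval :timer.seconds(10)
  @max_failures 6

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, 0, opts)

  def init(failures) do
    Process.send_after(self(), :check, @interval)
    {:ok, failures}
  end

  def handle_info(:check, failures) do
    Process.send_after(self(), :check, @interval)

    {keys, _mons} = :sys.get_state(Postgrex.TypeManager)

    cond do
      not Enum.any?(keys, fn {_key, pid} -> not Process.alive?(pid) end) ->
        # everything looks healthy, so reset the failure counter
        {:noreply, 0}

      failures + 1 < @max_failures ->
        {:noreply, failures + 1}

      true ->
        # six consecutive failed checks: kill the manager so its supervisor
        # restarts it with a clean state
        Process.exit(Process.whereis(Postgrex.TypeManager), :kill)
        {:noreply, 0}
    end
  end
end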

I should note that we have a less common usage pattern. Instead of using a pool of connections, we establish a one-off connection, do our stuff, and then drop it. This issue happens on a system which is used more frequently by external users (I don't have the numbers), so there's a bunch of connecting/disconnecting. On the other hand, I've never seen this issue on our internal staging systems, which also run for a long time, but are rarely used. So I wonder if frequent usage is one factor which leads to this issue.

Looking at the code of TypeManager, I'm also unsure how this can happen. The only scenario I can think of is a monitor reference getting recycled (which can happen, with low probability). In that case the new entry in the mons map would point to a different key, so we could end up with a stale entry in the keys map. I admit this seems very far-fetched, but I can't think of any other explanation at the moment.

I'll try to expand our code that periodically issues :sys.get_state to the type manager so that it dumps more information on failure. I'll also try to keep a small moving window of distinct past states and dump that too. Perhaps that will give us more insight. Since the error happens fairly infrequently, it could take days (or maybe even a few weeks) before I have something.
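Keeping such a window can be as simple as the following sketch (the window size here is arbitrary):

record_state = fn state, window ->
  case window do
    # same as the most recently recorded state: keep the window unchanged
    [^state | _] -> window
    # otherwise prepend it and keep only the last few distinct states
    _ -> Enum.take([state | window], 5)
  end
end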

sasa1977 commented 6 years ago

OK, so I was able to grab the state of a couple of failures, but I'm not much wiser.

Here's an example state at the time of failure:

{
  %{
    {Postgrex.DefaultTypes, {'airdb.mpi-sws.org', 5432, "taxi_link_pub"}} => #PID<0.8328.96>,
    {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "census0_link"}} => #PID<0.12076.97>},
  %{
    #Reference<0.227081636.1990721540.15677> => {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "census0_link"}},
    #Reference<0.227081637.378011649.68057> => {Postgrex.DefaultTypes, {'airdb.mpi-sws.org', 5432, "taxi_link_pub"}}
  }
}

The problematic entry is census0_link with pid #PID<0.12076.97>. Unless something is wrong with my detection code (included in my previous comment), that process has been dead for at least the past minute.

I'm also keeping a list of previous distinct states. Here's the first previous state:

{
  %{
    {Postgrex.DefaultTypes, {'139.19.208.225', 5432, "cityemployeepayroll"}} => #PID<0.12073.97>,
    {Postgrex.DefaultTypes, {'airdb.mpi-sws.org', 5432, "taxi_link_pub"}} => #PID<0.8328.96>,
    {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "banking"}} => #PID<0.12119.97>,
    {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "banking_link"}} => #PID<0.12081.97>,
    {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "census0"}} => #PID<0.12092.97>,
    {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "census0_link"}} => #PID<0.12076.97>,
    {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "census1"}} => #PID<0.12093.97>,
    {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "census1_link"}} => #PID<0.12098.97>,
    {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "mixtest"}} => #PID<0.12075.97>,
    {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "scihub"}} => #PID<0.12108.97>,
    {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "scihub_link"}} => #PID<0.12058.97>,
    {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "taxi"}} => #PID<0.12067.97>,
    {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "taxi_link"}} => #PID<0.12047.97>,
    {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "testtable"}} => #PID<0.12089.97>},
  %{
    #Reference<0.227081636.1961361413.224152> => {Postgrex.DefaultTypes, {'139.19.208.225', 5432, "cityemployeepayroll"}},
    #Reference<0.227081636.2513174531.136414> => {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "census1"}},
    #Reference<0.227081636.4233625602.25110> => {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "mixtest"}},
    #Reference<0.227081636.1990721540.15677> => {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "census0_link"}},
    #Reference<0.227081636.2513174531.135933> => {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "taxi"}},
    #Reference<0.227081636.2513174531.135989> => {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "testtable"}},
    #Reference<0.227081636.2513174531.136216> => {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "scihub"}},
    #Reference<0.227081636.4233625602.24816> => {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "taxi_link"}},
    #Reference<0.227081637.378011649.68057> => {Postgrex.DefaultTypes, {'airdb.mpi-sws.org', 5432, "taxi_link_pub"}},
    #Reference<0.227081637.410517505.95023> => {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "banking"}},
    #Reference<0.227081637.410517505.95067> => {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "census1_link"}},
    #Reference<0.227081637.410517505.95169> => {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "census0"}},
    #Reference<0.227081637.410517505.95556> => {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "banking_link"}},
    #Reference<0.227081637.410517505.95727> => {Postgrex.DefaultTypes, {'srv-76-132.mpi-sws.org', 5432, "scihub_link"}}
  }
}

and the one before that:

{
  %{{Postgrex.DefaultTypes, {'airdb.mpi-sws.org', 5432, "taxi_link_pub"}} => #PID<0.8328.96>},
  %{#Reference<0.227081637.378011649.68057> => {Postgrex.DefaultTypes, {'airdb.mpi-sws.org', 5432, "taxi_link_pub"}}}
}

The way I read this flow is that at some point a bunch of connections to different databases were opened, some queries were executed, and then those connections were closed. However, for some reason the :DOWN message for the census0_link connection was never handled.

I think that the log definitely refutes my hypothesis about recycled monitor refs.

Some other causes which come to mind are:

- The type manager is overloaded and so the :DOWN message is in the queue, but not yet processed

All of these hypotheses seem very unlikely to me, but I don't have any other possible explanation.

I'm thinking about forking postgrex and doing an explicit Process.alive? check directly in the type manager process, followed by removal of the dead entry and a retry. This should definitely solve the problem, although I'm not comfortable with the fact that the issue is not properly understood.
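The idea, sketched on the {keys, mons} state shape from the dumps above (illustrative only, not the real Postgrex.TypeManager code), would be something like:

defmodule TypeManagerPruneSketch do
  # Drop cached entries whose type server pid is no longer alive, and flush
  # the corresponding monitors so any late :DOWN message is discarded.
  def prune_dead({keys, mons}) do
    {dead, alive} = Enum.split_with(keys, fn {_key, pid} -> not Process.alive?(pid) end)

    dead_keys = MapSet.new(dead, fn {key, _pid} -> key end)

    mons =
      mons
      |> Enum.reject(fn {ref, key} ->
        stale? = MapSet.member?(dead_keys, key)
        if stale?, do: Process.demonitor(ref, [:flush])
        stale?
      end)
      |> Map.new()

    {Map.new(alive), mons}
  end
end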

josevalim commented 6 years ago

The type manager is overloaded and so the :DOWN message is in the queue, but not yet processed

This could actually be possible. From what I understand, you are spawning multiple connections, which requires the bootstrap to happen multiple times. If you are not serializing the spawning of those connections, you can end up with multiple requests queued on the server. Maybe it is worth tracking the message queue length?
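For example, the queue length can be sampled with a plain Process.info/2 call:

Process.whereis(Postgrex.TypeManager) |> Process.info(:message_queue_len)
#=> {:message_queue_len, 0}   # a persistently large number would indicate a backlog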

Another option is to migrate the TypeManager to Elixir v1.4's Registry, which is decentralized and would not suffer from such issues.
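For context, the Registry removes an entry automatically when the process that registered it exits, so lookups never return dead pids. A minimal illustration with made-up names (shown with the keyword form of Registry.start_link/1 from later Elixir versions; this is not Postgrex's actual code):

{:ok, _} = Registry.start_link(keys: :unique, name: MyTypeRegistry)

# the registering (calling) process becomes the owner of the key
{:ok, _owner} = Registry.register(MyTypeRegistry, {SomeTypes, {'localhost', 5432, "mydb"}}, nil)

# lookups only ever return live registrations; the entry disappears as soon
# as the owning process exits
Registry.lookup(MyTypeRegistry, {SomeTypes, {'localhost', 5432, "mydb"}})
#=> [{#PID<0.123.0>, nil}]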

sasa1977 commented 6 years ago

From what I understand, you are spawning multiple connections, which requires the bootstrap to happen multiple times.

Yes, each connection is spawned on an end-user's request. We don't have a lot of users, but they run attack scripts, so we end up with a lot of frequent connecting/disconnecting.

However, in the problematic situations we end up with at least a minute of inconsistent state, which seems like a lot. Moreover, when I was originally debugging the problem, it was a query that had been stuck for a couple of days, and there was no activity on the server. So I'm not sure that type manager overload is the only problem here.

Regardless of that, I agree it's possible that the type manager keeps a stale pid around for a brief time, depending on the load. From what I've gathered, that situation is handled in the client, which retries if the type server is dead. Using a Registry, or detecting dead processes in the type manager itself, seems like a better solution, because neither approach requires repeated calls to the type manager.

josevalim commented 6 years ago

@sasa1977 a PR was sent in #381. Unfortunately we cannot merge it into v0.13, since v0.13 still supports Elixir v1.3, but you can backport it to v0.13 yourself almost cleanly. There is a conflict, but only because we added a newline around the code in a previous commit.

In case your server is indeed falling behind, the PR above should also be beneficial for performance, as the registry is decentralized.

sasa1977 commented 6 years ago

I'll try to cherry-pick it, maybe tomorrow, if I manage to find some time.