elixir-ecto / db_connection

Database connection behaviour
http://hexdocs.pm/db_connection/DBConnection.html

Timeouts on Poolboy.checkout #127

Closed: Coffei closed this issue 6 years ago

Coffei commented 6 years ago

Hi, I have recently started seeing the error below in my app:

14:39:19.261 [error] GenServer AssetMap.Storage.Stats terminating
** (stop) exited in: :gen_server.call(#PID<0.6836.0>, {:checkout, #Reference<0.3038416112.2899312641.205728>, true, 900000}, 5000)
    ** (EXIT) time out
    (db_connection) lib/db_connection/poolboy.ex:112: DBConnection.Poolboy.checkout/3
    (db_connection) lib/db_connection.ex:928: DBConnection.checkout/2
    (db_connection) lib/db_connection.ex:750: DBConnection.run/3
    (db_connection) lib/db_connection.ex:592: DBConnection.prepare_execute/4
    (ecto) lib/ecto/adapters/postgres/connection.ex:86: Ecto.Adapters.Postgres.Connection.execute/4
    (ecto) lib/ecto/adapters/sql.ex:256: Ecto.Adapters.SQL.sql_call/6
    (stdlib) timer.erl:166: :timer.tc/1
    (utils) lib/stats.ex:32: Utils.Stats.timefun/2
Last message: :report_stats

It started appearing after I upgraded from Fedora 27 to 28; before the upgrade I never saw such an error. The code and dependencies are the same, and I'm still using roughly the same Erlang and Elixir versions, now Erlang 20.3 and Elixir 1.6.5. I also tried Elixir 1.6.0 and a couple of versions in between, and they all behave the same. My guess is that it's something in the system that changed versions.

The error occurs quite randomly: some DB queries go through while some time out. It seems to affect my stats GenServer the most; it runs in the background and executes COUNT(*) on some of my tables every 5 s, and roughly 1-2 in every 10 requests are affected. Note the queries are not intensive in any way; when executed directly they return in under 100 ms.
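
For context, such a stats GenServer would look roughly like the hypothetical sketch below (MyApp.Stats and MyApp.Repo are placeholder names, not the actual code from the app); the :report_stats message is the "Last message" shown in the error above.

defmodule MyApp.Stats do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  def init(:ok) do
    schedule()
    {:ok, %{}}
  end

  # Every 5 s run a COUNT(*) query, time it, and reschedule the next report.
  def handle_info(:report_stats, state) do
    {micros, _result} =
      :timer.tc(fn ->
        MyApp.Repo.query("SELECT count(*) FROM asset", [])
      end)

    IO.puts("asset count took #{div(micros, 1000)}ms")
    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :report_stats, 5_000)
end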

The database is Postgres running in a Docker container. Although I created new containers, the image itself hasn't changed, so I assume the database side of things remained intact.

My colleagues are not experiencing the issue, although none of them run Fedora 28. The app is unfortunately not open source.

I am not quite sure where to start debugging this. Could you provide some pointers? Debug logs provide no further information.

josevalim commented 6 years ago

Do you have queries or long transactions running that may be blocking the connection pool? What is likely blocking the pool is not the query causing this particular error, but rather the other queries in your application.

Things you could do to improve the situation (a config sketch follows the list):

  1. Increase the pool_size
  2. Change to DBConnection.Sojourn pool
  3. Maybe increase the timeout?
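
A minimal sketch of where suggestions 1 and 3 would go, assuming an Ecto 2.x repo config (MyApp.Repo and the values are placeholders, not recommendations):

config :my_app, MyApp.Repo,
  adapter: Ecto.Adapters.Postgres,
  pool: DBConnection.Poolboy,   # default pool; suggestion 2 swaps this for DBConnection.Sojourn
  pool_size: 20,                # suggestion 1: more connections in the pool
  timeout: 30_000,              # suggestion 3: per-query timeout in ms (default 15_000)
  pool_timeout: 5_000           # how long to wait for a connection from the pool (default)
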
Coffei commented 6 years ago

No long transactions; in fact, the COUNT(*) queries are the only queries being executed, as the issue occurs even when the system is idle and doing nothing else. Is it somehow possible to see what is holding the connections?

I'll try increasing the pool size and/or the timeout.

josevalim commented 6 years ago

If nothing else is happening in the app, then it is not going to be the pool size or timeout. Did you try using an OTP version other than 20.3?

Coffei commented 6 years ago

No, I'm using the one bundled with Fedora. I can try 20.0; I was definitely using a version >= 20.0 before, when it worked fine, but I can try 19.x too.

Coffei commented 6 years ago

OK, it seems to be working fine with OTP 20.0. I'll be testing some more until tomorrow just to be sure. Let me know whether I should close this issue if the timeout doesn't appear on 20.0, or whether you'd rather keep it open or have me provide more info.

michalmuskala commented 6 years ago

I've seen some similar reports about OTP 20.3 as well. I'm not really sure what could be causing this.

Coffei commented 6 years ago

Hmm... I can still see the error with 20.0 occasionally. It's definitely much rarer, but it's there.

michalmuskala commented 6 years ago

Do you have any metrics measuring the queue and query time from the Ecto logger? This could help verify whether it's something weird or just connections getting tied up somewhere.
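
A sketch of one way to capture those numbers, assuming Ecto 2.x's :loggers repo option and the Ecto.LogEntry struct (the MyApp names are placeholders):

# A custom logger in the :loggers list is called as {module, function, args}
# with the Ecto.LogEntry prepended; queue_time/query_time are in native time
# units and may be nil. Returning the entry keeps it compatible with chaining.
defmodule MyApp.QueryMetrics do
  require Logger

  def log(entry) do
    Logger.info("db query queue=#{to_ms(entry.queue_time)}ms query=#{to_ms(entry.query_time)}ms")
    entry
  end

  defp to_ms(nil), do: 0
  defp to_ms(native), do: System.convert_time_unit(native, :native, :millisecond)
end

# config/config.exs
config :my_app, MyApp.Repo,
  loggers: [{Ecto.LogEntry, :log, []}, {MyApp.QueryMetrics, :log, []}]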

fishcakez commented 6 years ago

The timeout is occurring against the connection process, not against poolboy, so nothing is blocking the pool. That means queue and query times likely look good (apart from the timeout failures): the pool has a free connection process, but that process is not responding within 5 seconds. There is also a 900-second timeout in there, which means the pool will let the calling process use the connection for 900 seconds! That seems a little high. Perhaps increasing the pool_timeout, which is currently 5000 ms by default, might help in this situation; the pool_timeout is how long to wait for a connection.

However, if the connection process cannot return a socket within 5 seconds, it suggests something else is not right. My assumption would be that either there is a bug in the connection logic (race condition?) or the database/network is down.
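
To make the two knobs concrete, this is roughly where they are passed per call in the Ecto of that era (MyApp.Repo and the values are placeholders):

# :timeout      - how long the caller may hold the checked-out connection (the 900_000 above)
# :pool_timeout - how long to wait for a connection from the pool (the 5_000 above)
MyApp.Repo.query("SELECT count(*) FROM asset", [], timeout: 60_000, pool_timeout: 15_000)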

fishcakez commented 6 years ago

It seems to affect 1-2 in every 10 requests

If this is the case, then it's possible that a single connection process is in a bad state. The default pool size is 10 and we use a FIFO queue, so if there is little or no concurrent pool access, a single connection process would get used every 10 requests. If you hit this bug again, you could call :sys.get_state(#PID<0.6836.0>) on the connection so we can see its state. The pid is the one from the error output; you can use pid(0,6836,0) to create it. Note that the state WILL contain sensitive information (password), so make sure to filter that out. If in doubt, email the output to someone in the thread.
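
In an IEx session attached to the node, that would be roughly:

# Recreate the pid from the error output, then dump the connection's state.
# Remember to filter sensitive fields (the password) before sharing the output.
pid = pid(0, 6836, 0)
:sys.get_state(pid)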

Coffei commented 6 years ago

@michalmuskala The only queries are simple count(*) queries; nothing else should be running. Below is a part of the debug log showing the times.

08:48:55.204 [debug] QUERY OK db=64.0ms
SELECT count(*) FROM asset []

08:48:55.272 [debug] QUERY OK db=68.0ms queue=0.1ms
SELECT count(*) FROM edge []

08:49:25.376 [debug] QUERY OK db=102.8ms queue=0.2ms
SELECT count(*) FROM asset []

08:49:25.444 [debug] QUERY OK db=67.3ms queue=0.1ms
SELECT count(*) FROM edge []

@fishcakez It really looks like a broken process; I see it's always the same PID in the error message. I tried getting the state but I get a timeout:

** (exit) exited in: :sys.get_state(#PID<0.859.0>)
    ** (EXIT) time out
    (stdlib) sys.erl:305: :sys.send_system_msg/2
    (stdlib) sys.erl:114: :sys.get_state/1

This is the process info:

[
  current_function: {:prim_inet, :recv0, 3},
  initial_call: {:proc_lib, :init_p, 5},
  status: :waiting,
  message_queue_len: 12,
  messages: [
    {:inet_async, #Port<0.14311>, 475, {:ok, <<90, 0, 0, 0, 5, 73>>}},
    {:"$gen_call", {#PID<0.878.0>, #Reference<0.3722681502.2001731585.239489>},
     {:checkout, #Reference<0.3722681502.2001731585.239488>, true, 900000}},
    {:"$gen_cast", {:cancel, #Reference<0.3722681502.2001731585.239488>}},
    {:"$gen_call", {#PID<0.1193.0>, #Reference<0.3722681502.2001731585.240793>},
     {:checkout, #Reference<0.3722681502.2001731585.240792>, true, 900000}},
    {:"$gen_cast", {:cancel, #Reference<0.3722681502.2001731585.240792>}},
    {:"$gen_call", {#PID<0.1199.0>, #Reference<0.3722681502.2001731585.242087>},
     {:checkout, #Reference<0.3722681502.2001731585.242086>, true, 900000}},
    {:"$gen_cast", {:cancel, #Reference<0.3722681502.2001731585.242086>}},
    {:system, {#PID<0.1184.0>, #Reference<0.3722681502.2001731585.243359>},
     :get_state},
    {:"$gen_call", {#PID<0.1203.0>, #Reference<0.3722681502.2001731585.243465>},
     {:checkout, #Reference<0.3722681502.2001731585.243464>, true, 900000}},
    {:"$gen_cast", {:cancel, #Reference<0.3722681502.2001731585.243464>}},
    {:system, {#PID<0.1184.0>, #Reference<0.3722681502.2001731585.243550>},
     :get_state},
    {:system, {#PID<0.1184.0>, #Reference<0.3722681502.2001731585.243662>},
     :get_state}
  ],
  links: [#PID<0.854.0>, #PID<0.855.0>, #PID<0.869.0>, #Port<0.14311>],
  dictionary: [
    "$ancestors": [#PID<0.855.0>, AssetMap.Ecto.Repo.Pool, AssetMap.Ecto.Repo,
     AssetMap.AppSupervisor, #PID<0.851.0>],
    "$initial_call": {DBConnection.Connection, :init, 1}
  ],
  trap_exit: false,
  error_handler: :error_handler,
  priority: :normal,
  group_leader: #PID<0.850.0>,
  total_heap_size: 53194,
  heap_size: 6772, 
  stack_size: 39,
  reductions: 55781,
  garbage_collection: [
    max_heap_size: %{error_logger: true, kill: true, size: 0},
    min_bin_vheap_size: 46422,
    min_heap_size: 233,
    fullsweep_after: 65535,
    minor_gcs: 5
  ],
  suspending: []
]

michalmuskala commented 6 years ago

Interesting. It looks like the process is blocked in a socket receive for data that never comes... Did you try calling it again after some time to see if this changes, or is the process stuck doing just this? Could you also call Process.info(pid, :current_stacktrace) on the process when this happens? This should show us a bit more about what exactly the process is doing, and maybe also which receive call that is.
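
For example, against the pid from the earlier output:

# In IEx: recreate the stuck connection's pid and inspect its current stacktrace.
Process.info(pid(0, 859, 0), :current_stacktrace)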

Coffei commented 6 years ago

I have had the app running for about 30 minutes and the process is still stuck. In fact, another process that didn't seem to be stuck at first got stuck after a while; it's spreading like a disease :D.

The stack trace is the same for both processes

{:current_stacktrace,
 [
   {:prim_inet, :recv0, 3, []},
   {Postgrex.Protocol, :msg_recv, 4,
    [file: 'lib/postgrex/protocol.ex', line: 1956]},
   {Postgrex.Protocol, :ping_recv, 4,
    [file: 'lib/postgrex/protocol.ex', line: 1705]},
   {DBConnection.Connection, :handle_info, 2,
    [file: 'lib/db_connection/connection.ex', line: 373]},
   {Connection, :handle_async, 3, [file: 'lib/connection.ex', line: 810]},
   {:gen_server, :try_dispatch, 4, [file: 'gen_server.erl', line: 616]},
   {:gen_server, :handle_msg, 6, [file: 'gen_server.erl', line: 686]},
   {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 247]}
 ]}

I also checked the connections on the Postgres side and all seem to be idle.

michalmuskala commented 6 years ago

From that, it looks like the database is not replying to ping requests. Did you also change the version of Postgres when you updated your system?

I also don't understand why the timeout is not firing there; we should not be able to get an infinity timeout in there 🧐, so it should eventually either return a response or die.

Coffei commented 6 years ago

Nope, the database definitely didn't change versions, as it's running in Docker and the image is the same. It must be something in the system that changed versions.

Coffei commented 6 years ago

I think I've tried everything I could think of and there's still no progress. The issue seems dormant sometimes and very active a minute later; nothing I tried correlates with it.

I use the following to reproduce the issue

Enum.each(1..40, fn _ -> Task.async(fn -> AssetMap.Ecto.Repo.query("SELECT pg_sleep(1)") end) end)

After a while, at least one connection enters the bad state.

I might try setting up a database locally to remove Docker from the equation, and maybe tweak the pool size and timeouts, but I am afraid that might only conceal the problem.

Can you think of anything else I could try?

josevalim commented 6 years ago

Thanks @Coffei for all the help distilling this. Since you can isolate the issue to that command, do you think you can reproduce it in a smaller application? Maybe even in one of the examples in our examples folder?

Coffei commented 6 years ago

Yes, I'll try to cook up a reproducer. I tried this with a database running on a different system (a colleague's PC) and could reproduce the same thing, so I'm fairly confident this has nothing to do with the database or Docker. Still, it seems to be highly specific to my setup, so I hope the reproducer will be of some use.

Coffei commented 6 years ago

I've put together a small reproducer using only Postgrex and db_connection with poolboy - https://github.com/Coffei/timeout_reproducer.

The error usually appears after a few tries; sometimes I have to re-run the app a few times. The error is very similar:

09:01:32.446 [error] Task #PID<0.283.0> started from #PID<0.276.0> terminating
** (stop) exited in: :gen_server.call(#PID<0.179.0>, {:checkout, #Reference<0.1346686728.617611267.217656>, true, 5000}, 5000)
    ** (EXIT) time out
    (db_connection) lib/db_connection/poolboy.ex:112: DBConnection.Poolboy.checkout/3
    (db_connection) lib/db_connection.ex:928: DBConnection.checkout/2
    (db_connection) lib/db_connection.ex:750: DBConnection.run/3
    (db_connection) lib/db_connection.ex:1141: DBConnection.run_meter/3
    (db_connection) lib/db_connection.ex:592: DBConnection.prepare_execute/4
    (postgrex) lib/postgrex.ex:146: Postgrex.query/4
    (timeout_reproducer) lib/app.ex:28: anonymous fn/1 in TimeoutReproducer.App.test/0
    (elixir) lib/task/supervised.ex:88: Task.Supervised.do_apply/2
Function: #Function<1.76782678/0 in TimeoutReproducer.App.test/0>
    Args: []

I also see the following error, which appears independently of the one above.

09:01:32.437 [error] Postgrex.Protocol (#PID<0.177.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.288.0> timed out because it checked out the connection for longer than 5000ms

Coffei commented 6 years ago

Also recorded an asciinema - https://asciinema.org/a/jWGGvIE12qve7mCTHKqTBgw7I - there the error appears on the first try, so I restart the app and it doesn't come up any more.

fishcakez commented 6 years ago

It looks like the process is blocking in a socket receive for data that never comes

The message the port is expecting is at the head of the message queue, isn't it?

Coffei commented 6 years ago

I found that increasing the pool size actually made things worse, as there was a bigger chance that a process enters the bad state. I increased the pool size in the reproducer to 100, so it should be easier to reproduce.

I've been testing on clean VMs and could reproduce this only on F28; both F27 and Ubuntu 18.04 were working well, so I assume it's something in the system that's interfering. Note it was much rarer in the VMs; in a slower VirtualBox it took about 30 minutes to reproduce, so I'd say it's some race condition that only appears on sufficiently fast setups.

fishcakez commented 6 years ago

@Coffei my last message was suggesting it is a bug in Erlang/OTP. Can you share the exact OTP versions of all those setups?

Coffei commented 6 years ago

On my local box I tried Erlang 20.0 and 20.3 (from source and from the repository); on the VMs it's 20.3 for F28, 19.3 for F27 and 20.2 for the Ubuntu VM. Kinda inconsistent, but we tried in a pair with @lubosvesely, me running F28 and him Ubuntu (18 or 16), with the exact same versions of Erlang and Elixir (down to patch versions), the same database and the same reproducer. And @lubosvesely still wasn't able to reproduce the error. And before I upgraded the PC last week, when I was running F27 with Erlang 20.x and Elixir 1.6.x, I also didn't see the issue.

As far as I see it, the system must play some role in it.

fishcakez commented 6 years ago

@Coffei just to be clear, the output you showed in https://github.com/elixir-ecto/db_connection/issues/127#issuecomment-396833857 should not be possible without a bug in Erlang/OTP. Of course, it is possible that the race condition that causes it is more likely to occur on your machine. The reason for asking for the exact OTP versions is so we can report this upstream if we get another report.

nixpulvis commented 6 years ago

I believe I am also affected by this. At random times (generally not right after starting the server) I get the following error, which I generally work around by making a few more requests; one of them eventually works.

** (exit) exited in: :gen_server.call(#PID<0.789.0>, {:checkout, #Reference<0.1448561368.3619946498.18872>, true, 15000}, 5000)
    ** (EXIT) time out

Looking at the Process.info output, we get the following:

[
  current_function: {:prim_inet, :recv0, 3},
  initial_call: {:proc_lib, :init_p, 5},
  status: :waiting,
  message_queue_len: 3,
  messages: [
    {:inet_async, #Port<0.46938>, 1087, {:ok, <<90, 0, 0, 0, 5, 73>>}},
    {:"$gen_call", {#PID<0.1057.0>, #Reference<0.1448561368.3619946498.18873>},
     {:checkout, #Reference<0.1448561368.3619946498.18872>, true, 15000}},
    {:"$gen_cast", {:cancel, #Reference<0.1448561368.3619946498.18872>}}
  ],
  links: [#PID<0.786.0>, #PID<0.787.0>, #PID<0.804.0>, #Port<0.46938>],
  dictionary: [
    "$ancestors": [#PID<0.787.0>, Auth.Repo.Pool, Auth.Repo, Auth.Supervisor,
     #PID<0.783.0>],
    "$initial_call": {DBConnection.Connection, :init, 1}
  ],
  trap_exit: false,
  error_handler: :error_handler,
  priority: :normal,
  group_leader: #PID<0.782.0>,
  total_heap_size: 3573,
  heap_size: 2586,
  stack_size: 39,
  reductions: 14872,
  garbage_collection: [
    max_heap_size: %{error_logger: true, kill: true, size: 0},
    min_bin_vheap_size: 46422,
    min_heap_size: 233,
    fullsweep_after: 65535,
    minor_gcs: 12
  ],
  suspending: []
]

System information

Linux milik 4.16.9-1-ARCH #1 SMP PREEMPT Thu May 17 02:10:09 UTC 2018 x86_64 GNU/Linux
Erlang/OTP 20 [erts-9.2.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [hipe] [kernel-poll:false]

fishcakez commented 6 years ago

@Coffei do you have HiPE-compiled code enabled? If so, it would show up as [hipe] in the output above.

Coffei commented 6 years ago

Yes, all the envs I described have hipe enabled.

isaiahdw commented 6 years ago

We've noticed the same thing recently. No changes to the database (postgresql 9.6.4 on Heroku), but we did upgrade Elixir and Erlang/OTP recently. I'll do some more debugging, but it seems related to the Elixir/Erlang changes.

We also noticed that increasing the pool size made this problem happen more often.

fishcakez commented 6 years ago

@isaiahdw could you include the before/after Erlang/OTP versions?

isaiahdw commented 6 years ago

@fishcakez, sorry about the slow reply. We upgraded from OTP 20.1.6 to 20.3.8, although I'm less sure it is related to the OTP version now. We've been running for the last two days without any timeouts, and we're still on OTP 20.3.

I'll post more details here if I'm able to reproduce or start seeing errors again.

fishcakez commented 6 years ago

@isaiahdw did you change OS version?

isaiahdw commented 6 years ago

@fishcakez, no OS changes. This is an application hosted on Heroku, so I can't be 100% sure about changes that I might not be aware of, but the only change we made was the Elixir and OTP updates.

Kabie commented 6 years ago

Today I upgraded Erlang to 20.3.8-1 on Arch Linux and hit this very issue.

oestrich commented 6 years ago

I just got this issue with Erlang 21.0 and Elixir 1.6.6. I was on Erlang 20.2 and Elixir 1.6.5 previously. Arch, with HiPE enabled for both version sets.

josevalim commented 6 years ago

@nixpulvis what is your OS?

nixpulvis commented 6 years ago

@josevalim

$ uname -a
Linux milik 4.16.9-1-ARCH #1 SMP PREEMPT Thu May 17 02:10:09 UTC 2018 x86_64 GNU/Linux

Also updated my original post.

semaphase commented 6 years ago

Having the same issue

$ uname -a
Linux user 4.17.2-1-ARCH #1 SMP PREEMPT Sat Jun 16 11:08:59 UTC 2018 x86_64 GNU/Linux
$ iex -V
Erlang/OTP 21 [erts-10.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]

josevalim commented 6 years ago

Thanks everybody for the info so far.

I have tried to reproduce this issue on a VirtualBox VM but failed, probably due to my lack of knowledge of both VirtualBox and Fedora.

Can we make this easier to reproduce? Maybe someone can use Docker to get @Coffei's sample application running inside Fedora 28 (or any other OS that reproduces the issue) and make it easier to reproduce via some command-line instructions?

Can somebody please look into this? Thanks!

Coffei commented 6 years ago

I've added a Dockerfile to the reproducer app, based on Fedora 28. You can run it as follows:

  1. Run the database: docker run --name db -e POSTGRES_USER=assetmap -e POSTGRES_PASSWORD=assetmap -e POSTGRES_DATABASE=assetmap -d --net=host postgres:10
  2. Run the reproducer: docker run --net=host -it coffei/timeout_reproducer. This will open up the iex console of the app.
  3. Now you can execute TimeoutReproducer.App.test multiple times, as I mentioned previously. I tried it myself and reproduced it on the second try.

Right now the image is still being built; it should be done soon - https://hub.docker.com/r/coffei/timeout_reproducer/builds/

karlseguin commented 6 years ago

I don't have much to add, except to say we've had this same problem for over a year. It's been very infrequent (once every month or so), so we're only now starting to look seriously into it. We're running in EC2, connecting to RDS, Docker on Ubuntu.

I have two questions though:

1 - Is the ping being called on each connection checkout? Have people found this to be worthwhile? I believe that neither Go's lib/pq nor jackc/pgx, nor Node's pg library, bothers to ping on checkout. The connection could just die between the ping and the real command, so if you care about this, one way or another you'll have to have retry logic in the app.

2 - Postgrex is sending a SYNC command. Where does that idea come from? Googling, it seems that drivers that do support ping do a select 1 (or select 'postgrex ping') using a simple query. For example, this reply from the psycopg2 developer says to just issue a select 1 if you must: https://github.com/psycopg/psycopg2/issues/419 (psycopg2 also doesn't support pings).

josevalim commented 6 years ago

Thank you @Coffei! I'm downloading it now.

@karlseguin we only ping on idle_timeout, which defaults to 1 second. If this is the root cause, we can try a couple of things (see the config sketch after this list):

  1. Increase the idle_timeout to 10_000 (10 seconds) and see if it reduces the frequency.
  2. Use another pool besides DBConnection.Poolboy. For example, try DBConnection.Sojourn.
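
A sketch of those two options in an Ecto 2.x repo config (placeholder names and values; DBConnection.Sojourn additionally needs the :sbroker package as a dependency, if I recall correctly):

config :my_app, MyApp.Repo,
  idle_timeout: 10_000          # suggestion 1: ping idle connections every 10 s instead of every 1 s
  # pool: DBConnection.Sojourn  # suggestion 2: use the Sojourn pool instead of DBConnection.Poolboy
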
josevalim commented 6 years ago

@Coffei Unfortunately I cannot reproduce the issue, even with Docker. I have had it running the query periodically for the last 10 minutes and nothing yet. :( Any ideas?

Meanwhile, can you please try the two suggestions above (idle_timeout and changing the pool) and let us know if it changes anything?

Coffei commented 6 years ago

No idea; as I mentioned earlier, it seems quite random, and people with very similar setups were not able to reproduce it while I get it all the time. What worked for me was to restart the app if it didn't show the error after 1-2 tries, as the connections often seemed to be corrupted right after they were created.

I'll try the two suggestions and report back.

Coffei commented 6 years ago

I can see the issue with both suggestions. The idle timeout doesn't seem to have any effect. The Sojourn pool seems to swallow the errors a bit, but the error below still shows up and some queries don't finish properly, so I'd say it's the same.

13:54:43.467 [error] Postgrex.Protocol (#PID<0.283.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.1013.0> timed out because it checked out the connection for longer than 5000ms

arjan commented 6 years ago

I also have this issue in my application. I recently upgraded to 1.6.6 and OTP 21 (Arch). Postgres 10.4.

Weirdly enough, it occurs pretty consistently but only on one repo (my app has two). I'll try to reproduce it with the sample app.

arjan commented 6 years ago

Yes, I have the same issue in this test app; it happens after only a few tries. More and more workers are being blocked in the ping_recv call that is supposed to return quickly.

josevalim commented 6 years ago

@arjan do you reproduce it locally? If not, which OS are you on?

arjan commented 6 years ago

Yes, locally, on a laptop with Arch Linux.

arjan commented 6 years ago

Checking if downgrading to OTP 20.3 fixes it...