elixir-ecto / db_connection

Database connection behaviour
http://hexdocs.pm/db_connection/DBConnection.html

Timeouts on Poolboy.checkout #127

Closed Coffei closed 6 years ago

Coffei commented 6 years ago

Hi, I have recently started seeing the below error in my app

14:39:19.261 [error] GenServer AssetMap.Storage.Stats terminating
** (stop) exited in: :gen_server.call(#PID<0.6836.0>, {:checkout, #Reference<0.3038416112.2899312641.205728>, true, 900000}, 5000)
    ** (EXIT) time out
    (db_connection) lib/db_connection/poolboy.ex:112: DBConnection.Poolboy.checkout/3
    (db_connection) lib/db_connection.ex:928: DBConnection.checkout/2
    (db_connection) lib/db_connection.ex:750: DBConnection.run/3
    (db_connection) lib/db_connection.ex:592: DBConnection.prepare_execute/4
    (ecto) lib/ecto/adapters/postgres/connection.ex:86: Ecto.Adapters.Postgres.Connection.execute/4
    (ecto) lib/ecto/adapters/sql.ex:256: Ecto.Adapters.SQL.sql_call/6
    (stdlib) timer.erl:166: :timer.tc/1
    (utils) lib/stats.ex:32: Utils.Stats.timefun/2
Last message: :report_stats

It started appearing after I upgraded from Fedora 27 to 28; before the upgrade I never saw this error. The code and dependencies are the same, and the Erlang and Elixir versions are roughly the same: I am now on Erlang 20.3 and Elixir 1.6.5, and I also tried Elixir 1.6.0 and a couple of versions in between, all of which behave the same. My guess is that some system package changed versions.

The error occurs quite randomly: some DB queries go through while others time out. It mostly affects my stats GenServer, which runs in the background and executes COUNT(*) on some of my tables every 5s. Roughly 1-2 in every 10 requests fail. Note that the queries are not intensive in any way; when executed directly they return in under 100ms.

The database is Postgres running in Docker. Although I created new containers, the image itself hasn't changed, so I assume the database side of things remained intact.

My colleagues are not experiencing the issue, although none of them run Fedora 28. The app is unfortunately not open source.

I am not quite sure where to start debugging this. Could you provide some pointers where to start? Debug logs provide no further information.
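For readers unfamiliar with this failure mode, here is a minimal sketch of what the error above means: the caller does a synchronous call to the pool process with a 5000 ms limit, and exits when no reply arrives in time. `SlowPool` and `CheckoutDemo` are hypothetical names invented for this illustration; they are not part of db_connection or poolboy, and the real pool does far more than this.

```elixir
defmodule SlowPool do
  # Hypothetical stand-in for a pool whose workers are all stuck.
  use GenServer

  def start_link, do: GenServer.start_link(__MODULE__, :ok)

  @impl true
  def init(:ok), do: {:ok, :no_free_workers}

  @impl true
  def handle_call(:checkout, _from, state) do
    # Never reply: the caller is left waiting, as if no worker were free.
    {:noreply, state}
  end
end

defmodule CheckoutDemo do
  # Turns the exit raised by a timed-out call into an error tuple.
  def checkout(pool, timeout) do
    GenServer.call(pool, :checkout, timeout)
  catch
    :exit, {:timeout, {GenServer, :call, _args}} -> {:error, :pool_timeout}
  end
end

{:ok, pool} = SlowPool.start_link()
# A short timeout keeps the demo quick; the log above used 5000 ms.
IO.inspect(CheckoutDemo.checkout(pool, 100))
```

Catching the exit only keeps the caller alive; it does not explain why the pool stopped answering, which is what the rest of this thread tracks down.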

arjan commented 6 years ago

This is the process info of the worker that hangs:

[       
  current_function: {:prim_inet, :recv0, 3},
  initial_call: {:proc_lib, :init_p, 5},
  status: :waiting,
  message_queue_len: 2,
  links: [#PID<0.198.0>, #Port<0.15>, #PID<0.214.0>, #PID<0.197.0>],
  dictionary: [
    "$initial_call": {DBConnection.Connection, :init, 1},
    "$ancestors": [#PID<0.198.0>, TestPool, TimeoutReproducer.Supervisor,
     #PID<0.195.0>]
  ],
  trap_exit: false,
  error_handler: :error_handler,
  priority: :normal,
  group_leader: #PID<0.194.0>,
  total_heap_size: 2208,
  heap_size: 1598,
  stack_size: 39,
  reductions: 4672,
  garbage_collection: [
    max_heap_size: %{error_logger: true, kill: true, size: 0},
    min_bin_vheap_size: 46422,
    min_heap_size: 233,
    fullsweep_after: 65535,
    minor_gcs: 12
  ],
  suspending: []
]
iex(7)> Process.info(pid(0,201,0), :current_stacktrace)
{:current_stacktrace,
 [
   {:prim_inet, :recv0, 3, []},
   {Postgrex.Protocol, :msg_recv, 4,
    [file: 'lib/postgrex/protocol.ex', line: 1985]},
   {Postgrex.Protocol, :ping_recv, 4,
    [file: 'lib/postgrex/protocol.ex', line: 1734]},
   {DBConnection.Connection, :handle_info, 2,
    [file: 'lib/db_connection/connection.ex', line: 373]},
   {Connection, :handle_async, 3, [file: 'lib/connection.ex', line: 810]},
   {:gen_server, :try_dispatch, 4, [file: 'gen_server.erl', line: 637]},
   {:gen_server, :handle_msg, 6, [file: 'gen_server.erl', line: 711]},
   {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 249]}
 ]}

Have to go now, will dig further tonight.
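The kind of inspection shown above can be tried on any blocked process. The following is a small sketch, assuming nothing beyond the standard library: the spawned process is only a stand-in for a stuck worker, and `:never_sent` and `:unrelated` are made-up messages.

```elixir
# Spawn a process that blocks in `receive`, standing in for a stuck worker.
stuck =
  spawn(fn ->
    receive do
      :never_sent -> :ok
    end
  end)

# Queue a message that the selective receive will not match.
send(stuck, :unrelated)
Process.sleep(50)

# Ask only for the fields of interest; the result is a keyword list.
info = Process.info(stuck, [:status, :message_queue_len, :current_function])
IO.inspect(info, label: "info")

# The stack trace shows where the process is currently parked.
{:current_stacktrace, frames} = Process.info(stuck, :current_stacktrace)
IO.inspect(frames, label: "stacktrace")
```

A `:waiting` status together with a non-empty message queue, as in the dump above, suggests the process is waiting for a specific message that is not the one sitting in its mailbox.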

luisgabrielroldan commented 6 years ago

I'm having the same issue. Downgrades are not working for me. I tried downgrading OTP to 20 and using older Elixir versions, but it still happens.

I used asdf to set up the older versions that I know were working fine on my project:

Erlang/OTP 20 [erts-9.2] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [hipe] [kernel-poll:false]

Elixir 1.5.2

... and still happens.

I'm using Arch Linux too... This is happening after a system upgrade (I did it on Saturday but I have been putting it off for weeks).

pera commented 6 years ago

@josevalim I just created a simple Phoenix application that reproduces this issue on my Arch Linux laptop: https://gitlab.com/peramides/db_connection_issue127

This is the error message I get after refreshing the home page a few times:

11:39:37.788 [error] #PID<0.10192.1> running TestWeb.Endpoint (cowboy_protocol) terminated                                                                                                                                                    
Server: 0.0.0.0:4000 (http)
Request: GET /
** (exit) exited in: :gen_server.call(#PID<0.2821.0>, {:checkout, #Reference<0.2314789953.763363332.177259>, true, 15000}, 5000)
    ** (EXIT) time out
=ERROR REPORT==== 25-Jun-2018::11:39:37.785472 ===
Ranch listener 'Elixir.TestWeb.Endpoint.HTTP' had connection process started with cowboy_protocol:start_link/4 at <0.10192.1> exit with reason: {{timeout,{gen_server,call,[<0.2821.0>,{checkout,#Ref<0.2314789953.763363332.177259>,true,15000},5000]}},{'Elixir.TestWeb.Endpoint',call,[#{'__struct__' => 'Elixir.Plug.Conn',adapter => {'Elixir.Plug.Adapters.Cowboy.Conn',{http_req,#Port<0.41287>,ranch_tcp,keepalive,<0.10192.1>,<<"GET">>,'HTTP/1.1',{{0,0,0,0,0,65535,32512,1},54514},<<"0.0.0.0">>,undefined,4000,<<"/">>,undefined,<<>>,undefined,[],[{<<"host">>,<<"0.0.0.0:4000">>},{<<"user-agent">>,<<"Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0">>},{<<"accept">>,<<"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8">>},{<<"accept-language">>,<<"en-US,en;q=0.5">>},{<<"accept-encoding">>,<<"gzip, deflate">>},{<<"dnt">>,<<"1">>},{<<"connection">>,<<"keep-alive">>},{<<"upgrade-insecure-requests">>,<<"1">>},{<<"cache-control">>,<<"max-age=0">>}],[{<<"connection">>,[<<"keep-alive">>]}],undefined,[],waiting,<<>>,undefined,false,waiting,[],<<>>,#Fun<Elixir.Plug.Adapters.Cowboy.1.97723902>}},assigns => #{},before_send => [],body_params => #{'__struct__' => 'Elixir.Plug.Conn.Unfetched',aspect => body_params},cookies => #{'__struct__' => 'Elixir.Plug.Conn.Unfetched',aspect => cookies},halted => false,host => <<"0.0.0.0">>,method => <<"GET">>,owner => <0.10192.1>,params => #{'__struct__' => 'Elixir.Plug.Conn.Unfetched',aspect => params},path_info => [],path_params => #{},port => 4000,private => #{},query_params => #{'__struct__' => 'Elixir.Plug.Conn.Unfetched',aspect => query_params},query_string => <<>>,remote_ip => {0,0,0,0,0,65535,32512,1},req_cookies => #{'__struct__' => 'Elixir.Plug.Conn.Unfetched',aspect => cookies},req_headers => [{<<"host">>,<<"0.0.0.0:4000">>},{<<"user-agent">>,<<"Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 
Firefox/60.0">>},{<<"accept">>,<<"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8">>},{<<"accept-language">>,<<"en-US,en;q=0.5">>},{<<"accept-encoding">>,<<"gzip, deflate">>},{<<"dnt">>,<<"1">>},{<<"connection">>,<<"keep-alive">>},{<<"upgrade-insecure-requests">>,<<"1">>},{<<"cache-control">>,<<"max-age=0">>}],request_path => <<"/">>,resp_body => nil,resp_cookies => #{},resp_headers => [{<<"cache-control">>,<<"max-age=0, private, must-revalidate">>}],scheme => http,script_name => [],secret_key_base => nil,state => unset,status => nil},[]]}}                                                                                

Sometimes this happens every couple of requests, but if it doesn't you can try pressing Ctrl-R on your browser for a few seconds until you see the error on the terminal window.

arjan commented 6 years ago

Weird, I can't reproduce it on OTP 20. I don't think the Elixir version has much to do with it, BTW.

luisgabrielroldan commented 6 years ago

@arjan It's happening to me on OTP 20 and 21. As you said, this doesn't look like an Elixir bug; I think it is related to a new library release...

The only way I found to continue working today was to edit my mirrorlist file and use a snapshot from 15-04-2018 (maybe a newer one could work too):

Server=https://archive.archlinux.org/repos/2018/04/15/$repo/os/$arch

pera commented 6 years ago

@josevalim I believe I found the issue with your testing environment: your VM was probably using one single core (this is the default in VirtualBox, for instance). I created an Arch Linux instance and wasn't able to reproduce the issue until I changed the number of CPUs from 1 to 4.

Let me know if you need this appliance.
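To check whether a given environment matches the multi-core condition described above, the VM can report how many schedulers it is running. This is only a quick sketch using standard calls; it shows the core count the VM sees and does not by itself reproduce anything.

```elixir
# Number of scheduler threads currently running Erlang code.
IO.inspect(System.schedulers_online(), label: "schedulers online")

# Logical processors detected by the VM (can be :unknown on some platforms).
IO.inspect(:erlang.system_info(:logical_processors), label: "logical processors")
```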

arjan commented 6 years ago

@pera would be nice if you could share it!

arjan commented 6 years ago

Issue update: we narrowed the stuck DB workers down to the prim_inet:recv0 function in OTP, where it receives on {inet_async, S, Ref, _}, but the Ref variable is off by one from that of an inet_async message in the process's inbox.
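A toy illustration of why such a mismatch hangs the worker: a selective receive that pins the reference skips any message carrying a different one. This is not OTP's code. `RefMismatchDemo`, `:fake_socket` and the integer refs are assumptions made for the sketch, and the `after` clause exists only so that the demo terminates.

```elixir
defmodule RefMismatchDemo do
  # Waits for a reply tagged with exactly `ref`, ignoring everything else.
  def await(socket, ref, timeout) do
    receive do
      {:inet_async, ^socket, ^ref, status} -> {:ok, status}
    after
      # Without this clause the process would block indefinitely.
      timeout -> {:stuck, Process.info(self(), :message_queue_len)}
    end
  end
end

socket = :fake_socket

# The reply arrives tagged with ref 1 ...
send(self(), {:inet_async, socket, 1, {:ok, "data"}})

# ... but the receiver waits for ref 2, so the message is never matched.
IO.inspect(RefMismatchDemo.await(socket, 2, 100), label: "waiting for ref 2")

# Waiting for the ref that was actually sent succeeds immediately.
IO.inspect(RefMismatchDemo.await(socket, 1, 100), label: "waiting for ref 1")
```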

josevalim commented 6 years ago

@arjan Btw, did you check if the port that we get stuck on is still alive? Something like this would do it:

iex(3)> Port.monitor :erlang.list_to_port('#Port<0.64>')
#Reference<0.1283475765.2787377154.58319>
iex(4)> flush
{:DOWN, #Reference<0.1283475765.2787377154.58319>, :port, #Port<0.64>, :noproc}
:ok

michalmuskala commented 6 years ago

@arjan that's a great find. This confirms it's an issue in OTP itself and we should report it to https://bugs.erlang.org/ (unfortunately it's down right now 😩).

arjan commented 6 years ago

@josevalim will do that to confirm. @michalmuskala yes I'm first trying to find a way to reproduce it reliably

arjan commented 6 years ago

OK, it seems to have to do with a GCC compiler optimization. If I compile inet_drv.c without optimization (-O0), it does not happen; with -O3 (the default) it does happen. (With -O2 it happens as well, with -O1 it does not, so that is the cutoff point.)

GCC version: 8.1.1 20180531

It would be nice if somebody could reproduce my findings :-)

arjan commented 6 years ago

@josevalim I got it reproduced on a digitalocean machine (Ubuntu latest, GCC 8.1), I mailed you the details. If somebody else wants the access to the machine to investigate, ping me.

nixpulvis commented 6 years ago

I've downloaded both OTP and Elixir from GitHub on master and compiled each from source. First OTP, following the instructions in the README; I confirmed that it worked with erl. Then, after prepending its bin directory to my $PATH, I compiled Elixir from source and confirmed it worked with iex, ensuring the versions were updated from my system's installation. Then, as I did with Erlang/OTP, I prepended its bin to my $PATH.

With this I tested https://github.com/Coffei/timeout_reproducer and verified the issue still occurs. Then I attempted to set the -O0 flag by changing OPT_LEVEL inside erts/emulator/Makefile.in. Is this correct? If not, what Makefile value needs to be changed? With this updated compilation I can still reproduce the issue.

P.S. I didn't recompile Elixir in between compilations of Erlang/OTP. Is this needed? P.P.S My version of GCC is 8.1.0.

Thoughts @arjan?

arjan commented 6 years ago

@nixpulvis thanks for the details. Elixir indeed doesn't need to be recompiled. What I did was CFLAGS=-O0 ./configure ... and then CFLAGS=-O0 make -j4, followed by a new make install of course.

Also checking the log output and just recompiling inet_drv.c followed by a make install again did it for me.

Full gcc commandline for reference:

gcc -Werror=undef -Werror=implicit -Werror=return-type  -g -O3 -I/home/arjan/src/otp/erts/x86_64-unknown-linux-gnu   -fno-tree-copyrename  -D_GNU_SOURCE  -DHAVE_CONFIG_H -Wall -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS -D_POSIX_THREAD_SAFE_FUNCTIONS   -DLIBSCTP= -Ix86_64-unknown-linux-gnu/opt/smp -Ibeam -Isys/unix -Isys/common -Ix86_64-unknown-linux-gnu -Ipcre -Ihipe -I../include -I../include/x86_64-unknown-linux-gnu -I../include/internal -I../include/internal/x86_64-unknown-linux-gnu -Idrivers/common -Idrivers/unix -c drivers/common/inet_drv.c -o obj/x86_64-unknown-linux-gnu/opt/smp/inet_drv.o

(run from the erts/emulator directory in the OTP source tree)

arjan commented 6 years ago

Changing OPT_LEVEL seems OK as well, btw. I will double-check this, as I might be wrong. I mailed you the server details; I don't have much time left tonight.

arjan commented 6 years ago

@Coffei can you check whether your OTP is compiled with GCC 8.1? It might be related to the compiler, not to the OTP version…

You can check it like this:

# strings /usr/lib/erlang/erts-10.0/bin/beam.smp|grep GCC
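As a complementary check from inside a running VM, the emulator can report the C compiler it was built with. A small sketch using `:erlang.system_info/1`; the exact shape of the version term depends on the compiler, and compilers other than GCC may also identify themselves as `:gnuc`, so treat the output as a hint rather than proof.

```elixir
# Two-tuple {compiler, version}, for example {:gnuc, {8, 1, 0}} on a GCC 8.1.0 build.
IO.inspect(:erlang.system_info(:c_compiler_used), label: "C compiler")

# ERTS version of the running emulator, returned as a charlist.
IO.inspect(:erlang.system_info(:version), label: "ERTS version")
```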

nixpulvis commented 6 years ago

@arjan I've rerun everything locally following your commands more directly, still getting errors, though they might be happening less. It's really hard to tell. I'm personally not 100% sure how Elixir even integrates with Erlang, so I'm flying a bit blind here (just being honest). It could be the case I didn't tell Elixir to use the correct version of Erlang.

pera commented 6 years ago

@arjan here is the VirtualBox appliance with Arch Linux: https://drive.google.com/file/d/1U8zcbdWYZRwvnKoJ7E6TAjEKM2Pq23Ah/view Log in as root and then go to ~/db_connection_issue127, where you will find a README file with instructions to reproduce this issue. After a few tries you should see something like this: (screenshot)

netsudo commented 6 years ago

@arjan

Experiencing the same issue as well, Erlang/OTP 20 (although I've just recently downgraded from OTP 21 to try to mitigate this issue) with Elixir 1.6.5. Running strings gave me

#define ETHR_GCC_HAVE_DW_CMPXCHG_ASM_SUPPORT 1
/* #undef ETHR_GCC_HAVE_SSE2_ASM_SUPPORT */
#define ETHR_HAVE_GCC_ASM_ARM_DMB_INSTRUCTION 0
#define ETHR_HAVE_GCC___ATOMIC_BUILTINS 1
/* #undef ETHR_PREFER_GCC_NATIVE_IMPLS */
#define ETHR_TRUST_GCC_ATOMIC_BUILTINS_MEMORY_BARRIERS 0
GCC: (GNU) 8.1.0
GCC: (GNU) 8.1.1 20180531

My project never experiences the issue on my desktop though, only on my laptop. I'm running Arch on both machines.

Coffei commented 6 years ago

@arjan Yes, even though I personally downgraded to GCC 7.3 (had some problems with 8.1), my OTPs seem to be compiled with 8.1.

arjan commented 6 years ago

Thanks for all the info, I'm writing a bug report for the OTP team now.

josevalim commented 6 years ago

Thank you @pera! I was able to reproduce it in your VM with @Coffei's app.

arjan commented 6 years ago

I created the bug report: https://bugs.erlang.org/browse/ERL-654

@josevalim On the digitalocean machine I compiled OTP with GCC 7 and then the bug doesn't occur; if it is compiled with GCC 8, it does.

josevalim commented 6 years ago

@arjan I can also reproduce it on MacOS X with GCC 8.1. :D

pera commented 6 years ago

I tried John Högberg's patch on OTP's master branch and the race condition seems completely fixed :)

Coffei commented 6 years ago

Wonderful news, seems like we have a fix coming to OTP soon. Thanks to everybody involved!

josevalim commented 6 years ago

Thank you @Coffei for getting this started, providing information and a minimal app!

josevalim commented 6 years ago

This is fixed on OTP 21.0.2. Thanks @Coffei for guiding this and everyone who helped with information, debugging, etc. 🎉 I am not familiar with how archlinux works but it would be really appreciated if somebody makes sure that the Erlang/OTP package in their repository gets updated too.
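To see whether an installation already includes the fix, note that `:erlang.system_info(:otp_release)` only returns the major release (for example "21"). The full version is normally stored in an OTP_VERSION file under the installation root. A minimal sketch, assuming a standard installation layout; stripped-down releases may not ship that file, which the error branch covers.

```elixir
# Major release only, e.g. "21".
otp_release = :erlang.system_info(:otp_release) |> List.to_string()

# Standard location of the full version string inside an OTP installation.
root = List.to_string(:code.root_dir())
version_file = Path.join([root, "releases", otp_release, "OTP_VERSION"])

case File.read(version_file) do
  {:ok, contents} ->
    IO.puts("OTP version: " <> String.trim(contents))

  {:error, reason} ->
    IO.puts("OTP #{otp_release} (OTP_VERSION not readable: #{inspect(reason)})")
end
```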

tcoopman commented 6 years ago

It's already flagged as out of date on Arch Linux, so I guess it won't take long for the new version to drop: https://www.archlinux.org/packages/community/x86_64/erlang/

ktec commented 6 years ago

Seriously :+1: :100: :heart: :blue_heart: :green_heart: :yellow_heart: :rabbit: :rabbit: :+1: to all'a you. Thank you :1st_place_medal:

dimitarvp commented 5 years ago

Just found this and it helped me. Thanks so much to everybody!

kamciokodzi commented 4 years ago

Hello! Is there any chance that this still occurs, or came back, on OTP 22.2.6? We have encountered this problem recently and it's causing a lot of trouble, even application downtime. It happens quite randomly, I would say, but I'm not 100% sure. Everything seems to be working fine and then :boom: suddenly response times go up from 20-50ms to as much as 20000ms. I have checked for slow queries but didn't find any, and I'm quite concerned. Queue time was high, but that was at the same moment the crash happened, so I'm not sure whether it was the cause.

We build the release on Docker from the erlang:22.2.6 image with Elixir 1.10.1 and then run the release on debian:buster-slim. Before that we built the release with erlang:22.0 and Elixir 1.9.0 and ran it on debian:stretch-slim, and the problem occurred there too, so I'm not sure whether it's something in our code or an OTP issue again 😢

I would love to hear some advice from you guys.

exit: ** (exit) exited in: :gen_server.call(Postgres.Repo.Pool, {:checkout, #Reference<0.1066510293.4072931345.103161>, true}, 5000)
    ** (EXIT) time out
  File "gen_server.erl", line 223, in :gen_server.call/3
  File "/app/deps/poolboy/src/poolboy.erl", line 63, in :poolboy.checkout/3
  File "lib/db_connection/poolboy.ex", line 41, in DBConnection.Poolboy.checkout/2
  File "lib/db_connection.ex", line 928, in DBConnection.checkout/2
  File "lib/db_connection.ex", line 750, in DBConnection.run/3
  File "lib/db_connection.ex", line 592, in DBConnection.prepare_execute/4
  File "lib/ecto/adapters/postgres/connection.ex", line 86, in Ecto.Adapters.Postgres.Connection.execute/4
  File "lib/ecto/adapters/sql.ex", line 256, in Ecto.Adapters.SQL.sql_call/6

michalmuskala commented 4 years ago

Just as a data point db_connection does not use poolboy since version 2.0 released around 1.5 years ago.

kamciokodzi commented 4 years ago

@michalmuskala Yeah, I know, but we are still using Ecto 2.2, which uses poolboy 1.5 in that specific version. So basically, are you saying that the only way to fix that is to bump Ecto?
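For context on the numbers in these errors: the trailing 5000 in the gen_server.call is the time the caller waits for the pool to reply, which is separate from the query timeout. Below is a sketch of the relevant options as plain keyword lists. The values are illustrative, and the option names should be checked against the documentation for the exact Ecto and db_connection versions in use.

```elixir
# Ecto 2.x with db_connection 1.x (poolboy-based pool); illustrative values.
ecto2_opts = [
  pool_size: 10,
  # Milliseconds to wait for the pool to answer a checkout request.
  pool_timeout: 5_000,
  # Milliseconds allowed for the query itself.
  timeout: 15_000
]

# db_connection 2.x drops poolboy and uses queue-based settings instead.
db_connection2_opts = [
  pool_size: 10,
  # Target checkout wait in milliseconds.
  queue_target: 50,
  # Window, in milliseconds, over which the target is evaluated.
  queue_interval: 1_000
]

IO.inspect(ecto2_opts, label: "Ecto 2.x style")
IO.inspect(db_connection2_opts, label: "db_connection 2.x style")
```

Raising the pool timeout or the pool size can hide a saturated pool, but it would not have helped with the OTP bug discussed above, where workers were stuck regardless of load.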