This is the process info of the worker that hangs:
[
current_function: {:prim_inet, :recv0, 3},
initial_call: {:proc_lib, :init_p, 5},
status: :waiting,
message_queue_len: 2,
links: [#PID<0.198.0>, #Port<0.15>, #PID<0.214.0>, #PID<0.197.0>],
dictionary: [
"$initial_call": {DBConnection.Connection, :init, 1},
"$ancestors": [#PID<0.198.0>, TestPool, TimeoutReproducer.Supervisor,
#PID<0.195.0>]
],
trap_exit: false,
error_handler: :error_handler,
priority: :normal,
group_leader: #PID<0.194.0>,
total_heap_size: 2208,
heap_size: 1598,
stack_size: 39,
reductions: 4672,
garbage_collection: [
max_heap_size: %{error_logger: true, kill: true, size: 0},
min_bin_vheap_size: 46422,
min_heap_size: 233,
fullsweep_after: 65535,
minor_gcs: 12
],
suspending: []
]
iex(7)> Process.info(pid(0,201,0), :current_stacktrace)
{:current_stacktrace,
[
{:prim_inet, :recv0, 3, []},
{Postgrex.Protocol, :msg_recv, 4,
[file: 'lib/postgrex/protocol.ex', line: 1985]},
{Postgrex.Protocol, :ping_recv, 4,
[file: 'lib/postgrex/protocol.ex', line: 1734]},
{DBConnection.Connection, :handle_info, 2,
[file: 'lib/db_connection/connection.ex', line: 373]},
{Connection, :handle_async, 3, [file: 'lib/connection.ex', line: 810]},
{:gen_server, :try_dispatch, 4, [file: 'gen_server.erl', line: 637]},
{:gen_server, :handle_msg, 6, [file: 'gen_server.erl', line: 711]},
{:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 249]}
]}
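To see whether other workers are parked in the same place, a minimal sketch along these lines could be pasted into the same iex session. The module name StuckWorkers is made up for illustration; the only calls used are Process.list/0 and Process.info/2. Note that a process that is legitimately waiting for data on a socket will also be listed, so the result is a starting point, not proof of a hang.

```elixir
# Hypothetical helper (the module name is an assumption, not part of any library).
defmodule StuckWorkers do
  # Returns the pids whose current function is :prim_inet.recv0/3,
  # the same frame shown at the top of the stacktrace above.
  def list do
    for pid <- Process.list(),
        # Process.info/2 returns nil for a process that has already exited;
        # the pattern in the generator silently skips those and any other function.
        {:current_function, {:prim_inet, :recv0, 3}} <-
          [Process.info(pid, :current_function)],
        do: pid
  end
end

IO.inspect(StuckWorkers.list(), label: "processes in prim_inet:recv0")
```

Calling StuckWorkers.list() a few seconds apart and comparing the results should show which pids never leave that function.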
Have to go now, will dig further tonight.
I'm having the same issue. Downgrades are not working for me: I tried downgrading OTP to 20 and using older Elixir versions, but it still happens.
I used asdf to set up the older versions that I know were working fine on my project:
Erlang/OTP 20 [erts-9.2] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [hipe] [kernel-poll:false]
Elixir 1.5.2
... and it still happens.
I'm using Arch Linux too... This is happening after a system upgrade (I did it on Saturday, but I had been putting it off for weeks).
@josevalim I just created a simple Phoenix application that reproduces this issue on my Arch Linux laptop: https://gitlab.com/peramides/db_connection_issue127
This is the error message I get after refreshing the home page a few times:
11:39:37.788 [error] #PID<0.10192.1> running TestWeb.Endpoint (cowboy_protocol) terminated
Server: 0.0.0.0:4000 (http)
Request: GET /
** (exit) exited in: :gen_server.call(#PID<0.2821.0>, {:checkout, #Reference<0.2314789953.763363332.177259>, true, 15000}, 5000)
** (EXIT) time out
=ERROR REPORT==== 25-Jun-2018::11:39:37.785472 ===
Ranch listener 'Elixir.TestWeb.Endpoint.HTTP' had connection process started with cowboy_protocol:start_link/4 at <0.10192.1> exit with reason: {{timeout,{gen_server,call,[<0.2821.0>,{checkout,#Ref<0.2314789953.763363332.177259>,true,15000},5000]}},{'Elixir.TestWeb.Endpoint',call,[#{'__struct__' => 'Elixir.Plug.Conn',adapter => {'Elixir.Plug.Adapters.Cowboy.Conn',{http_req,#Port<0.41287>,ranch_tcp,keepalive,<0.10192.1>,<<"GET">>,'HTTP/1.1',{{0,0,0,0,0,65535,32512,1},54514},<<"0.0.0.0">>,undefined,4000,<<"/">>,undefined,<<>>,undefined,[],[{<<"host">>,<<"0.0.0.0:4000">>},{<<"user-agent">>,<<"Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0">>},{<<"accept">>,<<"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8">>},{<<"accept-language">>,<<"en-US,en;q=0.5">>},{<<"accept-encoding">>,<<"gzip, deflate">>},{<<"dnt">>,<<"1">>},{<<"connection">>,<<"keep-alive">>},{<<"upgrade-insecure-requests">>,<<"1">>},{<<"cache-control">>,<<"max-age=0">>}],[{<<"connection">>,[<<"keep-alive">>]}],undefined,[],waiting,<<>>,undefined,false,waiting,[],<<>>,#Fun<Elixir.Plug.Adapters.Cowboy.1.97723902>}},assigns => #{},before_send => [],body_params => #{'__struct__' => 'Elixir.Plug.Conn.Unfetched',aspect => body_params},cookies => #{'__struct__' => 'Elixir.Plug.Conn.Unfetched',aspect => cookies},halted => false,host => <<"0.0.0.0">>,method => <<"GET">>,owner => <0.10192.1>,params => #{'__struct__' => 'Elixir.Plug.Conn.Unfetched',aspect => params},path_info => [],path_params => #{},port => 4000,private => #{},query_params => #{'__struct__' => 'Elixir.Plug.Conn.Unfetched',aspect => query_params},query_string => <<>>,remote_ip => {0,0,0,0,0,65535,32512,1},req_cookies => #{'__struct__' => 'Elixir.Plug.Conn.Unfetched',aspect => cookies},req_headers => [{<<"host">>,<<"0.0.0.0:4000">>},{<<"user-agent">>,<<"Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 
Firefox/60.0">>},{<<"accept">>,<<"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8">>},{<<"accept-language">>,<<"en-US,en;q=0.5">>},{<<"accept-encoding">>,<<"gzip, deflate">>},{<<"dnt">>,<<"1">>},{<<"connection">>,<<"keep-alive">>},{<<"upgrade-insecure-requests">>,<<"1">>},{<<"cache-control">>,<<"max-age=0">>}],request_path => <<"/">>,resp_body => nil,resp_cookies => #{},resp_headers => [{<<"cache-control">>,<<"max-age=0, private, must-revalidate">>}],scheme => http,script_name => [],secret_key_base => nil,state => unset,status => nil},[]]}}
Sometimes this happens every couple of requests, but if it doesn't you can try holding Ctrl-R in your browser for a few seconds until you see the error in the terminal window.
Weird, I can't reproduce it on OTP 20. Don't think the Elixir version has much to do with it, BTW.
@arjan It's happening to me on OTP 20 and 21. As you said, this doesn't look like an Elixir bug; I think it's related to a new library release...
The only way I found to continue working today was to edit my mirrorlist file and use a snapshot from 15-04-2018 (maybe a newer one could work too):
Server=https://archive.archlinux.org/repos/2018/04/15/$repo/os/$arch
@josevalim I believe I found the issue with your testing environment: your VM was probably using one single core (this is the default in VirtualBox, for instance). I created an Arch Linux instance and wasn't able to reproduce the issue until I changed the number of CPUs from 1 to 4.
Let me know if you need this appliance.
@pera would be nice if you could share it!
Issue update: we narrowed the stuck db workers down to the prim_inet:recv0 function in OTP, where it receives on {inet_async, S, Ref, _} but the Ref variable is off by one compared with an inet_async message in the process's inbox.
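To illustrate why an off-by-one Ref leaves the worker waiting, here is a minimal Elixir sketch of a selective receive. It is not the real prim_inet code: the module RefMismatch, the atom :fake_socket and the integer refs 41 and 42 are all made up for illustration, and the 100 ms timeout exists only so the sketch terminates.

```elixir
# Illustration only: RefMismatch, :fake_socket and the integer refs are
# hypothetical stand-ins, not the real prim_inet code or message contents.
defmodule RefMismatch do
  # Selective receive: only a message whose socket AND ref both match is taken.
  def recv(socket, ref, timeout) do
    receive do
      {:inet_async, ^socket, ^ref, status} -> status
    after
      # Assumption for the demo: give up after `timeout` ms instead of blocking.
      timeout -> :timeout
    end
  end
end

# The reply arrives tagged with ref 41 ...
send(self(), {:inet_async, :fake_socket, 41, {:ok, "data"}})

# ... but the receiver waits for ref 42, so the message is never matched.
result = RefMismatch.recv(:fake_socket, 42, 100)

# The unmatched message is still sitting in the mailbox afterwards.
{:message_queue_len, queued} = Process.info(self(), :message_queue_len)
IO.inspect({result, queued}, label: "result and queue length")
```

The message with the wrong ref is skipped rather than consumed, which fits a process that reports a non-empty message_queue_len while still waiting in a receive.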
@arjan Btw, did you check if the port that we get stuck on is still alive? Something like this would do it:
iex(3)> Port.monitor :erlang.list_to_port('#Port<0.64>')
#Reference<0.1283475765.2787377154.58319>
iex(4)> flush
{:DOWN, #Reference<0.1283475765.2787377154.58319>, :port, #Port<0.64>, :noproc}
:ok
@arjan that's a great find. This confirms it's an issue in OTP itself and we should report it to https://bugs.erlang.org/ (unfortunately it's down right now 😩).
@josevalim will do that to confirm. @michalmuskala yes I'm first trying to find a way to reproduce it reliably
OK, it seems to have to do with a GCC compiler optimization. If I compile inet_drv.c without optimization (-O0), it does not happen; with -O3 (the default) it does happen. With -O2 it happens as well and with -O1 it does not, so that is the cutoff point.
GCC version: 8.1.1 20180531
It would be nice if somebody could reproduce my findings :-)
@josevalim I got it reproduced on a DigitalOcean machine (latest Ubuntu, GCC 8.1), I mailed you the details. If somebody else wants access to the machine to investigate, ping me.
I've downloaded both OTP and Elixir from GitHub on master, and compiled them each from source. First OTP, following the instructions in the README. I've confirmed that it worked with erl. Then, after prepending its bin directory to my $PATH, I compiled Elixir from source and confirmed it worked with iex, ensuring the versions were updated from my system's installation. Then, as I did with Erlang/OTP, I prepended its bin to my $PATH.
With this I tested https://github.com/Coffei/timeout_reproducer, and verified the issue still occurs. Then I attempted to set the -O0 flag by changing OPT_LEVEL inside erts/emulator/Makefile.in. Is this correct? If not, what Makefile value needs to be changed? With this updated compilation I can still reproduce the issue.
P.S. I didn't recompile Elixir in between compilations of Erlang/OTP. Is this needed? P.P.S. My version of GCC is 8.1.0.
Thoughts @arjan?
@nixpulvis thanks for the details. Elixir doesn't need to be recompiled indeed.
What I did was CFLAGS=-O0 ./configure ... and then CFLAGS=-O0 make -j4, followed by a new make install of course.
Also, checking the log output and just recompiling inet_drv.c, followed by a make install again, did it for me.
Full gcc commandline for reference:
gcc -Werror=undef -Werror=implicit -Werror=return-type -g -O3 -I/home/arjan/src/otp/erts/x86_64-unknown-linux-gnu -fno-tree-copyrename -D_GNU_SOURCE -DHAVE_CONFIG_H -Wall -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS -D_POSIX_THREAD_SAFE_FUNCTIONS -DLIBSCTP= -Ix86_64-unknown-linux-gnu/opt/smp -Ibeam -Isys/unix -Isys/common -Ix86_64-unknown-linux-gnu -Ipcre -Ihipe -I../include -I../include/x86_64-unknown-linux-gnu -I../include/internal -I../include/internal/x86_64-unknown-linux-gnu -Idrivers/common -Idrivers/unix -c drivers/common/inet_drv.c -o obj/x86_64-unknown-linux-gnu/opt/smp/inet_drv.o
(run from the erts/emulator directory in the OTP source tree)
Changing OPT_LEVEL seems OK as well, btw. I will double-check this, I might be wrong. Mailed you the server details, don't have much time tonight anymore.
@Coffei can you check whether your OTP is compiled with GCC 8.1? It might be related to the compiler, not to the OTP version…
You can check it like this:
# strings /usr/lib/erlang/erts-10.0/bin/beam.smp|grep GCC
@arjan I've rerun everything locally following your commands more directly, still getting errors, though they might be happening less. It's really hard to tell. I'm personally not 100% sure how Elixir even integrates with Erlang, so I'm flying a bit blind here (just being honest). It could be the case I didn't tell Elixir to use the correct version of Erlang.
@arjan here is the VirtualBox appliance with Arch Linux: https://drive.google.com/file/d/1U8zcbdWYZRwvnKoJ7E6TAjEKM2Pq23Ah/view Login with root and then go to ~/db_connection_issue127, there you will find a README file with instructions to reproduce this issue. After a few tries you should see something like this:
@arjan Experiencing the same issue as well, Erlang/OTP 20 (although I've just recently downgraded from OTP 21 to try to mitigate this issue) with Elixir 1.6.5. Running strings gave me:
#define ETHR_GCC_HAVE_DW_CMPXCHG_ASM_SUPPORT 1
/* #undef ETHR_GCC_HAVE_SSE2_ASM_SUPPORT */
/* #undef ETHR_PREFER_GCC_NATIVE_IMPLS */
GCC: (GNU) 8.1.0
GCC: (GNU) 8.1.1 20180531
My project never experiences the issue on my desktop though, only on my laptop. I'm running Arch on both machines.
@arjan Yes, even though I personally downgraded to GCC 7.3 (had some problems with 8.1), my OTPs seem to be compiled with 8.1.
Thanks for all the info, I'm writing a bug report for the OTP team now.
Thank you @pera! I was able to reproduce it in your VM with @Coffei's app.
I created the bug report: https://bugs.erlang.org/browse/ERL-654
@josevalim On the DigitalOcean machine I compiled OTP with GCC 7 and then the bug doesn't occur; if it is compiled with GCC 8, it does.
@arjan I can also reproduce it on MacOS X with GCC 8.1. :D
I tried John Högberg's patch on OTP's master branch and the race condition seems completely fixed :)
Wonderful news, seems like we have a fix coming to OTP soon. Thanks to everybody involved!
Thank you @Coffei for getting this started, providing information and a minimal app!
This is fixed on OTP 21.0.2. Thanks @Coffei for guiding this and everyone who helped with information, debugging, etc. 🎉 I am not familiar with how archlinux works but it would be really appreciated if somebody makes sure that the Erlang/OTP package in their repository gets updated too.
It's already flagged as out of date on Arch Linux, so I guess it won't take long for the new version to drop: https://www.archlinux.org/packages/community/x86_64/erlang/
Seriously :+1: :100: :heart: :blue_heart: :green_heart: :yellow_heart: :rabbit: :rabbit: :+1: to all'a you. Thank you :1st_place_medal:
Just found this and it helped me. Thanks so much to everybody!
Hello!
Is there any chance that it still occurs, or came back, in OTP 22.2.6?
We have encountered this problem recently and it's causing a lot of problems, even application downtime. It is happening quite randomly, I would say, but I'm not 100% sure. Everything seems to be working fine and then :boom: suddenly response times go up from 20-50ms to as much as 20000ms. I have checked for slow queries but didn't find any, and I'm quite concerned. Queue time was high, but that was at the same moment the crash happened, so I'm not sure if it was the cause.
We are building the release on Docker from the erlang:22.2.6 image with elixir 1.10.1 and then running the release on debian:buster-slim. Before that we built the release on erlang:22.0 with elixir 1.9.0 and ran it on debian:stretch-slim, and the problem occurred there too, so I'm not sure if that's something with our code or some OTP issue again 😢
I would love to hear some advice from you.
exit: ** (exit) exited in: :gen_server.call(Postgres.Repo.Pool, {:checkout, #Reference<0.1066510293.4072931345.103161>, true}, 5000)
** (EXIT) time out
File "gen_server.erl", line 223, in :gen_server.call/3
File "/app/deps/poolboy/src/poolboy.erl", line 63, in :poolboy.checkout/3
File "lib/db_connection/poolboy.ex", line 41, in DBConnection.Poolboy.checkout/2
File "lib/db_connection.ex", line 928, in DBConnection.checkout/2
File "lib/db_connection.ex", line 750, in DBConnection.run/3
File "lib/db_connection.ex", line 592, in DBConnection.prepare_execute/4
File "lib/ecto/adapters/postgres/connection.ex", line 86, in Ecto.Adapters.Postgres.Connection.execute/4
File "lib/ecto/adapters/sql.ex", line 256, in Ecto.Adapters.SQL.sql_call/6
Just as a data point: db_connection has not used poolboy since version 2.0, released around 1.5 years ago.
@michalmuskala yeah, I know, but we are still using Ecto 2.2, which uses poolboy 1.5 in that specific version.
So basically, are you saying that the only way to fix that is to bump Ecto?
Hi, I have recently started seeing the below error in my app
It started appearing after I upgraded from Fedora 27 to 28; before the upgrade I never saw such an error. The code and dependencies are the same. I am still using roughly the same Erlang and Elixir versions, currently Erlang 20.3 and Elixir 1.6.5. I also tried Elixir 1.6.0 and a couple of versions in between, and all behave the same. My guess is that something in the system changed versions.
The error occurs quite randomly: some DB queries go through while some time out. It seems to affect my stats GenServer the most; the stats GenServer runs in the background and executes COUNT(*) on some of my tables every 5s. It seems to affect 1-2 in every 10 requests. Note the queries are not intensive in any way; when executed directly they return in under 100ms. The database is Postgres running in Docker. Although I created new containers, the image itself hasn't changed, so I assume the database side of things remained intact.
My colleagues are not experiencing the issue, although none of them run Fedora 28. The app is unfortunately not open source.
I am not quite sure where to start debugging this. Could you provide some pointers where to start? Debug logs provide no further information.