elixir-plug / plug_cowboy

Plug adapter for the Cowboy web server

Plug.Cowboy.Drainer shut down abnormally #92

Open dantswain opened 1 year ago

dantswain commented 1 year ago

Hi!

My app has been shutting down seemingly at random. This only happens on my local machine (not on other developers' machines, and not in production environments), so I suspect it has something to do with my setup.

There are no logs when the app shuts down, and it doesn't seem to correlate with any particular behavior. It has been happening for at least a couple of weeks of regular daily development.

I finally got a clue by enabling SASL logging:

[error] Child Plug.Cowboy.Drainer of Supervisor MyApp.Endpoint shut down abnormally
** (exit) killed
Pid: #PID<0.1011.0>
Start Call: Plug.Cowboy.Drainer.start_link([refs: [MyAppWeb.Endpoint.HTTP]])
Restart: :permanent
Shutdown: 5000
Type: :worker

[notice] Application my_app exited: shutdown

Based on what Plug.Cowboy.Drainer does, it seems like this error could be a red herring. Any ideas how I could get more diagnostic information? I strongly suspect a memory limit, running out of file descriptors, or something else specific to my environment, but I can't seem to get any more clues.

I'm using plug_cowboy 2.6.0 with Phoenix 1.7.0-rc.2 on Elixir 1.14.3, OTP 25

josevalim commented 1 year ago

Can you try running with config :logger, handle_sasl_reports: true? My suspicion is that another child in your supervision tree is crashing, causing the tree to restart. During the restart, the supervisor has to shut down the current children. Cowboy cannot drain everything within the 5000ms shutdown timeout (probably because a browser connection is still open), so the drainer is killed after 5 seconds.
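
For reference, enabling that is a one-line Logger setting. A minimal sketch, assuming the usual config layout (config/dev.exs is just an assumed location):

# config/dev.exs (assumed location)
# Forward SASL reports (supervisor/crash reports) to Logger so shutdown
# reports from the supervision tree appear in the normal log output.
config :logger, handle_sasl_reports: true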

dantswain commented 1 year ago

This is with config :logger, handle_sasl_reports: true :grimacing:

iex(4)> Application.get_env(:logger, :handle_sasl_reports)
true

The log message just before this (in the most recent crash) is [info] Sent 200 in 36ms. Open to trying anything else!

josevalim commented 1 year ago

Hrm, that's very weird then! What could be killing it? Maybe you can play with Erlang's dbg or tracer and trace the messages between processes?

dantswain commented 1 year ago

I agree!

My hunch is I'm running out of file descriptors or something at a system level? I don't know how else it would be so quiet...

I've attached a tracer to my drainer process, so maybe that will show something. I don't think I can get away with tracing everything since it can be hours between crashes and I need to be able to work in the meantime :sweat:
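
For anyone following along, attaching a tracer to a single process with Erlang's :dbg can look roughly like the sketch below. The child-id lookup and the exact trace flags are assumptions, chosen to match the output in the next comment:

# Start the default dbg tracer.
:dbg.tracer()

# Find the drainer under the endpoint supervisor. The error report above
# shows it as child Plug.Cowboy.Drainer of the endpoint supervisor, but
# check Supervisor.which_children/1 for what your tree actually uses.
{_id, drainer_pid, _type, _modules} =
  MyAppWeb.Endpoint
  |> Supervisor.which_children()
  |> List.keyfind(Plug.Cowboy.Drainer, 0)

# Trace message sends/receives (:m), scheduling in/out (:running), and
# process events such as exits (:procs), with timestamps.
:dbg.p(drainer_pid, [:m, :running, :procs, :timestamp])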

dantswain commented 1 year ago

OK that got a little more info!

(<0.763.0>) out {gen,do_call,4} (Timestamp: {1677,83170,904349})
(<0.763.0>) in {gen,do_call,4} (Timestamp: {1677,83170,904389})
(<0.763.0>) << {#Ref<0.23370919.4183031816.181956>, [{specs,1},{active,4},{supervisors,0},{workers,4}]} (Timestamp: {1677,83170,904398})
(<0.763.0>) out {timer,sleep,1} (Timestamp: {1677,83170,904448})
(<0.763.0>) in {timer,sleep,1} (Timestamp: {1677,83171,905187})
(<0.763.0>) << timeout (Timestamp: {1677,83171,905236})
(<0.763.0>) <0.661.0> ! {'$gen_call',{<0.763.0>,#Ref<0.23370919.4183031816.181957>},count_children} (Timestamp: {1677,83171,905324})
(<0.763.0>) out {gen,do_call,4} (Timestamp: {1677,83171,905357})
(<0.763.0>) in {gen,do_call,4} (Timestamp: {1677,83171,905403})
(<0.763.0>) << {#Ref<0.23370919.4183031816.181957>, [{specs,1},{active,4},{supervisors,0},{workers,4}]} (Timestamp: {1677,83171,905417})
(<0.763.0>) out {timer,sleep,1} (Timestamp: {1677,83171,905467})
(<0.763.0>) in {timer,sleep,1} (Timestamp: {1677,83172,906176})
(<0.763.0>) << timeout (Timestamp: {1677,83172,906224})
(<0.763.0>) <0.661.0> ! {'$gen_call',{<0.763.0>,#Ref<0.23370919.4183031816.181958>},count_children} (Timestamp: {1677,83172,906282})
(<0.763.0>) out {gen,do_call,4} (Timestamp: {1677,83172,906313})
(<0.763.0>) in {gen,do_call,4} (Timestamp: {1677,83172,906356})
(<0.763.0>) << {#Ref<0.23370919.4183031816.181958>, [{specs,1},{active,4},{supervisors,0},{workers,4}]} (Timestamp: {1677,83172,906368})
(<0.763.0>) out {timer,sleep,1} (Timestamp: {1677,83172,906414})
(<0.763.0>) in {timer,sleep,1} (Timestamp: {1677,83172,911083})
(<0.763.0>) exit killed (Timestamp: {1677,83172,911114})
(<0.763.0>) out_exited 0 (Timestamp: {1677,83172,911131})
[error] Child Plug.Cowboy.Drainer of Supervisor MyAppWeb.Endpoint shut down abnormally
** (exit) killed
Pid: #PID<0.763.0>
Start Call: Plug.Cowboy.Drainer.start_link([refs: [MyAppWeb.Endpoint.HTTP]])
Restart: :permanent
Shutdown: 5000
Type: :worker
[notice] Application my_app exited: shutdown

There's more before that, but the pattern repeats. I know this isn't always reproducible, but if I restart the application and the drainer gets the same pid (<0.763.0>), then the other process in that exchange, <0.661.0>, appears to be ranch_conns_sup.
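
One way to check what a raw pid like <0.661.0> refers to, while the node is still up, is from IEx. A sketch using the built-in pid/3 helper (the process-dictionary keys below only exist for proc_lib-started processes):

# In IEx, while the process is still alive:
p = pid(0, 661, 0)

# :registered_name is often empty for Ranch's per-listener supervisors;
# the process dictionary's :"$initial_call" and :"$ancestors" entries
# usually identify the module and the supervision path.
Process.info(p, [:registered_name, :dictionary])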

josevalim commented 1 year ago

Hrm, this is weird because ranch has no idea about this process unless we are shutting it down!

dantswain commented 1 year ago

Am I reading the trace correctly that the drainer is sending {'$gen_call', {...}, count_children} to <0.661.0>, or is it the other way around? It looks like the drainer could be doing that here: https://github.com/elixir-plug/plug_cowboy/blob/master/lib/plug/cowboy/drainer.ex#L103-L105
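
For what it's worth, the count_children/sleep pattern in the trace is consistent with a drain-wait loop along the lines of the sketch below. This is an illustrative reconstruction, not the actual plug_cowboy or ranch source:

defmodule DrainSketch do
  # Poll a Ranch connection supervisor until no connections remain,
  # sleeping between checks. If the caller is brutally killed while
  # waiting, the loop dies inside the call or the sleep, which is what
  # the trace above appears to show.
  def wait_until_drained(conns_sup_pid, interval_ms) do
    # count_children is an ordinary supervisor call -- the
    # {'$gen_call', ..., count_children} messages in the trace
    case Supervisor.count_children(conns_sup_pid) do
      %{active: 0} ->
        :ok

      _still_active ->
        # the {timer,sleep,1} in/out entries in the trace
        :timer.sleep(interval_ms)
        wait_until_drained(conns_sup_pid, interval_ms)
    end
  end
end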

josevalim commented 1 year ago

Right, but the point is that we should not be in that piece of the code unless the drainer started to terminate!

dantswain commented 1 year ago

I'm trying to catch it in the act but I think it might be getting triggered by or during the automatic rebuild cycle. Does that ring any bells?

Is the trace saying that the call is timing out? What might cause it to time out? That might hint at the root problem.

I'll keep digging here, but it's a bit hard because the debugging is more distracting than the mild inconvenience of restarting manually every so often :joy: Definitely open to ideas of things to try.

dantswain commented 1 year ago

OK, I think I've actually tracked my problem down to another source: https://github.com/joken-elixir/joken_jwks/issues/40

I'm wondering, though, whether what I've seen here might hint at a problem in plug_cowboy. The traces make it look like a timeout occurs and then the drainer exits abnormally.

josevalim commented 1 year ago

Right, that was my point from the beginning. This is definitely a secondary failure: something is making your app fail first, but I am not sure why you can't see that first failure! I am not sure there is a bug here; for all intents and purposes, the shutdown/kill is happening for correct reasons.

einariii commented 1 year ago

I encountered this issue when a project didn't have the .env file used in the original dev environment.

N.B. this happened immediately when running mix phx.server, unlike the OP, who seemed to see this after deployment.