benoitc / hackney

simple HTTP client in Erlang

badarg received when using async #267

Open priestjim opened 8 years ago

priestjim commented 8 years ago

On a server handling a very high HTTP call volume, using async to "fire and forget" continuously produces this message in the error logger:

2015-12-09 23:56:17 =ERROR REPORT====
Error in process <0.3450.45> on node '[...]' with exit value:
{badarg,[{erlang,port_info,[#Port<0.148406>,connected],[]},{inet,tcp_controlling_process,2,[{file,"inet.erl"},{line,1522}]},{hackney_stream,stream_loop,4,[{file,"[...]hackney/src/hackney_stream.erl"},{line,85}]},{hackney_stream,init,4,[{file,"[...]hackney/src/hackney_stream.erl"},{line,30}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
2015-12-09 23:56:17 =SUPERVISOR REPORT====
     Supervisor: {local,hackney_sup}
     Context:    child_terminated
     Reason:     shutdown
     Offender:   [{pid,<0.1418.45>},{id,[...]},{mfargs,{hackney_pool,start_link,[[...],[{name,[...]},{timeout,120000},{max_connections,8192}]]}},{restart_type,permanent},{shutdown,10000},{child_type,worker}]

{badarg,[{erlang,port_info,[#Port<0.58183>,connected],[]},{inet,tcp_controlling_process,2,[{file,"inet.erl"},{line,1522}]},{hackney_stream,stream_loop,4,[{file,"[...]/hackney/src/hackney_stream.erl"},{line,85}]},{hackney_stream,init,4,[{file,"[..]/hackney/src/hackney_stream.erl"},{line,30}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}

The call that causes this issue is:

%% Fire and forget: the async response messages are received once and discarded.
hackney:request(post, URL, Headers, Body,
                [async, {connect_timeout, 10000}, {pool, PoolName}]),
_ = receive _ -> ok end,

Where the custom pool is defined with the following options:

[{timeout, 120000}, {max_connections, 8192}]
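
For reference, such a pool would be started roughly like this (hackney_pool:start_pool/2 as documented in hackney's README; PoolName stands for the application's own pool identifier):

%% Start a named connection pool with the options quoted above.
ok = hackney_pool:start_pool(PoolName,
                             [{timeout, 120000}, {max_connections, 8192}]).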
priestjim commented 8 years ago

I'm gonna assume hackney is trying to give the socket back to the parent, but the socket has already been closed remotely (by the remote server), hence the badarg. This situation should be handled the same way it is for the synchronous call.
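
A minimal sketch of what "handled the same way" could look like (give_back/2 is a hypothetical helper, not hackney's actual code): catch the badarg raised when the port is already closed and report it as a normal close instead of crashing.

%% Hypothetical guard, not hackney's actual code.
give_back(Socket, Owner) ->
    try gen_tcp:controlling_process(Socket, Owner) of
        ok -> ok;
        {error, Reason} -> {error, Reason}
    catch
        %% Raised from inet:tcp_controlling_process/2 when the port is
        %% already closed, as in the reports above.
        error:badarg -> {error, closed}
    end.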

benoitc commented 8 years ago

@priestjim

This situation should be managed in the same manner as the synchronous call.

How do you mean? Return an error?

priestjim commented 8 years ago

After further digging, I saw that this error should be getting caught here:

try
    stream_loop(Parent, Owner, Ref, Client#client{parser=Parser,
                                                  response_state=on_status})
catch Class:Reason ->

but it's not. I have no idea why!

jlouis commented 8 years ago

This may not be related to the above at all, but does stream_loop call proc_lib:hibernate/3 at any point? That would explain the problem: once the process hibernates, that catch is effectively gone from the call stack for good. I don't know exactly what would happen with the backtrace from erlang:get_stacktrace() though, so it could be unrelated. I'm fairly positive that code will never work as intended once the process hibernates, however.
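
A standalone sketch of the effect described here (hibernate_demo is a hypothetical module, not hackney code): after proc_lib:hibernate/3 the process resumes in wake/0 with a fresh stack, so the enclosing try/catch can never fire again.

-module(hibernate_demo).
-export([start/0, wake/0]).

start() ->
    proc_lib:spawn(fun() ->
        try
            %% The entire stack, including this try frame, is discarded here.
            proc_lib:hibernate(?MODULE, wake, [])
        catch
            Class:Reason ->
                %% Never reached once the process has hibernated.
                io:format("caught ~p:~p~n", [Class, Reason])
        end
    end).

wake() ->
    %% Consume the message that woke us up, then crash. The error is NOT
    %% caught above: that catch no longer exists on the call stack.
    receive go -> erlang:error(boom) end.

Sending go to the returned pid crashes it with an uncaught boom (plus a proc_lib crash report) instead of printing "caught ...", matching the uncaught badarg in the reports above.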

priestjim commented 8 years ago

I think it does, at multiple points. Lines 99, 140 and 241 can all be reached while in the stream_loop function.

@benoitc is there any critical reason to hibernate at those points? @jlouis you saved the day again :)

jlouis commented 8 years ago

But does it actually happen in your case that they are called? That could explain the problem you are seeing. I'd add a helper function to the module that calls hibernate, then use dbg to trace whether that function gets called in production. That should at least make the case feasible to check for.
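
For example (maybe_hibernate/4 is a hypothetical wrapper; the dbg calls are standard OTP tracing):

%% Hypothetical wrapper in hackney_stream around its hibernate call;
%% stream_loop/4 matches the arity in the stacktrace above.
maybe_hibernate(Parent, Owner, Ref, Client) ->
    proc_lib:hibernate(?MODULE, stream_loop, [Parent, Owner, Ref, Client]).

%% Then, from a remote shell on the production node (dbg:tpl/4 also
%% matches calls to non-exported functions):
dbg:tracer(),
dbg:p(all, call),
dbg:tpl(hackney_stream, maybe_hibernate, 4, []).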

priestjim commented 8 years ago

I am almost sure it does. The problem is that this only happens at an extremely high number of requests; trying it locally never triggers the issue. I'll give it a shot though.

benoitc commented 8 years ago

@priestjim all this code should be simpler. I have some code on a machine for it. I will try to finish the new_pool branch and do a release with it by the weekend. The general idea is to simplify the loop and the flow between the requester and the stream.

@jlouis I guess hackney_trace, when enabled, should be able to catch it. It may be worth a try.
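
For reference, assuming the hackney_trace API as shown in the project README:

%% Assumption: hackney_trace API as documented in hackney's README.
hackney_trace:enable(max, io),   %% trace at max level to stdout
%% ... reproduce the failing async request ...
hackney_trace:disable().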

priestjim commented 8 years ago

Please do, we're dying over here - we got ~350K calls/sec at one point :)

benoitc commented 8 years ago

@priestjim it should land on Tuesday, I made good progress on it :)

priestjim commented 8 years ago

@benoitc any progress on the new pool arch? Need a hand?

benoitc commented 8 years ago

@priestjim #276 is the ticket to track it. It should land sometime next week. In the meantime I may have a quick fix for the hard limit.

fogfish commented 5 years ago

Any updates on this issue? I am hitting exactly the same issue with version 1.15.1.

benoitc commented 5 years ago

Hmm, this should have been closed by #276. There are also two milestones coming: the 1.16 release next week, and 2.0, which will only support Erlang 21 and up. 2.0 is a complete rework of the connection handling.

fogfish commented 5 years ago

Are you saying that this bug is already fixed, or will it be closed after 2.0?