grepz opened this issue 6 years ago
I noticed similar issues, and in testing have alleviated a lot of them with this fix: https://github.com/benoitc/hackney/pull/540. It's not a complete fix, but it definitely helps the symptoms here. @benoitc
I'm not sure if it's the same problem, but my pools are getting stuck as well. I'm using hackney 1.15.0. Running get_stats on the pool returns:
```
iex(transporter@10.0.72.101)14> :hackney_pool.get_stats(:"client-logger")
[
  name: :"client-logger",
  max: 400,
  in_use_count: 0,
  free_count: 0,
  queue_count: 13961
]
```
in_use_count is 0 but the queue count is huge.
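For reference, the stuck state reported above can be detected mechanically: connections are neither in use nor free, yet callers keep piling up in the queue. A minimal Python sketch (not hackney's API; the dict keys just mirror the `:hackney_pool.get_stats/1` output above):

```python
def pool_is_stuck(stats):
    """A pool looks 'stuck' when nothing is checked out and nothing
    is free, yet callers are still queued waiting for a connection."""
    return (stats["in_use_count"] == 0
            and stats["free_count"] == 0
            and stats["queue_count"] > 0)

# The stats as reported in the comment above:
stats = {"name": "client-logger", "max": 400,
         "in_use_count": 0, "free_count": 0, "queue_count": 13961}
print(pool_is_stuck(stats))  # True
```

A check like this could run periodically against each pool and raise an alert before the queue reaches the tens of thousands.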
Hello,
It is possible to overload a hackney pool and render it unresponsive, or even crash the Erlang node. The case is like this:
A process sends simple HTTP requests to a server that can handle no more than N RPS, at a rate of N * k (k >= 2). At some point the hackney pool's queue size begins to grow indefinitely (with my numbers, the receiving server handles ~300 RPS while the client sends 600-700 RPS). Process info for the pool shows this:
Pool configuration looks like this:
Since the message queue size is huge, the pool becomes completely unresponsive.
My proposal: maybe it's a good idea to set up some back pressure, or introduce a defensive mechanism that would avoid locking up the pool.
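One way to picture the proposed back pressure: cap the checkout wait queue and fail fast once it is full, instead of letting the pool process's mailbox grow without bound. A toy Python sketch of the idea (hypothetical names and error values; this is not hackney's implementation):

```python
from collections import deque

class BoundedPool:
    """Toy connection pool: rejects checkouts once the wait queue
    reaches max_queue, instead of queueing callers forever."""
    def __init__(self, max_conns, max_queue):
        self.max_conns = max_conns
        self.max_queue = max_queue
        self.in_use = 0
        self.waiting = deque()

    def checkout(self, caller):
        if self.in_use < self.max_conns:
            self.in_use += 1
            return ("ok", caller)
        if len(self.waiting) >= self.max_queue:
            # Back pressure: answer the caller immediately rather
            # than letting the queue (and mailbox) grow unbounded.
            return ("error", "pool_overloaded")
        self.waiting.append(caller)
        return ("queued", caller)

pool = BoundedPool(max_conns=2, max_queue=1)
print(pool.checkout("a"))  # ('ok', 'a')
print(pool.checkout("b"))  # ('ok', 'b')
print(pool.checkout("c"))  # ('queued', 'c')
print(pool.checkout("d"))  # ('error', 'pool_overloaded')
```

With a cap like this, a client sending at N * k RPS gets explicit overload errors it can react to, instead of the queue_count growing past 13,000 as shown above.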
Also, it seems a bit unclear that hackney_pool answers {error, connect_timeout} for a gen_server:call timeout to the pool. That adds confusion in the case described above: the connection was never even attempted, since the gen_server call itself timed out.
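To illustrate the distinction being asked for: a checkout that never reached the socket could return a separate error (say, checkout_timeout) instead of reusing connect_timeout. A hypothetical Python sketch (the error names here are assumptions for illustration, not hackney's actual return values):

```python
def acquire(pool_wait_exceeded, tcp_connect_ok):
    """Separate the two failure modes conflated in the report:
    - the pool call itself timed out (no connect was ever attempted)
    - the pool handed out a slot, but the TCP connect failed."""
    if pool_wait_exceeded:
        # gen_server:call to the pool timed out; reporting this as
        # connect_timeout is what causes the confusion above.
        return ("error", "checkout_timeout")  # hypothetical name
    if not tcp_connect_ok:
        return ("error", "connect_timeout")   # a real connect failure
    return ("ok", "connection")

print(acquire(pool_wait_exceeded=True, tcp_connect_ok=True))
# ('error', 'checkout_timeout')
print(acquire(pool_wait_exceeded=False, tcp_connect_ok=False))
# ('error', 'connect_timeout')
```

Distinct error values would let callers tell "the pool is overloaded" apart from "the remote server is unreachable", which call for very different responses.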