basho / riak_core

Distributed systems infrastructure used by Riak.
Apache License 2.0

unexplained riak_core_vnode_worker_pool crashes #923

Open. russelldb opened this issue 6 years ago

russelldb commented 6 years ago

Reported by NHS SUS team:

2018-07-23 07:22:03 =ERROR REPORT====
** State machine <0.2125.0> terminating 
** Last event in was {work,{fold,#Fun<riak_kv_eleveldb_backend.6.107590332>,#Fun<riak_kv_vnode.53.104106215>},{fsm,{38526933,{1027618338748291114361965898003636498195577569280,'riak@10.239.88.11'}},<18840.27968.4469>}}
** When State == ready
**      Data  == {state,{[],[]},<0.2126.0>,[{<0.2133.0>,#Ref<0.0.31871.81035>,{fsm,{77601312,{1027618338748291114361965898003636498195577569280,'riak@10.239.88.11'}},<18844.22508.3877>},{fold,#Fun<riak_kv_eleveldb_backend.6.107590332>,#Fun<riak_kv_vnode.53.104106215>}},{<0.2128.0>,#Ref<0.0.31871.80638>,{fsm,{933708,{1027618338748291114361965898003636498195577569280,'riak@10.239.88.11'}},<18830.22378.4479>},{fold,#Fun<riak_kv_eleveldb_backend.6.107590332>,#Fun<riak_kv_vnode.53.104106215>}},{<0.2137.0>,#Ref<0.0.22261.147569>,{server,undefined,{<0.29214.4450>,#Ref<0.0.22261.147543>}},{fold,#Fun<riak_kv_eleveldb_backend.9.107590332>,#Fun<riak_kv_vnode.54.104106215>}}],undefined}
** Reason for termination = 
** {timeout,{gen_fsm,sync_send_event,[<0.2126.0>,{checkout,false,5000},5000]}}
2018-07-23 07:22:03 =CRASH REPORT====
  crasher:
    initial call: riak_core_vnode_worker_pool:init/1
    pid: <0.2125.0>
    registered_name: []
    exception exit: {{timeout,{gen_fsm,sync_send_event,[<0.2126.0>,{checkout,false,5000},5000]}},[{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,622}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
    ancestors: [<0.1242.0>,riak_core_vnode_sup,riak_core_sup,<0.218.0>]
    messages: [{'$gen_event',{work,{fold,#Fun<riak_kv_eleveldb_backend.6.107590332>,#Fun<riak_kv_vnode.53.104106215>},{fsm,{90239418,{1027618338748291114361965898003636498195577569280,'riak@10.239.88.11'}},<18831.29286.4479>}}},{#Ref<0.0.31871.81037>,<0.2129.0>},{'$gen_event',{work,{fold,#Fun<riak_kv_eleveldb_backend.6.107590332>,#Fun<riak_kv_vnode.53.104106215>},{fsm,{14318328,{1027618338748291114361965898003636498195577569280,'riak@10.239.88.11'}},<18830.30450.4482>}}},{'$gen_event',{work,{fold,#Fun<riak_kv_eleveldb_backend.6.107590332>,#Fun<riak_kv_vnode.53.104106215>},{fsm,{118930845,{1027618338748291114361965898003636498195577569280,'riak@10.239.88.11'}},<18844.25830.3877>}}}]
    links: [<0.1242.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 2586
    stack_size: 27
    reductions: 86077309
  neighbours:

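Looking at the code, this appears to be a poolboy:checkout/2 call timing out: the gen_fsm:sync_send_event of {checkout,false,5000} to <0.2126.0> got no reply within 5000 ms, and the resulting exit terminated the riak_core_vnode_worker_pool process <0.2125.0>.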
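
For readers unfamiliar with this failure mode, here is a minimal, language-neutral sketch in Python of how a synchronous call with a timeout can fail even when the request itself is non-blocking. It is not riak_core or poolboy code: `ToyPool`, `CallTimeout`, `stall` and `demo` are hypothetical names invented for illustration, and the "stall" message merely stands in for whatever might keep the callee from answering. It does not establish the cause of the crash reported here.

```python
import queue
import threading
import time


class CallTimeout(Exception):
    """Raised when the server does not reply within the caller's timeout."""


class ToyPool:
    """Hypothetical server that handles one mailbox message at a time."""

    def __init__(self, workers):
        self.free = list(workers)
        self.mailbox = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            request, reply_to = self.mailbox.get()
            if request[0] == "stall":
                time.sleep(request[1])  # simulate the server being busy
            elif request[0] == "checkout":
                # Non-blocking checkout: reply at once, even when empty.
                reply_to.put(self.free.pop() if self.free else "full")

    def stall(self, seconds):
        self.mailbox.put((("stall", seconds), None))

    def checkout(self, timeout):
        # Synchronous call: send the request, then wait for the reply.
        reply_to = queue.Queue(maxsize=1)
        self.mailbox.put((("checkout", False), reply_to))
        try:
            return reply_to.get(timeout=timeout)
        except queue.Empty:
            raise CallTimeout(("checkout", False, timeout)) from None


def demo():
    pool = ToyPool(workers=["w1"])
    first = pool.checkout(timeout=0.5)   # server idle: a worker is returned
    second = pool.checkout(timeout=0.5)  # pool empty: still an immediate reply
    pool.stall(0.5)                      # server busy longer than the timeout
    try:
        third = pool.checkout(timeout=0.1)
    except CallTimeout:
        third = "timeout"
    return first, second, third
```

In this toy model an empty pool answers "full" straight away, so the timeout only appears once the server itself is too busy to read its mailbox. If the real checkout behaves similarly, the question to investigate would be why <0.2126.0> was unresponsive, rather than only whether the pool was exhausted.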

russelldb commented 6 years ago

Adding this note so that we don't bark up an old deadlock tree (https://github.com/basho/riak_core/commit/7a60b6e80a7556f2bfb49021340079bee8bd11a9).