basho / riak_kv

Riak Key/Value Store
Apache License 2.0
650 stars 233 forks source link

bitcask backend can fail to start due to timeout race with merger worker #592

Closed evanmcc closed 11 years ago

evanmcc commented 11 years ago

see https://github.com/basho/bitcask/issues/95 for info on race.

First we see this:

2013-06-29 10:32:02.231 [error] <0.24133.405>@riak_kv_bitcask_backend:start:113 Failed to start bitcask backend: timeout
2013-06-29 10:32:02.236 [error] <0.24133.405>@riak_kv_vnode:init:355 Failed to start riak_kv_multi_backend Reason: [{riak_kv_bitcask_backend,timeout}]
2013-06-29 10:32:02.238 [notice] <0.24133.405>@riak:stop:46 "backend module failed to start."
2013-06-29 10:32:02.239 [error] <0.24133.405> CRASH REPORT Process <0.24133.405> with 0 neighbours exited with reason: [{riak_kv_bitcask_backend,timeout}] in gen_fsm:init_it/6 line 371
2013-06-29 10:32:02.242 [error] <0.186.0> Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.24133.405> exit with reason [{riak_kv_bitcask_backend,timeout}] in context child_terminated
2013-06-29 10:32:02.250 [error] <0.286.0> gen_server riak_core_vnode_manager terminated with reason: no match of right hand value {error,[{riak_kv_bitcask_backend,timeout}]} in riak_core_vnode_manager:get_vnode/3 line 517
2013-06-29 10:32:02.285 [error] <0.286.0> CRASH REPORT Process riak_core_vnode_manager with 0 neighbours exited with reason: no match of right hand value {error,[{riak_kv_bitcask_backend,timeout}]} in riak_core_vnode_manager:get_vnode/3 line 517 in gen_server:terminate/6 line 747
2013-06-29 10:32:02.288 [error] <0.184.0> Supervisor riak_core_sup had child riak_core_vnode_manager started with riak_core_vnode_manager:start_link() at <0.286.0> exit with reason no match of right hand value {error,[{riak_kv_bitcask_backend,timeout}]} in riak_core_vnode_manager:get_vnode/3 line 517 in context child_terminated
2013-06-29 10:32:02.292 [error] <0.3876.0> CRASH REPORT Process proxy_riak_kv_vnode_51380916937414555718098294900181824909778878464 with 0 neighbours exited with reason: {{{badmatch,{error,[{riak_kv_bitcask_backend,timeout}]}},[{riak_core_vnode_manager,get_vnode,3,[{file,"src/riak_core_vnode_manager.erl"},{line,517}]},{riak_core_vnode_manager,handle_call,3,[{file,"src/riak_core_vnode_manager.erl"},{line,273}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]},{gen_server,call,[riak_core_vnode_manager,{51380916937414555718098294900181824909778878464,riak_kv_vnode,get_vnode},infinity]}} in gen_server:call/3 line 188
2013-06-29 10:32:02.295 [error] <0.198.0> Supervisor riak_core_vnode_proxy_sup had child {riak_kv_vnode,51380916937414555718098294900181824909778878464} started with riak_core_vnode_proxy:start_link(riak_kv_vnode, 51380916937414555718098294900181824909778878464) at <0.3876.0> exit with reason {{{badmatch,{error,[{riak_kv_bitcask_backend,timeout}]}},[{riak_core_vnode_manager,get_vnode,3,[{file,"src/riak_core_vnode_manager.erl"},{line,517}]},{riak_core_vnode_manager,handle_call,3,[{file,"src/riak_core_vnode_manager.erl"},{line,273}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]},{gen_server,call,[riak_core_vnode_manager,{51380916937414555718098294900181824909778878464,riak_kv_vnode,get_vnode},infinity]}} in context child_terminated

Followed by:

2013-06-29 10:32:18.846 [info] <0.24351.405> Merged ["/var/lib/riak/bucket/51380916937414555718098294900181824909778878464",[{data_root,"/var/lib/riak/bucket"},{expiry_secs,3888000},{read_write,true}],["/var/lib/riak/bucket/51380916937414555718098294900181824909778878464/8.bitcask.data","/var/lib/riak/bucket/51380916937414555718098294900181824909778878464/7.bitcask.data","/var/lib/riak/bucket/51380916937414555718098294900181824909778878464/6.bitcask.data","/var/lib/riak/bucket/51380916937414555718098294900181824909778878464/5.bitcask.data","/var/lib/riak/bucket/51380916937414555718098294900181824909778878464/4.bitcask.data","/var/lib/riak/bucket/51380916937414555718098294900181824909778878464/3.bitcask.data","/var/lib/riak/bucket/51380916937414555718098294900181824909778878464/2.bitcask.data","/var/lib/riak/bucket/51380916937414555718098294900181824909778878464/1.bitcask.data"]] in 21.117164 seconds.

which suggests a race between the two.

cc @jcapricebasho @bsparrow435

evanmcc commented 11 years ago

Closing, all of the solutions for this one look to live most cleanly in bitcask itself.