basho / bitcask

because you need another a key/value storage engine
1.29k stars 173 forks source link

Repeated crashes during merge operations [JIRA: RIAK-1903] #217

Open binarytemple opened 9 years ago

binarytemple commented 9 years ago

This is a 1.4.1 (bitcask 1.6.3) deployment. Seeing repeated crashes during merge operations. Couldn't find any similar existing issues. No expectation of fix in this version, but creating for identification purposes/in case it exists in 1.7.

2015-06-20 05:10:24 =ERROR REPORT====
** Generic server <0.15562.5572> terminating 
** Last message in was {'DOWN',#Ref<0.0.242202.167542>,process,<0.15503.5572>,normal}
** When Server state == {state,undefined,undefined}
** Reason for termination == 
** {{badmatch,{error,badarg}},[{bitcask_file,handle_info,2,[{file,"src/bitcask_file.erl"},{line,170}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,607}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
2015-06-20 05:10:24 =CRASH REPORT====
  crasher:
    initial call: bitcask_file:init/1
    pid: <0.15562.5572>
    registered_name: []
    exception exit: {{{badmatch,{error,badarg}},[{bitcask_file,handle_info,2,[{file,"src/bitcask_file.erl"},{line,170}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,607}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]},[{gen_server,terminate,6,[{file,"gen_server.erl"},{line,747}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
    ancestors: [<0.15503.5572>]
    messages: []
    links: []
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 610
    stack_size: 24
    reductions: 423
  neighbours:
hcs42 commented 9 years ago

This failure is happens in this line of bitcask.erl because we fail to close a bitcask file. In 1.4.1, this line looks like this:

ok = file:close(Fd),

The ok was modified to _ by @slfritchie and @engelsanchez in this commit with the following commit message:

Buildbot isn't quite right in the head: I've no idea why I can run this test 3000x without failure, but Buildbot fails

bitcask_qc: merge2_test...
=ERROR REPORT==== 5-May-2014::04:56:18 ===
** Generic server <0.2324.0> terminating
** Last message in was {'DOWN',#Ref<0.0.0.50927>,process,<0.2317.0>,normal}
** When Server state == {state,undefined,undefined}
** Reason for termination ==
** {{badmatch,{error,badarg}},
    [{bitcask_file,handle_info,2,[{file,"src/bitcask_file.erl"},{line,193}]},
     {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,604}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}

Fix: Really, we're stopping, so change the pattern match to '_' so we don't care.

@slfritchie, @engelsanchez, do you think that if this file:close returns {error, badarg} in production, that is a problem? Thank you.

engelsanchez commented 9 years ago

Sounds like the file has already been explicitly closed before getting the 'DOWN' message. So this is called: https://github.com/basho/bitcask/blob/1b5ad684c1845f1cf7e8dd4f6fa5fe45e5baf3be/src/bitcask_file.erl#L150

And then we try to close the same already closed file descriptor and get a badarg. It seem it would be more robust to change the fd to undefined after closing it, and checking for undefined in the file:close for the 'DOWN' message.

hcs42 commented 9 years ago

Thank you, I'm happy with closing the ticket then.

engelsanchez commented 9 years ago

Leave it open @hcs42. It's a small race worth fixing.