basho / bitcask

because you need another a key/value storage engine
1.29k stars 173 forks source link

Startup crash trying to read key from corrupted data file [JIRA: RIAK-1819] #212

Closed engelsanchez closed 9 years ago

engelsanchez commented 9 years ago

On startup, when a hintfile is not available or does not pass the CRC check, Bitcask loads keys directly from the data file. When it does, it is not validating each record in the data file against its CRC. This changed as part of an optimization introduced last year that tried to skip over values when reading the data file only to read keys. Unfortunately this means that it can not use the CRC to verify the validity of the key or value, and may end up with crash errors like the one below. Since this is only an optimization, and it only applies when hintfiles are not available, I propose removing it to handle corruption properly.

It may be worth to also protect the key conversion code with a try/catch to also protect from loading a key in an unhandle format. This could happen as part of a downgrade or similar situation.

[error] <0.730.0> scan_key_files: error function_clause @ [
{riak_kv_bitcask_backend,key_transform_to_1,[<<>>],[{file,"src/riak_kv_bitcask_back
end.erl"},{line,99}]},{bitcask,'-scan_key_files/5-fun-0-',7,[{file,"src/bitcask.erl
"},{line,1188}]},{bitcask_fileops,fold_keys_int_loop,5,[{file,"src/bitcask_fileops.
erl"},{line,595}]},{bitcask_fileops,fold_file_loop,8,[{file,"src/bitcask_fileops.er
l"},{line,720}]},{bitcask_fileops,fold_keys_loop,4,[{file,"src/bitcask_fileops.erl"
},{line,575}]},{bitcask,scan_key_files,5,[{file,"src/bitcask.erl"},{line,1196}]},{b
itcask,init_keydir_scan_key_files,4,[{file,"src/bitcask.erl"},{line,1289}]},{bitcas
k,init_keydir,4,[{file,"src/bitcask.erl"},{line,1241}]}]
krestenkrab commented 9 years ago

Which version of Riak introduced this issue? We have a Riak installation on some apparently very unreliable hardware, where this has bitten us a couple of times over the last month. I'd like to downgrade that setup to a "pre-optimization" version.

engelsanchez commented 9 years ago

@krestenkrab It seems I was wrong about the origin of the lack of CRC checks when only reading keys from data files instead of hintfiles. The optimization I was thinking of introduced variable chunk sizes while reading the file, but the code all the way back to pre 1.4 was already skipping the value and not computing the CRC:

https://github.com/basho/bitcask/blob/1.5.0/src/bitcask_fileops.erl#L370

So we have been vulnerable to corruption in that scenario for a long time apparently.

evanmcc commented 9 years ago

IIRC, this just preserves the original behavior, so I am not sure there is any before.

engelsanchez commented 9 years ago

This was fixed in https://github.com/basho/bitcask/pull/214, included in Bitcask 1.7.2 and shipped with Riak 2.0.6. It will ship in the upcoming 2.1 point release.

oleksiyk commented 8 years ago

I got the same error when one of the Riak 2.1.1 nodes went down (just Riak server went down, not the server itself) and now I can't start Riak with the following error:

2016-02-28 12:41:57.134 [warning] <0.4207.0> Hintfile '/var/lib/riak/fs_chunks/422465317040964124793252646957050560369293000704/5.bitcask.hint' invalid
2016-02-28 12:41:57.164 [error] <0.4207.0> scan_key_files: error function_clause @ [{riak_kv_bitcask_backend,key_transform_to_1,[<<>>],[{file,"src/riak_kv_bitcask_backend.erl"},{line,99}]},{bitcask,'-scan_key_files/5-fun-0-',7,[{file,"src/bitcask.erl"},{line,1188}]},{bitcask_fileops,fold_keys_int_loop,5,[{file,"src/bitcask_fileops.erl"},{line,595}]},{bitcask_fileops,fold_file_loop,8,[{file,"src/bitcask_fileops.erl"},{line,720}]},{bitcask_fileops,fold_keys_loop,4,[{file,"src/bitcask_fileops.erl"},{line,575}]},{bitcask,scan_key_files,5,[{file,"src/bitcask.erl"},{line,1196}]},{bitcask,init_keydir_scan_key_files,4,[{file,"src/bitcask.erl"},{line,1289}]},{bitcask,init_keydir,4,[{file,"src/bitcask.erl"},{line,1241}]}]

I've updated Riak to 2.1.3 on the failed node and it started successfully.