Closed engelsanchez closed 9 years ago
Which version of Riak introduced this issue? We have a Riak installation on some apparently very unreliable hardware, where this has bitten us a couple of times over the last month. I'd like to downgrade that setup to a "pre-optimization" version.
@krestenkrab It seems I was wrong about the origin of the lack of CRC checks when only reading keys from data files instead of hintfiles. The optimization I was thinking of introduced variable chunk sizes while reading the file, but the code all the way back to pre 1.4 was already skipping the value and not computing the CRC:
https://github.com/basho/bitcask/blob/1.5.0/src/bitcask_fileops.erl#L370
So we have been vulnerable to corruption in that scenario for a long time apparently.
IIRC, this just preserves the original behavior, so I am not sure there is any before.
This was fixed in https://github.com/basho/bitcask/pull/214, included in Bitcask 1.7.2 and shipped with Riak 2.0.6. It will ship in the upcoming 2.1 point release.
I got the same error when one of the Riak 2.1.1 nodes went down (just Riak server went down, not the server itself) and now I can't start Riak with the following error:
2016-02-28 12:41:57.134 [warning] <0.4207.0> Hintfile '/var/lib/riak/fs_chunks/422465317040964124793252646957050560369293000704/5.bitcask.hint' invalid
2016-02-28 12:41:57.164 [error] <0.4207.0> scan_key_files: error function_clause @ [{riak_kv_bitcask_backend,key_transform_to_1,[<<>>],[{file,"src/riak_kv_bitcask_backend.erl"},{line,99}]},{bitcask,'-scan_key_files/5-fun-0-',7,[{file,"src/bitcask.erl"},{line,1188}]},{bitcask_fileops,fold_keys_int_loop,5,[{file,"src/bitcask_fileops.erl"},{line,595}]},{bitcask_fileops,fold_file_loop,8,[{file,"src/bitcask_fileops.erl"},{line,720}]},{bitcask_fileops,fold_keys_loop,4,[{file,"src/bitcask_fileops.erl"},{line,575}]},{bitcask,scan_key_files,5,[{file,"src/bitcask.erl"},{line,1196}]},{bitcask,init_keydir_scan_key_files,4,[{file,"src/bitcask.erl"},{line,1289}]},{bitcask,init_keydir,4,[{file,"src/bitcask.erl"},{line,1241}]}]
I've updated Riak to 2.1.3 on the failed node and it started successfully.
On startup, when a hintfile is not available or does not pass the CRC check, Bitcask loads keys directly from the data file. When it does, it is not validating each record in the data file against its CRC. This changed as part of an optimization introduced last year that tried to skip over values when reading the data file only to read keys. Unfortunately this means that it can not use the CRC to verify the validity of the key or value, and may end up with crash errors like the one below. Since this is only an optimization, and it only applies when hintfiles are not available, I propose removing it to handle corruption properly.
It may be worth to also protect the key conversion code with a try/catch to also protect from loading a key in an unhandle format. This could happen as part of a downgrade or similar situation.