basho / bitcask

because you need another a key/value storage engine
1.29k stars 173 forks source link

Fix `O_SYNC` flag when opening data files #272

Open nuno-faria opened 5 months ago

nuno-faria commented 5 months ago

According to Riak's docs, using the POSIX C API to write files, together with the o_sync strategy, should result in synchronous data writes. However, when looking at the open file descriptors for the Bitcask data, we can see that only the .write.lock files are opened with the O_SYNC flag, e.g.:

COMMAND     PID   TID TASKCMD   USER   FD      TYPE         FILE-FLAG             DEVICE SIZE/OFF    NODE NAME
beam.smp  86800 87045 Eleveldb  root  489u      REG         RW,SYN,LG              0,183       86 2043031 /riak/rel/riak/data/bitcask/1324485858831130769622089379649131486563188867072/bitcask.write.lock
beam.smp  86800 87045 Eleveldb  root  490u      REG          RW,AP,LG              0,183    14767 2043032 /riak/rel/riak/data/bitcask/319703483166135013357056057156686910549735243776/1.bitcask.data
beam.smp  86800 87045 Eleveldb  root  491u      REG          RW,AP,LG              0,183     3036 2043033 /riak/rel/riak/data/bitcask/319703483166135013357056057156686910549735243776/1.bitcask.hint
beam.smp  86800 87045 Eleveldb  root  493u      REG          RW,AP,LG              0,183    13427 2043035 /riak/rel/riak/data/bitcask/1324485858831130769622089379649131486563188867072/1.bitcask.data
beam.smp  86800 87045 Eleveldb  root  494u      REG          RW,AP,LG              0,183     2760 2043036 /riak/rel/riak/data/bitcask/1324485858831130769622089379649131486563188867072/1.bitcask.hint
...

This means writes are still asynchronous, which could lead to data loss even after sending the reply to the client.

The reason for this is that while the get_file_open_flags C function adds the O_SYNC to the bitmask, it replaces it when it receives an ATOM_CREATE option. The easiest fix is to simply add the o_sync option to the end of the list at bitcask_fileops.erl:87, to ensure that it always appears after the create. Now the files are correctly opened with the O_SYNC flag:

COMMAND     PID   TID TASKCMD   USER   FD      TYPE         FILE-FLAG             DEVICE SIZE/OFF    NODE NAME
beam.smp   4080 6414 Eleveldb  root  553u      REG         RW,SYN,LG              0,183       85 2043159 /riak/rel/riak/data/bitcask/1324485858831130769622089379649131486563188867072/bitcask.write.lock
beam.smp   4080 6414 Eleveldb  root  554u      REG      RW,AP,SYN,LG              0,183     5106 2043160 /riak/rel/riak/data/bitcask/319703483166135013357056057156686910549735243776/1.bitcask.data
beam.smp   4080 6414 Eleveldb  root  555u      REG      RW,AP,SYN,LG              0,183     1058 2043161 /riak/rel/riak/data/bitcask/319703483166135013357056057156686910549735243776/1.bitcask.hint
beam.smp   4080 6414 Eleveldb  root  557u      REG      RW,AP,SYN,LG              0,183     3988 2043163 /riak/rel/riak/data/bitcask/1324485858831130769622089379649131486563188867072/1.bitcask.data
beam.smp   4080 6414 Eleveldb  root  558u      REG      RW,AP,SYN,LG              0,183      828 2043164 /riak/rel/riak/data/bitcask/1324485858831130769622089379649131486563188867072/1.bitcask.hint
...
martinsumner commented 3 months ago

@nsaadouni - don't know if you've picked up on this PR, but is this something your team could look at?

martinsumner commented 3 months ago

Note that this is also I believe broken when using the erlang bitcask io_mode. It adds the o_sync flag, when the mode supported in Erlang is to use sync instead.

I think this is an issue dating back to when a bespoke version of the erlang VM was used within Riak that was patched to add support for o_sync, but when the PR was accepted upstream the name was changed to sync.

For a new user of bitcask, I think use of the nif IO mode is questionable. Given the fact that native erlang file IO now uses dirty IO scheduling, it would seem to be obviously safer than using a bespoke nif without dirty-nif support.

Would be interesting to know what the current big users of bitcask actually use in production.

nuno-faria commented 3 months ago

@martinsumner In our case, the main reason we chose the nif mode was due to the warning provided in the most recent Riak documentation about Bitcask as the backend, stating that:

"Synchronous file I/O via o_sync is supported in Bitcask if io_mode is set to nif and is not supported in the erlang mode."

So we just relied on the nif interface, later discovering the bug.