cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go
BSD 3-Clause "New" or "Revised" License

db: `NoSync` should (optionally?) flush writes to the OS #2624

Open erikgrinaker opened 1 year ago

erikgrinaker commented 1 year ago

Committing a batch with Sync: false not only disables fsync, but also disables flushing the write to the OS. This means that it's not only vulnerable to data loss on OS crashes or power loss, but also on unclean process termination, which is a much more common failure.

This is particularly problematic with e.g. Raft replication, where some users may want to disable fsync for the performance gain, accepting the risk of data loss when a quorum of replicas loses writes, on the reasoning that correlated power loss or OS crashes across failure domains are rare. However, correlated process crashes are much more likely: consider e.g. a panic or OOM that's triggered by user queries, where these are load balanced across several nodes holding replicas of the range.

To mitigate this risk, we should either make Sync: false flush the write to the OS, or add another setting that enables this behavior. This may imply slightly lower performance, but that seems worth the significant reliability improvement.

Jira issue: PEBBLE-61

jbowens commented 1 year ago

One note around safety: it's not strictly more safe to write to the OS buffer cache. As described in the fsyncgate postgres emails, the OS buffer cache is unreliable in the event that there's a failure while explicitly fsyncing or during a background write back. This can manifest in many different unexpected ways, like data appearing to be persisted but then rewinding some time later. The only way to be assured of the state of the OS buffer cache after a fsync error is to power cycle the machine (even then, that's only if the error makes it to userland which apparently isn't guaranteed). This is one of the motivating factors for exploring direct I/O for the WAL.

jbowens commented 1 year ago

A relevant RocksDB blog post: FlushWAL; less fwrite, faster writes

Apparently, RocksDB exposes a method to explicitly request the WAL be flushed to the OS cache.

erikgrinaker commented 1 year ago

This is one of the motivating factors for exploring direct I/O for the WAL.

If we do end up supporting running Raft without fsync, I suppose we could disable direct I/O in that case and flush to the OS buffer cache instead? I think that'd likely be beneficial anyway, since it presumably has a much larger buffer?

sumeerbhola commented 1 year ago

fwiw, I'm not keen on doing anything here without a clear design on how CRDB is going to function with raft flushing only to the OS buffer cache. I realize there is discussion on https://github.com/cockroachdb/cockroach/issues/88442, but that is far from being a design. Otherwise we are just adding another error prone piece of functionality (as mentioned by @jbowens in https://github.com/cockroachdb/pebble/issues/2624#issuecomment-1587446172) to Pebble. The kv.raft_log.disable_synchronization_unsafe cluster setting has warning language, and perhaps should say even stronger things to prevent anyone from using it.

erikgrinaker commented 1 year ago

Yeah, this is very speculative; we don't need any immediate action here.

We do have large users running with fsync disabled already though, and presumably some could end up doing so via kv.raft_log.disable_synchronization_unsafe instead of at the filesystem level. Personally I thought those options were equivalent, so I wouldn't be surprised if users made the same mistake.

It would probably be prudent to add this detail to the help text of that setting. I'll submit a PR.

erikgrinaker commented 1 year ago

Updating help text: https://github.com/cockroachdb/cockroach/pull/104888

rjl493456442 commented 1 year ago

Hey! We also encountered this problem in the go-ethereum project.

Our case: go-ethereum uses two different storage engines together: (1) a key-value database and (2) an extra storage engine called the freezer, which is backed by a handful of append-only flat files.

We have always assumed that once writes are committed to the key-value store, they will not be lost due to a process crash. On an OS crash or disk failure, all writes submitted to the OS buffer cache (including the subsequent writes to the freezer) are lost together. This gives us a certain degree of write-ordering guarantee between the two storage engines.

However, switching from LevelDB to Pebble breaks this assumption because of the difference in WAL writing behavior. We would also like to request an additional write mode that (1) writes the WAL to the OS cache without fsync.