datafuselabs / openraft

rust raft with improvements
Apache License 2.0

Avoid double-wal #617

Closed avantgardnerio closed 1 year ago

avantgardnerio commented 1 year ago

Is your feature request related to a problem? Please describe.

In the RocksDB example, as well as in numerous places in the documentation and discussions, it appears that apply_to_state_machine() is meant to be atomic and durable. Unfortunately, for some store implementations, persisting the whole data structure to disk could be quite costly. The RocksDB example shows an alternative approach, where RocksDB writes its WAL (but not the actual data) to disk. This is better, but doesn't it result in a double WAL? Raft has its own log (i.e. append_to_log()), and then the application needs its own WAL in order to complete apply_to_state_machine() quickly and durably.

Describe the solution you'd like

Instead of nested wals, would it be possible to allow the application to:

  1. not persist anything during apply_to_state_machine() and keep changes in memory
  2. the app can periodically flush to disk (data + last_applied_index) based on its own cron job
  3. if the app crashes, it replays the raft logs since last_applied_index (on its own, without being told by openraft)
  4. now that the state machine is rebuilt, the app re-joins the cluster
  5. Openraft is unaware that apply_to_state_machine() doesn't immediately persist, but this is fine because the app state machine is still restored prior to resuming anything raft related.
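The steps above can be sketched roughly as follows. This is a minimal, std-only Rust model and not openraft code: `Entry`, `Checkpoint`, `MemStateMachine`, `checkpoint()`, and `recover()` are hypothetical names, the "disk" is just a returned value, and the committed index passed to `recover()` is assumed to be known to the app.

```rust
use std::collections::BTreeMap;

/// Hypothetical committed log entry (index plus a key/value write); not an openraft type.
#[derive(Clone, Debug)]
struct Entry {
    index: u64,
    key: String,
    value: String,
}

/// What the app flushes on its own schedule (step 2): data + last_applied_index.
#[derive(Clone, Debug, Default)]
struct Checkpoint {
    data: BTreeMap<String, String>,
    last_applied: u64,
}

/// In-memory state machine: `apply` performs no disk I/O (step 1).
#[derive(Debug, Default)]
struct MemStateMachine {
    data: BTreeMap<String, String>,
    last_applied: u64,
}

impl MemStateMachine {
    /// Step 1: apply a committed entry to memory only.
    fn apply(&mut self, e: &Entry) {
        self.data.insert(e.key.clone(), e.value.clone());
        self.last_applied = e.index;
    }

    /// Step 2: periodic flush. A real app would write this atomically to disk;
    /// here the returned value plays the role of the on-disk copy.
    fn checkpoint(&self) -> Checkpoint {
        Checkpoint { data: self.data.clone(), last_applied: self.last_applied }
    }

    /// Step 3: restore the last checkpoint, then replay log entries in
    /// (checkpoint.last_applied, committed] without help from the Raft layer.
    fn recover(cp: &Checkpoint, raft_log: &[Entry], committed: u64) -> Self {
        let mut sm = MemStateMachine { data: cp.data.clone(), last_applied: cp.last_applied };
        for e in raft_log.iter().filter(|e| e.index > cp.last_applied && e.index <= committed) {
            sm.apply(e);
        }
        sm
    }
}

fn main() {
    let log: Vec<Entry> = (1..=5)
        .map(|i| Entry { index: i, key: format!("k{}", i % 2), value: format!("v{i}") })
        .collect();

    let mut sm = MemStateMachine::default();
    for e in &log[..3] {
        sm.apply(e);
    }
    let cp = sm.checkpoint(); // "flushed" at index 3
    for e in &log[3..] {
        sm.apply(e); // indexes 4 and 5 live only in memory
    }

    // Crash: the in-memory state is lost; `cp` and the Raft log survive.
    // Assume all 5 entries are known to be committed.
    let recovered = MemStateMachine::recover(&cp, &log, 5);
    assert_eq!(recovered.last_applied, 5);
    assert_eq!(recovered.data, sm.data);
    println!("recovered up to index {}", recovered.last_applied);
}
```

Replay is bounded by `committed` because a node's local Raft log can hold entries that were appended but never committed; applying those during recovery could make the state machine diverge from the rest of the cluster.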

Forgive me, I may be misunderstanding how this is supposed to work, but it appears #437 is effectively asking the same thing? I saw another issue (#208) about pipelining requests to apply_to_state_machine() to help alleviate this issue, but perhaps the solution above is either better or at least complementary?

Describe alternatives you've considered

  1. Creating an app wal in addition to flushing log entries to disk (WAL for both append_to_log and apply_to_state_machine)
  2. Flushing full state to disk in apply_to_state_machine() and having a really slow application.

drmingdrmer commented 1 year ago

You are right. An application only needs one WAL. Raft is just a distributed WAL.

  • not persist anything during apply_to_state_machine() and keep changes in memory
  • the app can periodically flush to disk (data + last_applied_index) based on its own cron job

It's OK not to persist the state, although you may want to persist the last_applied_index.

If you do not persist any state at all, then when a node restarts, just rebuild the state machine from the last snapshot and wait for openraft to inform RaftStorage to re-commit and re-apply the remaining logs.

Alternatively, persist last_applied_index before apply_to_state_machine() returns. Then, when a node starts, rebuild the state machine from the last snapshot and re-apply logs up to last_applied_index.
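Both restart strategies described above can be modeled roughly as below. This is a sketch under stated assumptions, not openraft's API: `Snapshot`, `Disk`, `StateMachine`, `apply_persisting_index()`, `restart_option_a()`, and `restart_option_b()` are illustrative names, and `Disk` merely stands in for durable storage that survives a crash.

```rust
use std::collections::BTreeMap;

/// Hypothetical committed log entry; not an openraft type.
#[derive(Clone)]
struct Entry { index: u64, key: String, value: String }

/// Hypothetical snapshot: full state plus the last index it covers.
#[derive(Clone, Default)]
struct Snapshot { data: BTreeMap<String, String>, last_applied: u64 }

/// Stand-in for durable storage that survives a crash.
#[derive(Default)]
struct Disk { snapshot: Snapshot, last_applied: Option<u64> }

struct StateMachine { data: BTreeMap<String, String>, last_applied: u64 }

impl StateMachine {
    fn from_snapshot(s: &Snapshot) -> Self {
        StateMachine { data: s.data.clone(), last_applied: s.last_applied }
    }

    /// Option B: data stays in memory; only last_applied_index is persisted before returning.
    fn apply_persisting_index(&mut self, e: &Entry, disk: &mut Disk) {
        self.data.insert(e.key.clone(), e.value.clone());
        self.last_applied = e.index;
        disk.last_applied = Some(e.index); // in a real store: one small durable write
    }
}

/// Option A: nothing persisted beyond the snapshot (any stored index is ignored).
/// The state machine reports the snapshot index as last applied, and the Raft layer
/// is expected to re-apply newer logs once it knows they are committed.
fn restart_option_a(disk: &Disk) -> StateMachine {
    StateMachine::from_snapshot(&disk.snapshot)
}

/// Option B: rebuild from the snapshot, then re-apply logs up to the persisted index.
fn restart_option_b(disk: &Disk, raft_log: &[Entry]) -> StateMachine {
    let mut sm = StateMachine::from_snapshot(&disk.snapshot);
    let from = sm.last_applied;
    let upto = disk.last_applied.unwrap_or(from);
    for e in raft_log.iter().filter(|e| e.index > from && e.index <= upto) {
        sm.data.insert(e.key.clone(), e.value.clone());
        sm.last_applied = e.index;
    }
    sm
}

fn main() {
    let log: Vec<Entry> = (1..=6)
        .map(|i| Entry { index: i, key: format!("k{i}"), value: format!("v{i}") })
        .collect();
    let mut disk = Disk::default();
    let mut sm = StateMachine::from_snapshot(&disk.snapshot);

    for e in &log[..2] { sm.apply_persisting_index(e, &mut disk); }
    // Take a snapshot covering indexes 1..=2.
    disk.snapshot = Snapshot { data: sm.data.clone(), last_applied: sm.last_applied };
    for e in &log[2..5] { sm.apply_persisting_index(e, &mut disk); } // indexes 3..=5

    // Crash, then restart with each strategy.
    let a = restart_option_a(&disk);
    assert_eq!(a.last_applied, 2);
    let b = restart_option_b(&disk, &log);
    assert_eq!(b.last_applied, 5);
    assert_eq!(b.data, sm.data);
}
```

In option A the app only reports the snapshot index and, per the comment above, leaves re-applying the newer logs to openraft. In option B the app replays the gap itself, and entry 6 (present in the log but never applied) is left untouched.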

Forgive me, I may be misunderstanding how this is supposed to work, but it appears #437 is effectively asking the same thing?

Yes, it is meant to get rid of the expensive state-persisting operation.

#208 is a different state machine topic: it is about batch-applying logs, and won't conflict with what you want to do.
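To illustrate why batch-applying is orthogonal to keeping state in memory, here is a hypothetical sketch; it is not the design proposed in #208. `apply_batch()` and the `index_writes` counter are invented for illustration, and the counter only stands in for a durable write of last_applied_index.

```rust
use std::collections::BTreeMap;

/// Hypothetical committed log entry; not an openraft type.
struct Entry { index: u64, key: String, value: String }

#[derive(Default)]
struct StateMachine {
    data: BTreeMap<String, String>,
    last_applied: u64,
    /// Counts (simulated) durable writes of last_applied_index.
    index_writes: u32,
}

impl StateMachine {
    /// Apply a batch of consecutive committed entries in memory, then record
    /// last_applied_index once for the whole batch instead of once per entry.
    fn apply_batch(&mut self, batch: &[Entry]) {
        for e in batch {
            debug_assert_eq!(e.index, self.last_applied + 1, "entries must be applied in order");
            self.data.insert(e.key.clone(), e.value.clone());
            self.last_applied = e.index;
        }
        if !batch.is_empty() {
            self.index_writes += 1; // stand-in for one small durable write
        }
    }
}

fn main() {
    let batch: Vec<Entry> = (1..=4)
        .map(|i| Entry { index: i, key: format!("k{i}"), value: format!("v{i}") })
        .collect();
    let mut sm = StateMachine::default();
    sm.apply_batch(&batch);
    assert_eq!(sm.last_applied, 4);
    assert_eq!(sm.index_writes, 1);
    println!("applied {} entries with {} index write(s)", sm.data.len(), sm.index_writes);
}
```

Batching only changes how many entries are handled per call; whether the data itself is persisted immediately, periodically, or not at all is a separate choice.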

avantgardnerio commented 1 year ago

Thank you @drmingdrmer for taking the time to provide this detailed description.

I did have one remaining question regarding:

just rebuild state machine from the last snapshot, and wait for openraft to inform RaftStorage to re-commit and re-apply other logs.

Assuming we haven't flushed either the state or last_applied_index during the last apply_to_state_machine(), do we need any special initialization logic when the node crashes and restarts? Or does openraft call last_applied_state() and, as long as we return the correct results, do all the necessary initialization (re-applying logs to the state machine) for us?

avantgardnerio commented 1 year ago

n/m, I think you answered the above question here: https://github.com/datafuselabs/openraft/discussions/616#discussioncomment-4444773

Whether to apply these logs has to be left to openraft to decide

ty!! 😄