eBay / NuRaft

C++ implementation of Raft core logic as a replication library
Apache License 2.0

Persistent log store #258

Closed kishorekrd closed 2 years ago

kishorekrd commented 3 years ago

Hi,

Is there any reference implementation for Persistent log store with NuRaft with writing log records and snapshots to disk?

Thanks

greensky00 commented 3 years ago

Hi @kishorekrd

Unfortunately, our log store implementation is not open-sourced. We are using Jungle to implement log store, and you can refer to the below example: https://github.com/eBay/Jungle/blob/master/examples/example_log_store_mode.cc

Steamgjk commented 2 years ago

Hi @greensky00, does that mean the current implementation of NuRaft on GitHub does not include log persistence (writing to disk)? Raft requires log persistence to survive a power failure.

I notice there is a logger class that can call "write" to write to disk.

greensky00 commented 2 years ago

@Steamgjk You can implement your own log_store that calls fsync for each log append. We merely provide an example log store that is not durable.

logger is not a Raft log store -- it is for debugging log.

kishorekrd commented 2 years ago

Hi @greensky00, I am now writing each appended log record to disk in the "ulong append(ptr<log_entry>& entry);" call. Here is a scenario:

  1. Append persisted 10 log records
  2. Commit was called for 9 log records
  3. Those 9 log records are processed

In this case, if the system crashes, I will see 10 log records in the backup. How do I know that the 10th log record is not committed? What other information do I need to persist along with the appended log records to bring the system back to the same state as before the crash? In the example state_mgr.h, I see calls such as save_config() and save_state() that write to disk. When does Raft call these methods?

Steamgjk commented 2 years ago

I am not a contributor to NuRaft, so my understanding may be wrong (to be confirmed by @greensky00).

  1. You cannot know whether or not the 10 logs are committed from the single clue that you find them on the disk of ONE replica.
  2. But Raft guarantees that any logs committed before the crash must exist on the disk of at least one replica.
  3. I think save_state is related to snapshots, but log persistence needs to be done every time the replica (follower) replies to a grpc request from the leader. [Alert: the open-sourced NuRaft (like most open-source Raft implementations on GitHub) does not implement log persistence. In fact, a complete Raft implementation with log persistence will have very low performance: if you call fsync every time you persist a log, NuRaft will be very slow.]

greensky00 commented 2 years ago

@kishorekrd The last committed index should be persisted in the state machine, and should be retrieved via this API: https://github.com/eBay/NuRaft/blob/789cc75869a6914d4c13aab6c2d5b48dba198f68/include/libnuraft/state_machine.hxx#L273-L283

Or if you really want to know the committed index at the moment the server receives the request, you may use this callback function to persist it: https://github.com/eBay/NuRaft/blob/789cc75869a6914d4c13aab6c2d5b48dba198f68/src/handle_append_entries.cxx#L600-L601

But I wonder why you are taking care of the last committed index. It is natural that the log store can contain uncommitted logs at the end, and they will be soon committed or discarded after the first communication with the existing leader, as @Steamgjk mentioned.
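
To illustrate why uncommitted logs at the end of the log store are harmless, here is a minimal sketch of the usual Raft rule on a follower: an entry that matches the leader's term at the same index is kept, otherwise it and everything after it is replaced. The types Entry and TailLog and the function accept_from_leader are hypothetical names made up for this example; they are not NuRaft's log_entry or log_store, and the real code paths in NuRaft are more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; not NuRaft's log_entry / log_store.
struct Entry {
    uint64_t term;
    std::string data;
};

class TailLog {
public:
    // Append at the end and return the new entry's 1-based index.
    uint64_t append(Entry e) {
        entries_.push_back(std::move(e));
        return entries_.size();
    }

    // Overwrite the entry at `index` and drop every entry after it.
    void write_at(uint64_t index, Entry e) {
        assert(index >= 1 && index <= entries_.size() + 1);
        entries_.erase(entries_.begin() + static_cast<std::ptrdiff_t>(index - 1),
                       entries_.end());
        entries_.push_back(std::move(e));
    }

    uint64_t size() const { return entries_.size(); }
    uint64_t term_at(uint64_t index) const { return entries_.at(index - 1).term; }
    const std::string& data_at(uint64_t index) const { return entries_.at(index - 1).data; }

private:
    std::vector<Entry> entries_;
};

// Follower side: keep a local entry whose term matches the leader's entry at
// that index; otherwise replace it (and the tail after it) with the leader's.
void accept_from_leader(TailLog& log, uint64_t index, const Entry& e) {
    if (index <= log.size() && log.term_at(index) == e.term) return;
    log.write_at(index, e);
}
```

In the 10-appended / 9-committed scenario, the 10th entry either matches what the current leader sends (and is later committed) or is overwritten by the leader's entry, so the log store does not need to know in advance which case applies.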

And please note that it is preferable to call fsync in the end_of_append_batch API, https://github.com/eBay/NuRaft/blob/789cc75869a6914d4c13aab6c2d5b48dba198f68/include/libnuraft/log_store.hxx#L79-L86 rather than in each append: NuRaft sends multiple logs in a batch, so calling fsync per append is very inefficient.
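
The batching idea can be sketched roughly as follows, assuming a POSIX system. BatchedFileLog is a hypothetical, simplified class written for this example: its method names mirror the roles of append and end_of_append_batch, but the signatures are simplified and do not match log_store.hxx, and the on-disk record format (a length prefix followed by the payload) is an assumption of the sketch.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical, simplified log file; NOT the real nuraft::log_store interface.
class BatchedFileLog {
public:
    explicit BatchedFileLog(const std::string& path)
        : fd_(::open(path.c_str(), O_WRONLY | O_CREAT | O_APPEND, 0644)) {
        if (fd_ < 0) throw std::runtime_error("open failed");
    }
    ~BatchedFileLog() { ::close(fd_); }
    BatchedFileLog(const BatchedFileLog&) = delete;
    BatchedFileLog& operator=(const BatchedFileLog&) = delete;

    // Role of append(): write a length-prefixed record, but do not fsync yet.
    uint64_t append(const std::string& payload) {
        assert(fd_ >= 0);
        const uint32_t len = static_cast<uint32_t>(payload.size());
        write_all(&len, sizeof(len));
        write_all(payload.data(), payload.size());
        return ++last_index_;
    }

    // Role of end_of_append_batch(): one fsync covers the whole batch.
    bool end_of_append_batch() {
        ++fsync_calls_;
        return ::fsync(fd_) == 0;
    }

    uint64_t last_index() const { return last_index_; }
    uint64_t fsync_calls() const { return fsync_calls_; }

private:
    // Loop until all bytes are written; write() may be partial or interrupted.
    void write_all(const void* buf, std::size_t n) {
        const char* p = static_cast<const char*>(buf);
        while (n > 0) {
            const ssize_t w = ::write(fd_, p, n);
            if (w < 0) {
                if (errno == EINTR) continue;
                throw std::runtime_error("write failed");
            }
            p += w;
            n -= static_cast<std::size_t>(w);
        }
    }

    int fd_;
    uint64_t last_index_ = 0;
    uint64_t fsync_calls_ = 0;
};
```

With this shape, appending five entries and then calling end_of_append_batch() once costs a single fsync instead of five. The entries only count as durable after that call returns true.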

kishorekrd commented 2 years ago

Hi greensky00,

In my example, 10 log records are appended but only 9 are committed, which means I processed 9 committed log records in my state machine. Then the system crashed/rebooted. At recovery time, I will see 10 log records in the append log backup. For recreating my state machine, I will first process the latest snapshot and then have to process the log records from the append backup. If I don't know up to what index the log records are committed in the previous session, how can I restore the state machine to the same state as before the crash/reboot?

What is the difference between flush() and end_of_append_batch()?

Thanks

greensky00 commented 2 years ago

For recreating my state machine, I will first process the latest snapshot and then have to process the log records from the append backup. If I don't know up to what index the log records are committed in the previous session, how can I restore the state machine to the same state as before the crash/reboot?

As I mentioned above, your state machine should remember the last committed index, and return it via state_machine::last_commit_index() API call.

And also, is there any reason why you do this by yourself? As long as state_machine::last_commit_index() returns the correct index, you don't need to care about this; NuRaft does this automatically.

kishorekrd commented 2 years ago

Hi @greensky00, sorry, maybe I am missing some details here. After a system reboot, at recovery time, I think I need to restore state_machine::last_commit_index() so that Raft will treat the remaining log entries in the log store as uncommitted. The snapshot will have the last_commit_index at the time of its creation. The log store will have all the log entries with their index numbers. But how and where do I recover the last_commit_index as of the last reboot? Do I need to write it to disk on every commit? Currently I am writing only the snapshot and the log store to disk.

greensky00 commented 2 years ago

@kishorekrd During NuRaft initialization, last_commit_index should not be the last index number right before the reboot. It should be the last applied index of the current state machine, so that Raft can replay the log from the "last state" of the state machine.

Since you sync the data to disk for every snapshot creation, when NuRaft restarts, the last state of the state machine is the snapshot, hence you should replay the log from the index of the snapshot. Then, the first state_machine::last_commit_index() call should return the index that snapshot has.

For example, let's say we create a snapshot every 3 log appends.

commit log 1 -> state machine: log 1 / snapshot: empty
commit log 2 -> state machine: log 2 / snapshot: empty
commit log 3 -> state machine: log 3 / snapshot: log 3
commit log 4 -> state machine: log 4 / snapshot: log 3
--- crash & restart ---

After restart, the state machine: log 4 will be lost, you will load snapshot: log 3, thus the current state machine is state machine: log 3. In such a case, your last_commit_index should be 3.
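
The example above can be simulated with a small sketch. CounterSM, Snapshot, and replay are hypothetical names invented for this illustration, and the "state" is just a running sum. In a real deployment NuRaft itself drives the replay based on what state_machine::last_commit_index() returns; the replay function here only stands in for that step.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical snapshot: the state plus the index of the last log it includes.
struct Snapshot {
    uint64_t last_index;
    int64_t value;
};

// Hypothetical state machine: in-memory state is lost on a crash,
// only the snapshot (taken every 3 commits) is treated as durable.
class CounterSM {
public:
    void commit(uint64_t index, int64_t delta) {
        value_ += delta;
        last_applied_ = index;
        if (index % 3 == 0) durable_snapshot_ = Snapshot{index, value_};
    }

    // Simulate a crash: reload state from the durable snapshot.
    void crash_and_restart() {
        value_ = durable_snapshot_.value;
        last_applied_ = durable_snapshot_.last_index;
    }

    // After restart this is the snapshot's index, not the pre-crash commit index.
    uint64_t last_commit_index() const { return last_applied_; }
    int64_t value() const { return value_; }

private:
    int64_t value_ = 0;
    uint64_t last_applied_ = 0;
    Snapshot durable_snapshot_{0, 0};
};

// Stand-in for the replay step: apply logs after last_commit_index()
// up to the commit index learned from the cluster.
void replay(CounterSM& sm, const std::vector<int64_t>& log, uint64_t commit_upto) {
    assert(commit_upto <= log.size());
    for (uint64_t i = sm.last_commit_index() + 1; i <= commit_upto; ++i) {
        sm.commit(i, log[i - 1]);
    }
}
```

Committing logs 1 to 4, crashing, and restarting leaves last_commit_index() at 3; replaying up to index 4 from the persistent log store then brings the state machine back to its pre-crash state.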

zhanglistar commented 2 years ago

In ClickHouse, there is a persistent log store implementation: https://github.com/ClickHouse/ClickHouse/blob/master/src/Coordination/KeeperLogStore.h