cojen / TuplDB

TuplDB is a high-performance, concurrent, transactional, scalable, low-level embedded database.
GNU Affero General Public License v3.0

Introduce durable commit index #78

Closed broneill closed 7 years ago

broneill commented 7 years ago

Currently, the commit index is persisted in the FileStateLog metadata file, but it is advanced before the log is durable. This improves performance, at the risk that some committed data might have to be rolled back after a restart. The bigger problem is that the database might not be able to roll back at all, having already issued a checkpoint against the non-durable commit index.

The proposed change is as follows:

  1. The commit index in the metadata file is renamed to "durable commit index", and a normal sync operation cannot advance it. A normal sync is purely local and cannot infer durable consensus.
  2. A syncCommit operation can only truly be issued by the leader, which calls a new RPC method resembling the normal append-entries method.
  3. The RPC implementation performs an append-entries operation and also syncs the local log, as per the Raft spec. A special reply is written back to the leader to indicate that the sync succeeded.
  4. The leader waits for a majority, but aborts if the term changes. This is the usual Raft consensus.
  5. The leader updates the durable commit index in the metadata file.
  6. The leader propagates the durable commit index to all peers, and also includes it with each affirmation message.
  7. Replicas, upon seeing a new durable commit index, can then update the metadata file.
  8. Replicas that call syncCommit must call the leader and wait for step 7 to occur, aborting if the term changes.
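
The leader side of steps 4 and 5 could be sketched roughly as below. This is only an illustration under stated assumptions: `SyncCommitTracker` and all of its methods and fields are hypothetical names, not TuplDB APIs, and the `durableCommitIndex` field merely stands in for the value that would be written to the metadata file.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch of how a leader might track one syncCommit round.
 * Not part of TuplDB; names and structure are assumptions for illustration.
 */
final class SyncCommitTracker {
    private final int groupSize;     // total members, including the leader
    private final long term;         // term in which the syncCommit was issued
    private final long targetIndex;  // log index the peers were asked to sync
    private final Set<Long> synced = new HashSet<>(); // members durable at targetIndex
    private long durableCommitIndex; // stand-in for the metadata file value
    private boolean aborted;

    SyncCommitTracker(int groupSize, long term, long targetIndex, long durableCommitIndex) {
        this.groupSize = groupSize;
        this.term = term;
        this.targetIndex = targetIndex;
        this.durableCommitIndex = durableCommitIndex;
    }

    /** The leader's own log sync counts toward the majority. */
    void localSynced(long leaderId) {
        record(leaderId);
    }

    /**
     * Handles the special sync reply from a peer (step 3).
     * Returns false if the round is aborted because the term changed (step 4).
     */
    boolean onSyncReply(long peerId, long replyTerm, long syncedIndex) {
        if (aborted || replyTerm != term) {
            aborted = true;
            return false;
        }
        if (syncedIndex >= targetIndex) {
            record(peerId);
        }
        return true;
    }

    private void record(long memberId) {
        if (aborted) {
            return;
        }
        synced.add(memberId);
        // Step 5: advance only once a strict majority has durably synced.
        if (synced.size() > groupSize / 2 && targetIndex > durableCommitIndex) {
            durableCommitIndex = targetIndex; // real code would persist this
        }
    }

    long durableCommitIndex() {
        return durableCommitIndex;
    }
}
```

In a three-member group, the leader's own sync plus one matching reply is enough to advance the index. A reply carrying a different term aborts the round and leaves the index where it was.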

It's not strictly required that the durable commit index be persisted in the metadata file, but it makes it possible for a restarted replica to catch up before a leader has been elected. The tail end of the log (beyond the durable commit index) still requires a leader for it to be committed.
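
To make the restart case concrete, here is a minimal sketch of how a restarted replica might split its local log. `RestartPlan` and its fields are hypothetical names, not TuplDB APIs. Clamping to the highest local index is an added assumption, covering a replica that learned of a durable commit index beyond what its own log holds.

```java
/**
 * Hypothetical sketch: decides which part of the local log a restarted
 * replica can treat as committed without waiting for a leader.
 */
final class RestartPlan {
    final long applyThrough; // highest index usable before a leader is elected
    final long pendingTail;  // number of entries that still require a leader

    private RestartPlan(long applyThrough, long pendingTail) {
        this.applyThrough = applyThrough;
        this.pendingTail = pendingTail;
    }

    static RestartPlan plan(long durableCommitIndex, long highestLocalIndex) {
        // Entries up to the persisted durable commit index are known committed.
        long applyThrough = Math.min(durableCommitIndex, highestLocalIndex);
        // Anything beyond that is the tail end of the log.
        long pendingTail = highestLocalIndex - applyThrough;
        return new RestartPlan(applyThrough, pendingTail);
    }
}
```

For instance, with a persisted durable commit index of 100 and a local log ending at 130, the replica could catch up through index 100 immediately, while the remaining 30 entries wait for a leader.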