etcd-io / raft

Raft library for maintaining a replicated state machine
Apache License 2.0
666 stars 164 forks source link

raft: use a type-safe slice of log entries #46

Closed pav-kv closed 9 months ago

pav-kv commented 1 year ago

All []pb.Entry slices that raft works with are contiguous valid log ranges, for which the following invariants hold:

Much of the code implicitly assumes this, and will behave arbitrarily if this isn't true. This PR adds a LogRange type which is a type-safe slice promising these invariants explicitly.


Additionally, entry slices are often used in conjunction with an (index, term) tuple which describes the entry that immediately precedes the slice, i.e.:

For example, the log append handling stack (starting from a MsgApp message handler all the way down to unstable and Storage) operates with such a construct, effectively a "conditional put".

We have seen incorrect (not adhering to this invariant) MsgApp being sent (#47), or received (https://github.com/etcd-io/etcd/issues/14735). A type-safe / verified entries "append" slice should reduce the chance of such bugs, or at least make them better discoverable.

This PR introduces such a type, logAppend, and refactors the raftLog type to take advantage of it. This also makes the code cleaner.

This type could evolve into "the only" way of storing and passing entry slices around. Effectively, the log (or its sub-range) is always an (index, term) indicating a compaction or other "watermark", plus a []pb.Entry suffix of currently available entries. This seems true for unstable and raftLog types. Another example is the MemoryStorage.

pav-kv commented 1 year ago

The sanity checking introduced in this PR helped to uncover an invalid MsgApp that a leader may send, in TestLearnerReceiveSnapshot:

--- FAIL: TestLearnerReceiveSnapshot (0.00s)
panic: 2: invalid MsgApp [index 11, term 11]: entry[0].Term is 1, want at least 11 [recovered]
        panic: 2: invalid MsgApp [index 11, term 11]: entry[0].Term is 1, want at least 11

Figuring out whether it's a result of the specific test setup. But either way, the leader should not send incorrect MsgApp.

pav-kv commented 1 year ago

@tbg @ahrtr PTAL, what do you think of the concept? This is inspired by https://github.com/etcd-io/etcd/issues/15595 and https://github.com/etcd-io/etcd/issues/15704 where entry slices are [suspect to be] invalid.

In the meantime I'm looking at the panic in TestLearnerReceiveSnapshot.

ahrtr commented 1 year ago

It's a good idea to me, but we should only enable the verification in test. See https://github.com/etcd-io/raft/pull/46#discussion_r1166137357

pav-kv commented 1 year ago

Fixing the test failures in #47 - it's a pre-existing bug in tests, and a lack of invariant checks that lead to an incorrect MsgApp being sent.