Closed pav-kv closed 9 months ago
The sanity checking introduced in this PR helped to uncover an invalid MsgApp
that a leader may send, in TestLearnerReceiveSnapshot
:
--- FAIL: TestLearnerReceiveSnapshot (0.00s)
panic: 2: invalid MsgApp [index 11, term 11]: entry[0].Term is 1, want at least 11 [recovered]
panic: 2: invalid MsgApp [index 11, term 11]: entry[0].Term is 1, want at least 11
Figuring out whether it's a result of the specific test setup. But either way, the leader should not send incorrect MsgApp
.
@tbg @ahrtr PTAL, what do you think of the concept? This is inspired by https://github.com/etcd-io/etcd/issues/15595 and https://github.com/etcd-io/etcd/issues/15704 where entry slices are [suspect to be] invalid.
In the meantime I'm looking at the panic in TestLearnerReceiveSnapshot
.
It's a good idea to me, but we should only enable the verification in test. See https://github.com/etcd-io/raft/pull/46#discussion_r1166137357
Fixing the test failures in #47 - it's a pre-existing bug in tests, and a lack of invariant checks that lead to an incorrect MsgApp
being sent.
All
[]pb.Entry
slices thatraft
works with are contiguous valid log ranges, for which the following invariants hold:entries[i+1].Index == entries[i].Index + 1
entries[i+1].Term >= entries[i].Term
Term > 0
andIndex > 0
; the only zero case is the "sentinel"(0, 0)
entry.Much of the code implicitly assumes this, and will behave arbitrarily if this isn't true. This PR adds a
LogRange
type which is a type-safe slice promising these invariants explicitly.Additionally, entry slices are often used in conjunction with an
(index, term)
tuple which describes the entry that immediately precedes the slice, i.e.:entries[0].Index == index + 1
entries[0].Term >= term
For example, the log append handling stack (starting from a
MsgApp
message handler all the way down tounstable
andStorage
) operates with such a construct, effectively a "conditional put".We have seen incorrect (not adhering to this invariant)
MsgApp
being sent (#47), or received (https://github.com/etcd-io/etcd/issues/14735). A type-safe / verified entries "append" slice should reduce the chance of such bugs, or at least make them better discoverable.This PR introduces such a type,
logAppend
, and refactors theraftLog
type to take advantage of it. This also makes the code cleaner.This type could evolve into "the only" way of storing and passing entry slices around. Effectively, the log (or its sub-range) is always an
(index, term)
indicating a compaction or other "watermark", plus a[]pb.Entry
suffix of currently available entries. This seems true forunstable
andraftLog
types. Another example is theMemoryStorage
.