When adding outages, I noticed that leaders weren't gracefully bringing followers up to date. There were a variety of errors, mostly index errors.
The solution was to capture index errors at the source and then report them more completely. There were two primary issues:
My reading of "If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it" was incorrect: I was comparing against the previous log entry, not the existing entry at the same index as the new one.
Log truncation should occur on heartbeats too, not just when entries are sent.
There is still the possibility that append entries requests arrive out of order; in this case I'm just dropping them for now.
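
The two fixes above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: `Entry` and `append_entries` are hypothetical names, indices are 1-based as in the Raft paper, and truncating on a mismatched previous entry when the request carries no entries is only one assumed reading of "truncation on heartbeats".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entry:
    term: int
    command: str


def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Apply one AppendEntries request to `log` in place; return success.

    Hypothetical follower-side handler. `prev_index` is 1-based, with 0
    meaning "no previous entry".
    """
    # Consistency check. It runs for heartbeats (entries == []) as well.
    if prev_index > len(log):
        return False  # follower is missing entries; leader must back up
    if prev_index > 0 and log[prev_index - 1].term != prev_term:
        # Assumption: drop the mismatched entry and everything after it,
        # even when the request is only a heartbeat.
        del log[prev_index - 1:]
        return False

    # Compare each new entry with the existing entry at the SAME index,
    # not with the previous entry.
    for offset, new in enumerate(entries):
        index = prev_index + 1 + offset  # 1-based index of this new entry
        if index <= len(log):
            if log[index - 1].term != new.term:
                del log[index - 1:]  # conflict: delete it and all that follow
                log.append(new)
            # Same index and same term: entry already present, keep it.
        else:
            log.append(new)
    return True
```

Because entries that already match are left alone, a request that only repeats entries the follower already holds does not shorten the log in this sketch. It does not attempt to detect or reorder out-of-order requests.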