Figure out why raft quorum bug wasn't detected

cole-miller commented 2 years ago

Apparently our Jepsen tests weren't able to detect the bug in our implementation of the raft quorum logic that's addressed by https://github.com/canonical/raft/pull/302. We should figure out why not, and strengthen the tests so that they successfully detect the bug.

cole-miller commented 2 years ago

It's possible that this comes down to timing. For the bug to manifest, a leader needs to replicate a log entry from its term to a majority of nodes, but then crash/go offline before communicating the new commit index to the node that will become the next leader. Could it be that that window of time is just too short for Jepsen to have a good chance of hitting it?
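To put a rough number on "too short", here is a back-of-the-envelope sketch. It is not taken from the raft code or the Jepsen harness. It only assumes that the vulnerable window recurs once per period and that each injected crash lands at an independent, uniformly random time. The function name and its parameters are hypothetical.

```python
def hit_probability(window_ms: float, period_ms: float, crashes: int) -> float:
    """Chance that at least one of `crashes` independent, uniformly timed
    crashes lands inside a window of `window_ms` that recurs every `period_ms`.

    Assumption: one vulnerable window per period; crashes do not influence
    each other or the timing of the window.
    """
    if period_ms <= 0 or not 0 <= window_ms <= period_ms:
        raise ValueError("need 0 <= window_ms <= period_ms and period_ms > 0")
    if crashes < 0:
        raise ValueError("crashes must be non-negative")
    per_crash = window_ms / period_ms  # chance a single crash hits the window
    return 1.0 - (1.0 - per_crash) ** crashes
```

Comparing, say, `hit_probability(1, 1000, 50)` with `hit_probability(10, 1000, 50)` shows how much a tenfold wider window would help. The real window length and crash count would have to come from the test logs.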

freeekanayaka commented 2 years ago

That sounds plausible to me.

MathieuBordere commented 2 years ago

> It's possible that this comes down to timing. For the bug to manifest, a leader needs to replicate a log entry from its term to a majority of nodes, but then crash/go offline before communicating the new commit index to the node that will become the next leader. Could it be that that window of time is just too short for Jepsen to have a good chance of hitting it?

We should have a higher chance of hitting it if we increase the heartbeat interval. If I'm not mistaken, the heartbeat interval is determined by the network latency setting. We could randomize the network latency in the tests to try to hit more timing-sensitive bugs.
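A minimal sketch of what the randomization could look like, assuming the latency is a single per-run value in milliseconds. The function names, the range bounds, and the choice of a log-uniform distribution are all placeholders, not something the test suite does today. Seeding the generator keeps a failing run reproducible.

```python
import math
import random

def random_latency_ms(rng: random.Random, low_ms: float = 5.0, high_ms: float = 500.0) -> float:
    """Draw one latency log-uniformly, so small and large values are both covered."""
    return math.exp(rng.uniform(math.log(low_ms), math.log(high_ms)))

def latency_schedule(seed: int, runs: int, low_ms: float = 5.0, high_ms: float = 500.0) -> list:
    """Return one latency per test run; the same seed gives the same schedule."""
    rng = random.Random(seed)
    return [round(random_latency_ms(rng, low_ms, high_ms), 1) for _ in range(runs)]
```

Logging the seed alongside each run would let us replay the exact latency that triggered a failure.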

freeekanayaka commented 2 years ago

> It's possible that this comes down to timing. For the bug to manifest, a leader needs to replicate a log entry from its term to a majority of nodes, but then crash/go offline before communicating the new commit index to the node that will become the next leader. Could it be that that window of time is just too short for Jepsen to have a good chance of hitting it?
>
> We should have a higher chance of hitting it if we increase the heartbeat interval. If I'm not mistaken, the heartbeat interval is determined by the network latency setting. We could randomize the network latency in the tests to try to hit more timing-sensitive bugs.

That sounds like a good idea, regardless of whether it helps trigger this specific bug. Rather than randomizing it, perhaps we could just set it very high (e.g. 10x the current value or more) and run all the tests with that high setting, as well as with the normal default setting of course.
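As a sketch of that idea, the run matrix could be built like this. The workload names and the default latency value in the usage line are placeholders, and `latency_matrix` is a hypothetical helper, not part of the existing test suite.

```python
from itertools import product

def latency_matrix(workloads, default_latency_ms, multipliers=(1, 10)):
    """Pair every workload with the default latency and with each scaled-up value."""
    return [
        {"workload": workload, "latency_ms": default_latency_ms * multiplier}
        for workload, multiplier in product(workloads, multipliers)
    ]

# Placeholder workloads and default latency, for illustration only.
for run in latency_matrix(["append", "bank"], default_latency_ms=5):
    print(run)
```

Adding a larger multiplier later only means extending the `multipliers` tuple.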

canonical/jepsen.dqlite, issue #28