Open jerrytesting opened 1 year ago
Besides, we also found one consistency issue about not read committed: https://github.com/jerrytesting/inconsistent-bugs-canonical-raft/tree/main/not_read_committed
Thanks for the reports! Are you able to share any information about the Jepsen setup that produced these results? We'd love to incorporate any changes into https://github.com/canonical/jepsen.dqlite that are needed to reproduce them in our CI.
Sorry that our Jepsen setup is not open-sourced now but it will be soon. Currently, I can only give you enough bug reports to help you figure out the issues.
@jerrytesting No problem, we appreciate your work on this. Does your setup include an "assertion checker", like this in jepsen.dqlite?
Yes, included the assertion checker and reported two assertion failures in ISSUE-291. We are still running to see whether we can capture other issues.
hey @jerrytesting do you have the app.log
for the nodes for not_read_committed
like you have for the other bug in the repo (there the app.log for n3
is missing though) ?
Also, which version of raft are you using? Based on the log messages,you are not on v0.17.1 which contains quite some bugfixes.
What is also strange is e.g. in not_read_uncommitted
in n1
's app.log, the node starts around log line 88690 as a single node in a cluster and it's accepting requests from Jepsen, timestamp 2023/02/13 14:58:56.631530
.
An example of an inconsistent read is key 55 around timestamp 2023-02-13 14:59:41, that's also a time during which node n1 is accepting requests as a single member and leader of a cluster, but during that time n3
is the leader of another cluster with n2
, n4
and n5
. So you have effectively 2 clusters running.
Is it possible you are not setting up the cluster correctly in your tests?
Hey, we run in v0.14.0. I think it should not be the setup issue as we use your Jepsen harness.
Will upload the full logs soon (some logs are very large, so they are missed).
Looking at the logs from a Jepsen test perspective, it's hard to make sense of what's happening. My interpretation is most likely a synthetic, i.e. test environment, failure.
From looking at the log messages, the dqlite.jepsen
tests are quite old, >= 8 months.
The test framework mediator
wrappers that were added to the core of Jepsen break nemesis behavior, impact,and intent.
jepsen.mediator.wrapper.MediatorGen
is not behaving like a generator:
invoke!
dnet
operations to mediator.wrapper
net
and some straight to Jepsen's net
:final-generator
for healingRunning the current jepsen.dqlite
with the same test opts did not reproduce the issue.
Note: Jepsen provides extension mechanisms, e.g. net
protocol, gen/map
, etc that will do what's visible in the logs without the need for a mediator
.
Yeah, there was one bug in our tool that the stop-partition nemesis was not invoked and so not executed. Thanks for finding this bug in our tool. However, the other points you said are not issues. Our tool only decides how to dispatch nemesis but the real generators are still from Jepsen.
I think this consistency violation really exists but happens in an extreme environment; for example, the network partition doesn't be healed in long time.
We found one consistency issue about not read uncomitted in the latest raft version. Here is the bug log: https://github.com/jerrytesting/inconsistent-bugs-canonical-raft/tree/main/not_read_uncommitted and hope it could help you debug.