Open joshuazh-x opened 3 months ago
Please fix the CI failures to ensure timely review.
Also it would be best to CC everyone that was involved with the previous PR
cc @serathius @ahrtr , here is the trace validation test with some additional improvements in TLA+ spec. The idea here is to set up a sample application based on raft, running with various faults injected. The produced traces will then be collected and validated by TLA+ trace validation. Just run make trace-validation
in raft root folder.
CCing everyone involved in https://github.com/etcd-io/raft/pull/113 /cc @xiang90 @pav-kv @ahrtr @zeu5 @lemmy @siyuanfoundation @MadhavJivrajani @andreo @felipecrv
Your input on this PR would be a huge help! The successful integration of TLA+ has the potential to significantly improve the reliability of Raft and etcd. I'd be grateful for any insights you can share.
@serathius: GitHub didn't allow me to request PR reviews from the following users: andreo, felipecrv, xiang90, zeu5, lemmy.
Note that only etcd-io members and repo collaborators can review this PR, and authors cannot review their own PRs.
A few changes to the TLA spec are unclear for me. I will summarise my understanding and list the concerns below.
Two new variables are added applied
and durableState.applied
. The two variables represent the state in memory and that persisted to the log. durableState.applied
is updated on PersistState
action and applied
is updated on Advance
. The rest of the changes in TLA revolve around ensuring sound updates to these two variables.
In the code, a new trace event Advance
is introduced to ensure updates to applied
variable in TLA.
applied
state tracked is in the log. log.applied
is updated. This is inconsistent with the TLA model which requires that applied
is updated only after the corresponding Ready
is advanced. Is there a violation found corresponding to this inconsistency?PersistState
in the trace logger. So the corresponding action is never executed?This is not related to the changes in this PR but unclear to me - when a node is stopped (as a failure introduced makeNodesDown
) it doesn't update any variables and is stopped. In the TLA model however, this is never tracked and updates are received to the node. So the state in the code and in the model will be out of sync? Shouldn't the model track active nodes?
Thanks @zeu5 for the summary.
Please allow me to clarify some of my thoughts when I made this change. Please correct me if I made any mistake here.
Why do we need both the variables? In the code, the only
applied
state tracked is in the log. Whenever ready is requested, thelog.applied
is updated. This is inconsistent with the TLA model which requires thatapplied
is updated only after the correspondingReady
is advanced. Is there a violation found corresponding to this inconsistency?
To my understanding, log.applied
is updated upon receiving MsgStorageApplyResp
, indicating successful applying to the state machine. log.applying
( I assume you meant this) is updated upon ready and serves as an optimization to allow issuing new ready without waiting for complete processing of the previous one (relevant with async storage writes). Since I prioritize sync storage write in this spec so log.applied
is used here for determining entries in ready to simplify the spec, Adjustments will be made when the spec covers async storage writes.
I do not see an event for
PersistState
in the trace logger. So the corresponding action is never executed?
PersistState
isn't traced as it's external to the Raft module, but it executes as part of Next
. Initially, I aimed for non-determinism (allowing PersistState
and SendMessags
anywhere in the ready phase), but for TLC efficiency, I eventually choose a constrained behavior enforces sequential execution of PersistState
and SendMessages
after Ready. So whenever a traced Ready
is processed, the behavior Ready->PersistState->Ready
will be enforced.
Shouldn't the model track active nodes?
The TLA+ spec focuses on state transitions. A non-functional node implies no state changes or message exchange, effectively modeled by behaviors excluding actions related to that node. The state space exploration will include behaviors representing such node downtime scenarios.
I'll keep each fix in a separated commit during the review. If everything goes well in the end, I'll combine them into one before final merge.
/cc @serathius @ahrtr @zeu5 @lemmy any feedbacks for this pr?
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: joshuazh-x, lemmy Once this PR has been reviewed and has the lgtm label, please assign spzala for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Please try not to submit huge PR. Please try to breakdown this PR as much as you can,
tla/example.ndjson
, pls do it in a separate PR and ensure the readme.md is also updated accordingly.
Following up on previous discussion with @serathius and @ahrtr, I make some improvements to the Raft trace validation test:
make trace-validation
command.These changes aim to streamline the trace validation process and make it more efficient to automate testing.
Please let me know if you have any questions or feedback.