Trace validation test - Githubissues

joshuazh-x commented 3 months ago

Following up on previous discussion with @serathius and @ahrtr, I make some improvements to the Raft trace validation test:

Refactored Test Case: Combine the random trace generation and trace validation script into a single test case for better readability and maintainability.
Simplified Execution: The test can now be run with a simple make trace-validation command.
Enhanced Customization: The test supports customization of the working directory, number of parallel jobs, and the amount of traces to be generated (see README file).
Other improvements: An additional bootstrap trace has been included to simplify the TLA+ specification.

These changes aim to streamline the trace validation process and make it more efficient to automate testing.

Please let me know if you have any questions or feedback.

serathius commented 2 months ago

Please fix the CI failures to ensure timely review.

Also it would be best to CC everyone that was involved with the previous PR

joshuazh-x commented 2 months ago

cc @serathius @ahrtr , here is the trace validation test with some additional improvements in TLA+ spec. The idea here is to set up a sample application based on raft, running with various faults injected. The produced traces will then be collected and validated by TLA+ trace validation. Just run make trace-validation in raft root folder.

serathius commented 2 months ago

CCing everyone involved in https://github.com/etcd-io/raft/pull/113 /cc @xiang90 @pav-kv @ahrtr @zeu5 @lemmy @siyuanfoundation @MadhavJivrajani @andreo @felipecrv

Your input on this PR would be a huge help! The successful integration of TLA+ has the potential to significantly improve the reliability of Raft and etcd. I'd be grateful for any insights you can share.

k8s-ci-robot commented 2 months ago

@serathius: GitHub didn't allow me to request PR reviews from the following users: andreo, felipecrv, xiang90, zeu5, lemmy.

Note that only etcd-io members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to [this](https://github.com/etcd-io/raft/pull/204#issuecomment-2208275865): >CCing everyone involved in https://github.com/etcd-io/raft/pull/113 >/cc @xiang90 @pav-kv @ahrtr @zeu5 @lemmy @siyuanfoundation @MadhavJivrajani @andreo @felipecrv Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

zeu5 commented 2 months ago

A few changes to the TLA spec are unclear for me. I will summarise my understanding and list the concerns below.

Summary

Two new variables are added applied and durableState.applied. The two variables represent the state in memory and that persisted to the log. durableState.applied is updated on PersistState action and applied is updated on Advance. The rest of the changes in TLA revolve around ensuring sound updates to these two variables.

In the code, a new trace event Advance is introduced to ensure updates to applied variable in TLA.

Clarifications

Why do we need both the variables? In the code, the only applied state tracked is in the log.
Whenever ready is requested, the log.applied is updated. This is inconsistent with the TLA model which requires that applied is updated only after the corresponding Ready is advanced. Is there a violation found corresponding to this inconsistency?
I do not see an event for PersistState in the trace logger. So the corresponding action is never executed?

Minor comments

This is not related to the changes in this PR but unclear to me - when a node is stopped (as a failure introduced makeNodesDown) it doesn't update any variables and is stopped. In the TLA model however, this is never tracked and updates are received to the node. So the state in the code and in the model will be out of sync? Shouldn't the model track active nodes?

joshuazh-x commented 2 months ago

Thanks @zeu5 for the summary.

Please allow me to clarify some of my thoughts when I made this change. Please correct me if I made any mistake here.

Why do we need both the variables? In the code, the only applied state tracked is in the log. Whenever ready is requested, the log.applied is updated. This is inconsistent with the TLA model which requires that applied is updated only after the corresponding Ready is advanced. Is there a violation found corresponding to this inconsistency?

To my understanding, log.applied is updated upon receiving MsgStorageApplyResp, indicating successful applying to the state machine. log.applying ( I assume you meant this) is updated upon ready and serves as an optimization to allow issuing new ready without waiting for complete processing of the previous one (relevant with async storage writes). Since I prioritize sync storage write in this spec so log.applied is used here for determining entries in ready to simplify the spec, Adjustments will be made when the spec covers async storage writes.

I do not see an event for PersistState in the trace logger. So the corresponding action is never executed?

PersistState isn't traced as it's external to the Raft module, but it executes as part of Next. Initially, I aimed for non-determinism (allowing PersistState and SendMessags anywhere in the ready phase), but for TLC efficiency, I eventually choose a constrained behavior enforces sequential execution of PersistState and SendMessages after Ready. So whenever a traced Ready is processed, the behavior Ready->PersistState->Ready will be enforced.

Shouldn't the model track active nodes?

The TLA+ spec focuses on state transitions. A non-functional node implies no state changes or message exchange, effectively modeled by behaviors excluding actions related to that node. The state space exploration will include behaviors representing such node downtime scenarios.

joshuazh-x commented 2 months ago

I'll keep each fix in a separated commit during the review. If everything goes well in the end, I'll combine them into one before final merge.

joshuazh-x commented 1 month ago

/cc @serathius @ahrtr @zeu5 @lemmy any feedbacks for this pr?

k8s-ci-robot commented 1 month ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: joshuazh-x, lemmy Once this PR has been reviewed and has the lgtm label, please assign spzala for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files: - **[OWNERS](https://github.com/etcd-io/raft/blob/main/OWNERS)** Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment

ahrtr commented 1 month ago

Please try not to submit huge PR. Please try to breakdown this PR as much as you can,

I see you removed the huge file tla/example.ndjson, pls do it in a separate PR and ensure the readme.md is also updated accordingly.
get the TLA+ changes in a separate PR, and leave it to TLA+ experts to review it.
get the test included in a separate PR. I think that it makes more sense to integrate the trace test with etcd's robustness test. We can discuss that after above two items are resolved.

etcd-io / raft

Trace validation test #204

Summary

Clarifications

Minor comments