Closed: serathius closed this issue 1 year ago.
If no one takes this, I would like to help (trying to catch up with the recent changes). Thanks!
Thanks @fuweid.
Recently there have been big and fast changes (too many PRs) to the robustness test. I don't think it's a good thing to make too many big changes too fast, even if it's just test code. Since almost every PR is huge (e.g. > 300 lines of change), it's really hard to carefully review each line of change.
Fast is definitely not equal to good, especially for etcd, which is about 10 years old.
I propose (not limited to the robustness test):
- Slow down the pace. Can we submit at most one PR at a time per code area per person?
- Please try to deliver small PRs as much as possible.
- Fixing issues (both product issues and test issues) should always be the top priority by default.
Agreed. Sometimes I have seen PRs reviewed and merged over the weekend, so other etcd members did not get a chance to review, reason about, understand, or catch up on the changes.
To share my experience: for example, https://github.com/etcd-io/etcd/pull/15888 was raised at 3 AM early Saturday morning and merged at 3 AM early Monday morning (PDT). I spent an hour reviewing it on Sunday but could not finish every line of the changes; I wanted to review it a second time on Monday, but it was already merged.
Edit: Great appreciation to @serathius for driving this. In our production environment, we have not seen any data inconsistency issues in etcd v3.5 for 4 months now.
Thanks for suggesting a slightly slower pace, @ahrtr. I have been spending time every day trying to look at all the PRs, and I have found it tough to keep up and review as thoroughly as I would like. I have been thinking along the same lines as your suggestions but didn't say anything because I figured it was just because I am newer to the project.
It's awesome that the robustness suite is getting so much improvement, though, and huge kudos to @serathius for all the work. Maybe just a little slower, or leave some PRs as drafts or open longer before they merge 🙏🏻
> Thanks @fuweid.
> Recently there have been big and fast changes (too many PRs) to the robustness test. I don't think it's a good thing to make too many big changes too fast, even if it's just test code. Since almost every PR is huge (e.g. > 300 lines of change), it's really hard to carefully review each line of change.
> Fast is definitely not equal to good, especially for etcd, which is about 10 years old.
> I propose (not limited to the robustness test):
> - Slow down the pace. Can we submit at most one PR at a time per code area per person?
> - Please try to deliver small PRs as much as possible.
> - Fixing issues (both product issues and test issues) should always be the top priority by default.
I think you overstate the problem of fast development impacting flakiness. Having failures in these runs is not ideal and they should be fixed, but I would not expect your suggestions to help in this case.
The robustness nightly run executes a set of 10-11 tests over 400 times. A failure of any one of those tests is a failure of the whole run. That is a flakiness rate of less than 0.5%, which is much, much better than any other infrastructure test I have seen. Flakiness at this level is not expected to be detected during development, which means that from time to time these tests will start failing because we didn't catch something before a merge.
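The arithmetic behind that flakiness estimate can be sketched quickly. The 11 tests x 400 runs figure comes from the comment above; the per-execution flake rates in the loop are assumed purely for illustration:

```go
package main

import (
	"fmt"
	"math"
)

// runFailureProb returns the probability that at least one of n
// independent test executions flakes, given a per-execution flake rate p.
func runFailureProb(p float64, n int) float64 {
	return 1 - math.Pow(1-p, float64(n))
}

func main() {
	executions := 11 * 400 // ~11 scenarios, each executed 400 times per nightly run
	for _, p := range []float64{0.005, 0.0005, 0.00005} {
		fmt.Printf("per-execution flake rate %.5f -> nightly run fails with probability %.2f\n",
			p, runFailureProb(p, executions))
	}
}
```

Even a 0.05% per-execution flake rate makes roughly one in five nightly runs red, which is why occasional nightly failures are expected rather than a sign of careless merges.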
As for coding velocity, sorry if I merged things too fast. I treat approval as approval to merge; if you think a PR should be split into smaller ones, or you want a second opinion from other contributors, please don't approve immediately. The latest velocity was motivated by preparation for a demo at the SIG-apimachinery community meeting today. You are welcome to join.
I think I have merged the majority of the code I needed for the fake, so contributions will cool down.
> which means that from time to time these tests will start failing because we didn't catch something before a merge.

Not to blame anyone. We should only try to improve the process based on any lessons learned.
> I treat approval as approval to merge

I think we should consider revoking the approval (similar to K8s automatically removing the lgtm and approve labels) if the author force-pushes commits after the PR has been approved. We should also try to avoid approving a PR with something like "LGTM with minor comments" (I did this sometimes previously).
> We should only try to improve the process based on any lessons learned.

To have lessons learned, you first need to find and understand the root cause. As I explained, failed robustness tests are not something that can be easily avoided just by contributing "more carefully".
> Please try to break big PRs down into small ones as much as possible. I have said this multiple times.

The problem is that you are saying it but not requiring it for approval. Please consider approving only changes that are not only correct, but whose approach (the "how") is also correct.
> I think we should consider revoking the approval (similar to K8s automatically removing the lgtm and approve labels) if the author force-pushes commits after the PR has been approved. We should also try to avoid approving a PR with something like "LGTM with minor comments" (I did this sometimes previously).

I don't agree with that. "LGTM with minor comments" is a common courtesy showing that you trust the person to address the comments you left. If you don't trust the person, or the comments are not minor, please don't do that.
> As I explained, failed robustness tests are not something that can be easily avoided just by contributing "more carefully".

Sorry for the confusion. What I am talking about isn't specific to the flakiness in this issue. Instead, it's based on recent experiences and thoughts from reviewing PRs.
"LGTM with minor comments" is a common courtesy that you trust the person to address comments you left. If you don't trust the person or the comments are not minor, please don't do that.
It's exactly what I advise against doing. When we have more members, doing things based on agreement & rules is better than personal trust, especially to a critical project (etcd).
I might spend some time drafting a doc, something like pr_review_guidance.md under contributor-guide, but it might take some time. If anyone wants to deliver a draft PR, please feel free to do it.
> I might spend some time drafting a doc, something like pr_review_guidance.md under contributor-guide, but it might take some time. If anyone wants to deliver a draft PR, please feel free to do it.

Sounds great. I want to make sure that everyone feels empowered to point out mistakes and improvements, even in maintainers' PRs. We are only human and we will make mistakes. We need the help of the whole community to make etcd the best it can be.
As this discussion has really drifted off the original topic, I propose waiting until @ahrtr or anyone else creates a PR with pr_review_guidance.md.
@fuweid any progress?
Found one issue, https://github.com/etcd-io/etcd/pull/16044, that caused a failure in https://github.com/etcd-io/etcd/actions/runs/5220401308/jobs/9423421119.
Hi @serathius, I was working on a PoC for bumping the grpc-gateway version. It's nearly done. I will keep more eyes on this one this week!
From https://github.com/etcd-io/etcd/actions/runs/5241882575/jobs/9464559248 it looks like an issue with the blackhole failpoint timing out.
2023-06-12T09:36:29.0923710Z failpoints.go:483: clusterRevision: 5186, blackholedMemberRevision: 669
2023-06-12T09:36:29.0924880Z /tmp/etcd-release-3.5-failpoints/bin/etcd (TestRobustnessLowTrafficClusterOfSize3-test-2) (11511): {"level":"warn","ts":"2023-06-12T09:36:29.036263Z","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":62074779906937229,"retry-timeout":"500ms"}
2023-06-12T09:36:29.0926336Z /tmp/etcd-release-3.5-failpoints/bin/etcd (TestRobustnessLowTrafficClusterOfSize3-test-1) (11521): {"level":"info","ts":"2023-06-12T09:36:29.055764Z","caller":"etcdserver/server.go:1395","msg":"triggering snapshot","local-member-id":"ac4ec652f10e5b49","local-member-applied-index":5959,"local-member-snapshot-index":5858,"local-member-snapshot-count":100}
2023-06-12T09:36:29.0927845Z /tmp/etcd-release-3.5-failpoints/bin/etcd (TestRobustnessLowTrafficClusterOfSize3-test-0) (11512): {"level":"info","ts":"2023-06-12T09:36:29.056137Z","caller":"etcdserver/server.go:1395","msg":"triggering snapshot","local-member-id":"eabdbb777cf498cb","local-member-applied-index":5959,"local-member-snapshot-index":5858,"local-member-snapshot-count":100}
2023-06-12T09:36:29.0929069Z /tmp/etcd-release-3.5-failpoints/bin/etcd (TestRobustnessLowTrafficClusterOfSize3-test-0) (11512): {"level":"info","ts":"2023-06-12T09:36:29.057707Z","caller":"etcdserver/server.go:2413","msg":"saved snapshot","snapshot-index":5959}
2023-06-12T09:36:29.0930219Z /tmp/etcd-release-3.5-failpoints/bin/etcd (TestRobustnessLowTrafficClusterOfSize3-test-0) (11512): {"level":"info","ts":"2023-06-12T09:36:29.057874Z","caller":"etcdserver/server.go:2443","msg":"compacted Raft logs","compact-index":959}
2023-06-12T09:36:29.0931349Z /tmp/etcd-release-3.5-failpoints/bin/etcd (TestRobustnessLowTrafficClusterOfSize3-test-1) (11521): {"level":"info","ts":"2023-06-12T09:36:29.058236Z","caller":"etcdserver/server.go:2413","msg":"saved snapshot","snapshot-index":5959}
2023-06-12T09:36:29.0932417Z /tmp/etcd-release-3.5-failpoints/bin/etcd (TestRobustnessLowTrafficClusterOfSize3-test-1) (11521): {"level":"info","ts":"2023-06-12T09:36:29.05838Z","caller":"etcdserver/server.go:2443","msg":"compacted Raft logs","compact-index":959}
2023-06-12T09:36:29.0933731Z /tmp/etcd-release-3.5-failpoints/bin/etcd (TestRobustnessLowTrafficClusterOfSize3-test-2) (11511): {"level":"warn","ts":"2023-06-12T09:36:29.091216Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"127.0.0.1:50466","server-name":"","error":"read tcp 127.0.0.1:20011->127.0.0.1:50466: i/o timeout"}
2023-06-12T09:36:29.0935174Z /tmp/etcd-release-3.5-failpoints/bin/etcd (TestRobustnessLowTrafficClusterOfSize3-test-1) (11521): {"level":"warn","ts":"2023-06-12T09:36:29.091374Z","caller":"etcdserver/cluster_util.go:288","msg":"failed to reach the peer URL","address":"https://localhost:20013/version","remote-member-id":"bf19ae4419db00dc","error":"Get \"https://localhost:20013/version\": EOF"}
2023-06-12T09:36:29.1351361Z /tmp/etcd-release-3.5-failpoints/bin/etcd (TestRobustnessLowTrafficClusterOfSize3-test-1) (11521): {"level":"warn","ts":"2023-06-12T09:36:29.0914Z","caller":"etcdserver/cluster_util.go:155","msg":"failed to get version","remote-member-id":"bf19ae4419db00dc","error":"Get \"https://localhost:20013/version\": EOF"}
2023-06-12T09:36:29.1353672Z failpoints.go:427: Traffic restored from and to member "TestRobustnessLowTrafficClusterOfSize3-test-2"
2023-06-12T09:36:29.1354729Z logger.go:130: 2023-06-12T09:36:29.106Z INFO gofailpoint trigger failed {"failpoint": "raftBeforeSaveSnap", "error": "context deadline exceeded"}
2023-06-12T09:36:29.1356152Z logger.go:130: 2023-06-12T09:36:29.106Z INFO Waiting for member to exit {"member": "TestRobustnessLowTrafficClusterOfSize3-test-2"}
2023-06-12T09:36:29.1357263Z logger.go:130: 2023-06-12T09:36:29.106Z INFO Member didn't exit as expected {"member": "TestRobustnessLowTrafficClusterOfSize3-test-2", "error": "context deadline exceeded"}
2023-06-12T09:36:29.1358513Z failpoints.go:109: Triggering failpoints timed out, err: context deadline exceeded
So there are 2 problems:
Ping on this. It's hard to develop robustness tests if we don't get any feedback from nightly tests.
Example of a failpoint failure not being handled properly:
2023-06-15T16:05:23.4169941Z logger.go:130: 2023-06-15T16:05:23.347Z INFO Triggering failpoint {"failpoint": "commitBeforePreCommitHook"}
2023-06-15T16:05:23.4171482Z logger.go:130: 2023-06-15T16:05:23.347Z INFO Setting up gofailpoint {"failpoint": "commitBeforePreCommitHook"}
2023-06-15T16:05:23.4173159Z logger.go:130: 2023-06-15T16:05:23.358Z INFO goFailpoint setup failed {"failpoint": "commitBeforePreCommitHook", "error": "Put \"http://127.0.0.1:32381/commitBeforePreCommitHook\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
Linearization timeout: https://github.com/etcd-io/etcd/actions/runs/5293465188?pr=16091. Looking at the operation report, it seems that history patching didn't clear up many concurrent TXN failures in Kubernetes. It would be worth investigating whether this is an issue with Kubernetes transaction patching.
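To make the "history patching" remark concrete, here is a conceptual sketch of what patching does; the `op` type and `patchHistory` helper are hypothetical, not the actual robustness-test code. Failed operations whose effect can be proven (for example, their unique value was seen in watch events) are kept, and unprovable ones are discarded so the linearizability checker does not have to explore both possible outcomes of every ambiguous request, which is what blows up checking time when many concurrent TXNs fail:

```go
package main

import "fmt"

// Hypothetical, simplified types; etcd's real robustness-test model is richer.
type op struct {
	id        int
	failed    bool   // request returned an ambiguous error (e.g. timeout)
	uniqueVal string // value unique to this operation
}

// patchHistory keeps every successful op, and keeps a failed op only if
// its unique value was observed (proof the write was persisted). Failed
// ops with no such proof are dropped from the checked history.
func patchHistory(ops []op, observed map[string]bool) []op {
	patched := ops[:0:0]
	for _, o := range ops {
		if !o.failed || observed[o.uniqueVal] {
			patched = append(patched, o)
		}
	}
	return patched
}

func main() {
	ops := []op{
		{id: 1, failed: false, uniqueVal: "a"},
		{id: 2, failed: true, uniqueVal: "b"}, // timed out, but "b" was observed via watch
		{id: 3, failed: true, uniqueVal: "c"}, // timed out, never observed -> drop
	}
	observed := map[string]bool{"a": true, "b": true}
	fmt.Println("ops after patching:", len(patchHistory(ops, observed)))
}
```

If patching fails to resolve many concurrent TXNs this way, the checker is left with an exponential number of candidate histories, matching the timeout symptom described above.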
I expect the high ratio of delete requests is causing the Kubernetes tests to time out. https://github.com/etcd-io/etcd/pull/16115
With 3 sequential successes, I'm closing the issue.
Thanks @fuweid for the help! Awesome find of the expect bug.
Which GitHub workflows are flaking?
High, as we have at least 1 scenario failure per run; however, it's not consistent.
Which tests are flaking?
Looks like linearizability times out. I expect it's due to one of my recent changes; however, I wanted to put the issue up for grabs as it's not super critical to fix. It still doesn't impact presubmit tests.
Github Action link
No response
Reason for failure (if possible)
No response
Anything else we need to know?
No response