etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

High flakiness of Robustness nightly #15913

Closed: serathius closed this issue 1 year ago

serathius commented 1 year ago

Which github workflows are flaking?


Flakiness is high, as we have at least 1 scenario failure per run; however, it's not consistent.


Which tests are flaking?

Looks like the linearizability verification times out. I expect it's due to one of my recent changes, however I wanted to put the issue up for grabs as it's not super critical to fix. It still doesn't impact presubmit tests.

Github Action link

No response

Reason for failure (if possible)

No response

Anything else we need to know?

No response

fuweid commented 1 year ago

If no one takes this, I would like to help (trying to catch up with the recent changes). Thanks

ahrtr commented 1 year ago

Thanks @fuweid.

Recently there have been big & fast changes (too many PRs) to the robustness test. I don't think it's a good thing to make too many big changes too fast, even if it's just test code. Since almost every PR is huge (e.g. > 300 lines of change), it's really hard to carefully review each line of change.

Fast definitely does not equal good, especially for etcd, which is about 10 years old.

I propose (not limited to the robustness test):

  • Slow down the pace. Can we submit at most one PR at a time per code area per person?
  • Please try to deliver small PRs as much as possible.
  • Fixing issues (including product issues and test issues) should always be the top priority by default.

chaochn47 commented 1 year ago

Agreed. Sometimes I've seen PRs reviewed and merged over the weekend, so other etcd members did not get a chance to review, reason about, understand, or catch up on the changes.

To share my experience: for example, https://github.com/etcd-io/etcd/pull/15888 was raised on Saturday at 3 AM and got merged on Monday at 3 AM (PDT time zone). I spent 1 hour reviewing it on Sunday but could not finish every line of the changes, and I wanted to review it a second time on Monday, but it was already merged.

Edit: Great appreciation to @serathius for driving this. In our production environment, we have not seen any data inconsistency issues in etcd v3.5 for 4 months now.

jmhbnz commented 1 year ago

Thanks for suggesting a little slower pace @ahrtr. I have been spending time every day trying to look at all the PRs, and I have found it tough to keep up and review as thoroughly as I would like to. I have been thinking along the same lines as your suggestions but didn't say anything because I figured it was just because I am newer to the project.

It's awesome that the robustness suite is getting so much improvement though, and huge kudos to @serathius for all the work. Maybe just a little slower, or leave some PRs as drafts or open longer before they merge 🙏🏻

serathius commented 1 year ago

Thanks @fuweid.

Recently there have been big & fast changes (too many PRs) to the robustness test. I don't think it's a good thing to make too many big changes too fast, even if it's just test code. Since almost every PR is huge (e.g. > 300 lines of change), it's really hard to carefully review each line of change.

Fast definitely does not equal good, especially for etcd, which is about 10 years old.

I propose (not limited to the robustness test):

  • Slow down the pace. Can we submit at most one PR at a time per code area per person?
  • Please try to deliver small PRs as much as possible.
  • Fixing issues (including product issues and test issues) should always be the top priority by default.

I think you overstate the impact of fast development on flakiness. Having failures on these runs is not ideal and should be fixed, but I would not expect your suggestions to help in this case.

The robustness nightly runs a set of 10-11 tests over 400 times. A failure of any of those tests is a failure of the whole run. This is flakiness of less than 0.5%, which is much, much better than any other infrastructure test I have seen. This level of flakiness is not expected to be detected during development, which means that from time to time those tests will start failing because we didn't catch something before a merge.
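
To put rough numbers on that (a back-of-the-envelope illustration; the ~400 executions and the flake rates below are assumptions, not measured CI figures), even a tiny per-test flake probability already makes a fully green nightly run unlikely:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Assumption for illustration: a nightly run executes roughly 400
	// robustness test instances, and the whole run is red if any one fails.
	const executions = 400
	for _, p := range []float64{0.001, 0.0025, 0.005} {
		// Probability that at least one of the executions flakes.
		runFails := 1 - math.Pow(1-p, executions)
		fmt.Printf("per-test flakiness %.2f%% -> nightly run red %.0f%% of the time\n",
			p*100, runFails*100)
	}
}
```

So an occasional red nightly run is the expected way flakiness at this level surfaces, rather than something stricter pre-merge review would reliably prevent.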

As for coding velocity, sorry if I merged things too fast. I treat approval as approval to merge; if you think a PR should be split into smaller ones, or you want a second opinion from other contributors, please don't approve immediately. The latest velocity was motivated by preparation for a demo at the SIG-apimachinery community meeting today. You are welcome to join.

I think I merged the majority of the code I needed for the fake, so the contributions will cool down.

ahrtr commented 1 year ago

which means that from time to time those tests will start failing because we didn't catch something before a merge.

Not to blame anyone. We should just try to improve the process based on any lessons learned.

I treat approval as approval to merge

I think we should consider revoking the approval (similar to K8s automatically removing the lgtm and approve labels) if the author force-pushes commits after the PR has been approved. We should also try to avoid approving a PR with something like "LGTM with minor comments" (I did this sometimes previously).

serathius commented 1 year ago

We should just try to improve the process based on any lessons learned.

To have lessons learned, you first need to find and understand the root cause. As I explained, failed robustness tests are not something that can be easily avoided just by contributing "more carefully".

Please try to break down big PRs into small ones as much as possible. I said this multiple times.

The problem is that you are saying it but not requiring it for approval. Please consider approving only changes that are not only correct, but where how they do it is also correct.

I think we should consider revoking the approval (similar to K8s automatically removing the lgtm and approve labels) if the author force-pushes commits after the PR has been approved. We should also try to avoid approving a PR with something like "LGTM with minor comments" (I did this sometimes previously).

I don't agree with that. "LGTM with minor comments" is a common courtesy showing that you trust the person to address the comments you left. If you don't trust the person, or the comments are not minor, don't do it.

ahrtr commented 1 year ago

As I explained, failed robustness tests are not something that can be easily avoided just by contributing "more carefully".

Sorry for the confusion. What I am talking about isn't specific to the flakiness in this issue. Instead, it's based on recent experiences & thoughts on reviewing PRs.

"LGTM with minor comments" is a common courtesy showing that you trust the person to address the comments you left. If you don't trust the person, or the comments are not minor, don't do it.

That's exactly what I advise against doing. As we get more members, doing things based on agreement & rules is better than relying on personal trust, especially for a critical project like etcd.

I might spend some time drafting a doc, something like pr_review_guidance.md under contributor-guide, but it might take some time. If anyone wants to deliver a draft PR, please feel free to do it.

serathius commented 1 year ago

I might spend some time drafting a doc, something like pr_review_guidance.md under contributor-guide, but it might take some time. If anyone wants to deliver a draft PR, please feel free to do it.

Sounds great. I want to make sure that everyone feels empowered to point out mistakes and improvements, even in maintainers' PRs. We are just human and we will make mistakes. We need the help of the whole community to make etcd the best it can be.

As this discussion has really drifted off the original topic, I propose we wait until @ahrtr or anyone else creates a PR with pr_review_guidance.md.

serathius commented 1 year ago

@fuweid any progress?

serathius commented 1 year ago

Found one issue (https://github.com/etcd-io/etcd/pull/16044) that caused the failure in https://github.com/etcd-io/etcd/actions/runs/5220401308/jobs/9423421119

fuweid commented 1 year ago

Hi @serathius, I was working on a PoC for bumping the grpc-gateway version. It's nearly done. Will keep more eyes on this one this week!

serathius commented 1 year ago

From https://github.com/etcd-io/etcd/actions/runs/5241882575/jobs/9464559248 it looks like an issue with the blackhole failpoint timing out.

2023-06-12T09:36:29.0923710Z     failpoints.go:483: clusterRevision: 5186, blackholedMemberRevision: 669
2023-06-12T09:36:29.0924880Z /tmp/etcd-release-3.5-failpoints/bin/etcd (TestRobustnessLowTrafficClusterOfSize3-test-2) (11511): {"level":"warn","ts":"2023-06-12T09:36:29.036263Z","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":62074779906937229,"retry-timeout":"500ms"}
2023-06-12T09:36:29.0926336Z /tmp/etcd-release-3.5-failpoints/bin/etcd (TestRobustnessLowTrafficClusterOfSize3-test-1) (11521): {"level":"info","ts":"2023-06-12T09:36:29.055764Z","caller":"etcdserver/server.go:1395","msg":"triggering snapshot","local-member-id":"ac4ec652f10e5b49","local-member-applied-index":5959,"local-member-snapshot-index":5858,"local-member-snapshot-count":100}
2023-06-12T09:36:29.0927845Z /tmp/etcd-release-3.5-failpoints/bin/etcd (TestRobustnessLowTrafficClusterOfSize3-test-0) (11512): {"level":"info","ts":"2023-06-12T09:36:29.056137Z","caller":"etcdserver/server.go:1395","msg":"triggering snapshot","local-member-id":"eabdbb777cf498cb","local-member-applied-index":5959,"local-member-snapshot-index":5858,"local-member-snapshot-count":100}
2023-06-12T09:36:29.0929069Z /tmp/etcd-release-3.5-failpoints/bin/etcd (TestRobustnessLowTrafficClusterOfSize3-test-0) (11512): {"level":"info","ts":"2023-06-12T09:36:29.057707Z","caller":"etcdserver/server.go:2413","msg":"saved snapshot","snapshot-index":5959}
2023-06-12T09:36:29.0930219Z /tmp/etcd-release-3.5-failpoints/bin/etcd (TestRobustnessLowTrafficClusterOfSize3-test-0) (11512): {"level":"info","ts":"2023-06-12T09:36:29.057874Z","caller":"etcdserver/server.go:2443","msg":"compacted Raft logs","compact-index":959}
2023-06-12T09:36:29.0931349Z /tmp/etcd-release-3.5-failpoints/bin/etcd (TestRobustnessLowTrafficClusterOfSize3-test-1) (11521): {"level":"info","ts":"2023-06-12T09:36:29.058236Z","caller":"etcdserver/server.go:2413","msg":"saved snapshot","snapshot-index":5959}
2023-06-12T09:36:29.0932417Z /tmp/etcd-release-3.5-failpoints/bin/etcd (TestRobustnessLowTrafficClusterOfSize3-test-1) (11521): {"level":"info","ts":"2023-06-12T09:36:29.05838Z","caller":"etcdserver/server.go:2443","msg":"compacted Raft logs","compact-index":959}
2023-06-12T09:36:29.0933731Z /tmp/etcd-release-3.5-failpoints/bin/etcd (TestRobustnessLowTrafficClusterOfSize3-test-2) (11511): {"level":"warn","ts":"2023-06-12T09:36:29.091216Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"127.0.0.1:50466","server-name":"","error":"read tcp 127.0.0.1:20011->127.0.0.1:50466: i/o timeout"}
2023-06-12T09:36:29.0935174Z /tmp/etcd-release-3.5-failpoints/bin/etcd (TestRobustnessLowTrafficClusterOfSize3-test-1) (11521): {"level":"warn","ts":"2023-06-12T09:36:29.091374Z","caller":"etcdserver/cluster_util.go:288","msg":"failed to reach the peer URL","address":"https://localhost:20013/version","remote-member-id":"bf19ae4419db00dc","error":"Get \"https://localhost:20013/version\": EOF"}
2023-06-12T09:36:29.1351361Z /tmp/etcd-release-3.5-failpoints/bin/etcd (TestRobustnessLowTrafficClusterOfSize3-test-1) (11521): {"level":"warn","ts":"2023-06-12T09:36:29.0914Z","caller":"etcdserver/cluster_util.go:155","msg":"failed to get version","remote-member-id":"bf19ae4419db00dc","error":"Get \"https://localhost:20013/version\": EOF"}
2023-06-12T09:36:29.1353672Z     failpoints.go:427: Traffic restored from and to member "TestRobustnessLowTrafficClusterOfSize3-test-2"
2023-06-12T09:36:29.1354729Z     logger.go:130: 2023-06-12T09:36:29.106Z    INFO    gofailpoint trigger failed  {"failpoint": "raftBeforeSaveSnap", "error": "context deadline exceeded"}
2023-06-12T09:36:29.1356152Z     logger.go:130: 2023-06-12T09:36:29.106Z    INFO    Waiting for member to exit  {"member": "TestRobustnessLowTrafficClusterOfSize3-test-2"}
2023-06-12T09:36:29.1357263Z     logger.go:130: 2023-06-12T09:36:29.106Z    INFO    Member didn't exit as expected  {"member": "TestRobustnessLowTrafficClusterOfSize3-test-2", "error": "context deadline exceeded"}
2023-06-12T09:36:29.1358513Z     failpoints.go:109: Triggering failpoints timed out, err: context deadline exceeded
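
As I read the failpoints.go:483 line above, the blackhole failpoint waits for the isolated member's revision to lag well behind the cluster before traffic is restored. A minimal sketch of such a wait loop (hypothetical helper names and an illustrative threshold, not the actual robustness framework code):

```go
package failpoints

import (
	"context"
	"time"
)

// clusterRevision and blackholedRevision are hypothetical stand-ins for the
// framework helpers that read the latest revision from the healthy members
// and from the blackholed member respectively.
func clusterRevision(ctx context.Context) (int64, error)    { return 0, nil }
func blackholedRevision(ctx context.Context) (int64, error) { return 0, nil }

// waitUntilLagging polls until the blackholed member has fallen far enough
// behind, mirroring the "clusterRevision: 5186, blackholedMemberRevision: 669"
// log line. The 1000-revision threshold is illustrative only.
func waitUntilLagging(ctx context.Context) error {
	for {
		cr, err := clusterRevision(ctx)
		if err != nil {
			return err
		}
		br, err := blackholedRevision(ctx)
		if err != nil {
			return err
		}
		if cr-br > 1000 {
			return nil // safe to restore traffic
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // shows up as "context deadline exceeded" in the test log
		case <-time.After(100 * time.Millisecond):
		}
	}
}
```

In this run the gap did open and traffic was restored; the timeout came afterwards, when the raftBeforeSaveSnap failpoint never triggered and the member did not exit as expected.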

So there are 2 problems:

serathius commented 1 year ago

Ping on this. It's hard to develop robustness tests if we don't get any feedback from the nightly tests.

serathius commented 1 year ago

Sent https://github.com/etcd-io/etcd/pull/16085

serathius commented 1 year ago

Example of a failpoint failure not being handled properly:

2023-06-15T16:05:23.4169941Z     logger.go:130: 2023-06-15T16:05:23.347Z    INFO    Triggering failpoint
2023-06-15T16:05:23.4170427Z            {"failpoint": "commitBeforePreCommitHook"}
2023-06-15T16:05:23.4171482Z     logger.go:130: 2023-06-15T16:05:23.347Z    INFO    Setting up gofailpoint  {"failpoint": "commitBeforePreCommitHook"}
2023-06-15T16:05:23.4173159Z     logger.go:130: 2023-06-15T16:05:23.358Z    INFO    goFailpoint setup failed    {"failpoint": "commitBeforePreCommitHook", "error": "Put \"http://127.0.0.1:32381/commitBeforePreCommitHook\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
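
For background, gofail failpoints are armed over HTTP: the framework PUTs the desired action to the member's gofail endpoint, which is the Put "http://127.0.0.1:32381/commitBeforePreCommitHook" request that timed out above. A minimal sketch of such a call (the timeout, status handling, and error propagation are illustrative assumptions, not the framework's actual code):

```go
package failpoints

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
	"time"
)

// setupFailpoint arms a gofail failpoint by PUT-ing the desired action
// (e.g. "panic") to the member's gofail HTTP endpoint. The point is that
// a setup error like the "context deadline exceeded" above should fail
// the scenario (or be retried), not be silently ignored.
func setupFailpoint(ctx context.Context, endpoint, name, action string) error {
	ctx, cancel := context.WithTimeout(ctx, time.Second)
	defer cancel()
	url := fmt.Sprintf("http://%s/%s", endpoint, name)
	req, err := http.NewRequestWithContext(ctx, http.MethodPut, url, bytes.NewBufferString(action))
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return fmt.Errorf("gofail setup of %q failed: %w", name, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("gofail setup of %q failed: unexpected status %q", name, resp.Status)
	}
	return nil
}
```

Called as, e.g., setupFailpoint(ctx, "127.0.0.1:32381", "commitBeforePreCommitHook", "panic"); the names and action here are just taken from the log for illustration.
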
serathius commented 1 year ago

Linearization timeout: https://github.com/etcd-io/etcd/actions/runs/5293465188?pr=16091. Looking at the operation report, it seems that history patching didn't clear up many concurrent TXN failures in Kubernetes. It would be worth investigating whether this is an issue with Kubernetes transaction patching.

serathius commented 1 year ago

I expect the high ratio of delete requests is causing the Kubernetes tests to time out. https://github.com/etcd-io/etcd/pull/16115

serathius commented 1 year ago

With 3 sequential successes I'm closing the issue.

Thanks @fuweid for the help! Awesome find of the expect bug.