Deflake etcd tests - Githubissues

etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system

https://etcd.io

Apache License 2.0

Deflake etcd tests #13167

Open serathius opened 3 years ago

serathius commented 3 years ago

Looking at test results for commits on the main branch since we migrated to GitHub Actions, we get:

7 out of 32 failures on 1st page
14 out of 31 failures on 2nd page
14 out of 33 failures on 3rd page
15 out of 29 failures on 4th page
13 out of 25 failures on 5th page
13 out of 22 failures on 6th page
19 out of 33 failures on 7th page

Here failure/success is based on the green check vs. red cross under the commit message (commits without either were not tested, as they were part of a multi-commit PR).

Those are all test failures on the main branch, i.e. after a PR passed tests and was approved. We can use those failures to estimate the chance of any PR failing tests purely due to test flakiness.

(7 + 14 + 14 + 15 + 13 + 13 + 19) / (32 + 31 + 33 + 29 + 25 + 22 + 33) = 95 / 205 ≈ 46%

A flakiness ratio close to 50% means that the average PR needs to be run about 2 times, but sequences of failures can be much longer; 3-5 failures in a row are not uncommon. This can be frustrating, especially for new contributors, as there is no easy way to retrigger tests (you need to amend the commit with no changes and push).
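The estimate can be reproduced from the seven per-page counts exactly as listed above. The sketch below is only illustrative: it assumes every run fails independently with the same probability, which real flakes may not do, and the helper name `prob_failures_in_a_row` is made up for this example.

```python
# (failures, total) per page, copied from the list above.
pages = [(7, 32), (14, 31), (14, 33), (15, 29), (13, 25), (13, 22), (19, 33)]

failures = sum(f for f, _ in pages)
total = sum(t for _, t in pages)
flake_rate = failures / total

# Assumption: each run fails independently with probability flake_rate.
# The number of runs until the first pass is then geometric, with mean 1 / (1 - p).
expected_runs = 1 / (1 - flake_rate)


def prob_failures_in_a_row(p: float, k: int) -> float:
    """Probability that k consecutive runs all fail, under the same assumption."""
    return p ** k


print(f"flake rate: {failures}/{total} = {flake_rate:.1%}")
print(f"expected runs until a pass: {expected_runs:.2f}")
for k in (3, 5):
    print(f"P({k} failures in a row) = {prob_failures_in_a_row(flake_rate, k):.1%}")
```

Under that independence assumption, roughly one PR in ten would see three failures in a row, which matches the experience that long failure streaks are not rare.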

Proposal

The etcd community should settle on a test flakiness target, measure it, and establish a process for fixing flaky tests.

To start, I would propose targeting a 10% failure rate for the whole test suite. It should be reachable by fixing only a couple of tests, as the latest runs gave us 22% (7 out of the last 32). Measuring flakiness could start with something simple, for example running a script once a week that checks the last 100 test results. If the measured flakiness is over our target, we should identify the most flaky tests, create issues for them, and encourage the community to fix them.

For the first couple of runs we could execute the scripts manually, but we should plan to automate them.
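As a rough illustration of the weekly check described above, here is a minimal sketch. The input format (a list of "success"/"failure" strings, newest first) and the function names are assumptions for this example, not the format of any existing etcd script.

```python
TARGET_PERCENT = 10  # failure-rate target proposed above
WINDOW = 100         # "last 100 test results"


def flakiness_percent(results, window=WINDOW):
    """Percentage of failures among the most recent `window` results.

    `results` is a hypothetical list of "success"/"failure" strings, newest first.
    """
    recent = results[:window]
    if not recent:
        raise ValueError("no test results to measure")
    failed = sum(1 for r in recent if r == "failure")
    return 100 * failed / len(recent)


def over_target(results, target=TARGET_PERCENT):
    """True when measured flakiness exceeds the target and flakes need triage."""
    return flakiness_percent(results) > target


# Example: the latest page mentioned above, 7 failures out of 32 runs.
sample = ["failure"] * 7 + ["success"] * 25
print(f"{flakiness_percent(sample):.0f}% flaky, over target: {over_target(sample)}")
```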

TODO:

Create a script to measure flakiness (@karuppiah7890)
Create a script to identify flaky tests
- Export JUnit report from tests (https://github.com/etcd-io/etcd/pull/13112)
- Upload the reports to GitHub artifacts (https://github.com/etcd-io/etcd/pull/13152)
- Implement a script that analyses the reports
Automate the process
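The "analyse the reports" step could look roughly like the sketch below. It assumes standard JUnit XML (`<testcase>` elements with a `<failure>` or `<error>` child) and counts, per test, the number of runs in which it failed. The function names are hypothetical, and this is not the script that was eventually written for etcd.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def failed_tests(junit_xml: str):
    """Yield "classname.name" for every failed or errored test case in one report."""
    root = ET.fromstring(junit_xml)
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            classname = case.get("classname", "")
            name = case.get("name", "")
            yield f"{classname}.{name}" if classname else name


def rank_failures(reports):
    """Count in how many reports (runs) each test failed, most frequent first."""
    counts = Counter()
    for xml_text in reports:
        # set(): a test counts at most once per run, even if reported twice.
        counts.update(set(failed_tests(xml_text)))
    return counts.most_common()
```

Since the reports come from commits that already passed review on the main branch, a test near the top of this ranking is a likely flake candidate rather than a consistently broken test.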

cc @hexfusion @Rajalakshmi-Girish

karuppiah7890 commented 3 years ago

This sounds pretty interesting, and it should alleviate a lot of pain and improve the developer experience!

karuppiah7890 commented 3 years ago

I was able to get a basic bash script using GitHub GraphQL API - https://github.com/karuppiah7890/issues-info/blob/main/etcd-io/etcd/issue-13167/find-flaky-tests-data.sh . It gives data like this - https://github.com/karuppiah7890/issues-info/blob/main/etcd-io/etcd/issue-13167/commit-and-check-data.json

karuppiah7890 commented 3 years ago

I'm able to get the number of successes, and we can get failures too. Given the total (for example 100) and either one of those (successes or failures), we get the other value as well.

serathius commented 3 years ago

Great! Would you be interested in sending a PR that adds it to the etcd scripts?

karuppiah7890 commented 3 years ago

Sure @serathius! I was also wondering if I should try a Go version too, so anyone can run it with just "go run" or similar on any platform, with no need to worry about the OS, whether a bash shell is available, whether other tools are installed, etc. What do you think?

serathius commented 3 years ago

Letting everyone run it is a good initiative, but on the other hand, long term we should just automate it. Most scripts are already written in bash, and I don't think there is any need to invest too much in this script. It should be simple enough (2-3 commands) that it can be replaced when needed.

I think it would make sense to revisit those improvements once we have established the whole process and automated it.

karuppiah7890 commented 3 years ago

Makes sense @serathius! 👍 I'll raise the PR and we can discuss the bash script further there.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

karuppiah7890 commented 3 years ago

Commenting to keep the issue from being closed.

Rajalakshmi-Girish commented 2 years ago

I still see flakiness when unit tests are run.

endocrimes commented 2 years ago

I hacked together a tool for finding/tracking/fixing flakes the other day: https://github.com/endocrimes/etcd-test-analyzer

Because it parses all of the test results from every run in a given time period, it is relatively easy to modify in place to ask new questions, but it definitely isn't widely useful in its current form.

serathius commented 2 years ago

Status update: running ./scripts/measure-test-flakiness.sh gave me:

Commit status failure percentage is - 23 %

So in the last 100 merged commits we got 23 test failures. Excluding 7 coverage failures (which do not block merging) and 2 recent failures due to a post-merge bug (https://github.com/etcd-io/etcd/pull/14101), we get 14% flakiness.

Going down from around 50% to 14% is a great result!! Thanks to everyone who helped.
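The adjustment works out as follows, taking the 23% reported by the script over a window of 100 commits. The variable names are only for illustration.

```python
window = 100               # merged commits inspected by the script
reported_failures = 23     # "Commit status failure percentage is - 23 %"
coverage_failures = 7      # coverage job failures, which do not block merging
known_bug_failures = 2     # failures caused by the post-merge bug

# Only failures that are neither coverage nor known-bug failures count as flakes.
flaky_failures = reported_failures - coverage_failures - known_bug_failures
flakiness_percent = 100 * flaky_failures / window

print(f"{flaky_failures} flaky failures out of {window} = {flakiness_percent:.0f}%")
```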

serathius commented 2 years ago

Looking into failures from the last 100 runs (excluding coverage and known issues), we get:

4 failures in e2e tests of TestDowngradeUpgradeClusterOf3 (example) - @serathius
4 failures in functional tests of BLACKHOLE_PEER_PORT_TX_RX_LEADER (example)
4 failures in functional tests of NO_FAIL_WITH_NO_STRESS_FOR_LIVENESS (example)
2 timeouts in grpcproxy test of TestLeasingReconnectOwnerConsistency (example)
2 failures in grpcproxy test of TestWatchCancelOnServer (example)
1 failure in integration test of TestDropReadUnderNetworkPartition (example) (possible goroutine leak in previous test)
1 failure in integration test of TestBalancerUnderNetworkPartitionTxn
1 failure in integration test of TestAuthority (example)
1 timeout in grpcproxy test of TestLeasingReconnectNonOwnerGet (example)
1 failure in integration tests of TestMaxLearnerInCluster (example)
1 failure in functional tests of DELAY_PEER_PORT_TX_RX_LEADER_UNTIL_TRIGGER_SNAPSHOT (example)
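Grouping the list above by test suite gives a quick view of where flakes concentrate. The counts below are copied from the list, and the grouping itself is just an illustrative sketch.

```python
from collections import Counter

# (count, suite, test) tuples copied from the list above.
failures = [
    (4, "e2e", "TestDowngradeUpgradeClusterOf3"),
    (4, "functional", "BLACKHOLE_PEER_PORT_TX_RX_LEADER"),
    (4, "functional", "NO_FAIL_WITH_NO_STRESS_FOR_LIVENESS"),
    (2, "grpcproxy", "TestLeasingReconnectOwnerConsistency"),
    (2, "grpcproxy", "TestWatchCancelOnServer"),
    (1, "integration", "TestDropReadUnderNetworkPartition"),
    (1, "integration", "TestBalancerUnderNetworkPartitionTxn"),
    (1, "integration", "TestAuthority"),
    (1, "grpcproxy", "TestLeasingReconnectNonOwnerGet"),
    (1, "integration", "TestMaxLearnerInCluster"),
    (1, "functional", "DELAY_PEER_PORT_TX_RX_LEADER_UNTIL_TRIGGER_SNAPSHOT"),
]

by_suite = Counter()
for count, suite, _test in failures:
    by_suite[suite] += count

for suite, count in by_suite.most_common():
    print(f"{suite}: {count}")
print(f"total: {sum(by_suite.values())}")
```

In this list the functional tests account for 9 of the 22 failures, followed by grpcproxy with 5.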

serathius commented 2 years ago

As there are a lot of tests, it would be great to get some help. Please let me know if you are interested in tackling one of the tests listed.

serathius commented 2 years ago

Status: 28% flakiness https://github.com/etcd-io/etcd/actions/runs/3075242479/jobs/4968492878

chaochn47 commented 2 years ago

Thanks for raising this issue. It is really annoying for any etcd contributor when unrelated tests fail.

I can take TestDowngradeUpgradeClusterOf3 because I just ran into it in https://github.com/etcd-io/etcd/pull/14331. It's also a good opportunity to learn how downgrade works.

Tracking this in https://github.com/etcd-io/etcd/issues/14540

serathius commented 1 year ago

I noticed a recent increase in flakes (at least in my PRs). From https://github.com/etcd-io/etcd/actions/runs/4394774437/jobs/7696017126 we see 26% flakiness.

Loved the recent initiative by @chaochn47 to use the tools developed by @endocrimes in https://github.com/etcd-io/etcd/pull/15501.

It would be great to integrate them into https://github.com/etcd-io/etcd/actions/workflows/measure-test-flakiness.yaml. @chaochn47, would you be interested in this?

chaochn47 commented 1 year ago

Yeah, I can help add it to the existing workflow. ETA next Monday.

nitishfy commented 8 months ago

Hi, I'd like to work on this!

serathius commented 8 months ago

Thanks @nitishfy for your interest. The issue was created some time ago, so not everything is up to date; however, the high-level goals remain relevant. We want to improve our visibility into test flakes so we can fix them more effectively.

For the original plan, we instrumented etcd e2e tests to export JUnit reports, and @endocrimes and @karuppiah7890 implemented custom scripts to analyse them. This approach allowed us to start reporting flakes and manually creating issues to fix them.

One thing we can do better is to avoid developing our own scripting. The etcd community is not very big, so we want to avoid spreading ourselves too thin by maintaining too many custom tools. With the introduction of SIG-etcd we now have an option to benefit from the whole ecosystem of tools built by the Kubernetes community. We should do that.

One example of such a tool is testgrid, a test result visualization tool that uses the same JUnit reports to create a grid showing which tests passed and which failed. It makes it really easy to track flakes. For example: https://testgrid.k8s.io/sig-etcd-periodics#ci-etcd-e2e-amd64
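To show why a grid view makes flakes easy to spot, here is a toy sketch of the idea. It is not how testgrid is implemented; the input format (one dict of test name to pass/fail per CI run) and the function names are assumptions made for this example.

```python
def build_grid(runs):
    """Build one row per test with one cell per CI run.

    `runs` is a hypothetical list of {test_name: passed} dicts, oldest run first.
    Cells: "P" passed, "F" failed, "." the test did not run in that CI run.
    """
    tests = sorted({name for run in runs for name in run})
    grid = {}
    for name in tests:
        grid[name] = "".join(
            "." if name not in run else ("P" if run[name] else "F") for run in runs
        )
    return grid


def flaky_tests(grid):
    """Tests that both passed and failed somewhere in the window."""
    return [name for name, row in grid.items() if "P" in row and "F" in row]


# Made-up results for three runs, purely for illustration.
runs = [
    {"TestA": True, "TestB": True},
    {"TestA": False, "TestB": True},
    {"TestA": True, "TestB": True, "TestC": False},
]
for name, row in build_grid(runs).items():
    print(f"{name:8} {row}")
print("flaky:", flaky_tests(build_grid(runs)))
```

A row that mixes passes and failures stands out immediately, whereas a test that only ever fails is more likely broken than flaky.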

I think we should work more on integrating with K8s tools. This first requires migrating etcd testing to Prow, the K8s CI tool. This work can be tracked in https://github.com/kubernetes/k8s.io/issues/6102.

In the meantime, we could ensure that all etcd tests generate a JUnit report that can be used later.

Looking at the GitHub workflows, only https://github.com/etcd-io/etcd/blob/main/.github/workflows/tests-template.yaml sets JUNIT_REPORT_DIR and exports JUnit files (https://github.com/etcd-io/etcd/blob/11ff2644f2378e80a461d7dacfe3ad151c37f26e/.github/workflows/tests-template.yaml#L69-L73). We should look into adding this to more test scenarios.