etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Measure testgrid flakiness didn't detect flakes that happened #17773

Open serathius opened 5 months ago

serathius commented 5 months ago

What would you like to be added?

https://github.com/etcd-io/etcd/actions/runs/8584759086/job/23525383783

cc @siyuanfoundation

Why is this needed?

The last run https://github.com/etcd-io/etcd/actions/runs/8584759086/job/23525383783 on April 7th didn't detect a flake from April 4th.

siyuanfoundation commented 5 months ago

Flake detection is meant to catch tests that fail sometimes, not one-off failures. This test fails about 2% of the time. @serathius Do you think a 2% threshold is reasonable?

serathius commented 5 months ago

Hmm, not sure. The 16% flakiness on the main branch doesn't seem great: https://github.com/etcd-io/etcd/actions/runs/8584643093/job/23525115058.

siyuanfoundation commented 5 months ago

I think the 16% flakiness on the main branch includes all the workflows on a PR. I am seeing a lot of flakiness with respect to arm64.

jmhbnz commented 5 months ago

> I think the 16 % flakiness on main branch includes all the workflows on a PR. I am seeing a lot of flakiness wrt arm64.

Raised a flake issue for TestMemberAdd e2e on arm64 and amd64. I have seen it fail a few times in GitHub Actions for arm64, and there are also instances in TestGrid for amd64 on Prow.

https://github.com/etcd-io/etcd/issues/17778

serathius commented 5 months ago

> The flaky detection is meant to detect tests that fails sometimes, not one-off failures. this test fails about 2% of the time. @serathius Do you think 2% threshold is reasonable?

Maybe we could improve visibility. What surprised me was the fact that the tool didn't mention any flakes at all. Could we log flakes below 2%, with a note that the rate is too low to file an issue?

serathius commented 4 months ago

The reports are very nice.

My suggestions:

To go into more detail, let's define a measure of bad contributor experience due to CI, something like time wasted on CI to merge a PR. I would call this TTM (time to merge), a reflection of both how long it takes to test a PR and the flakiness of those tests. I would expect TTM to equal something like `max(TSDi^(1+TSFi) for each i)`, where `TSDi` is the duration of test suite i and `TSFi` is the flakiness of test suite i. Because retries are done at the test suite level, we need to count this per suite. If we set a target for TTM, different suites might have different acceptable flakiness, as it's easier and faster to retry a 1-minute test than a 30-minute one. Of course this assumes the time to notice a failure and retry is zero, which is a simplification. Still, this is my high-level mental model of the problem. If we include TTR (time to retry), then `TTM = max(TSDi^TSFi + (TSDi + TTR)^TSFi for each i)`.