etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Measure testgrid flakiness didn't detect flakes that happened #17773

Open serathius opened 5 months ago

serathius commented 5 months ago

What would you like to be added?

https://github.com/etcd-io/etcd/actions/runs/8584759086/job/23525383783

cc @siyuanfoundation

Why is this needed?

The last run https://github.com/etcd-io/etcd/actions/runs/8584759086/job/23525383783 on April 7th didn't detect a flake from April 4th.

siyuanfoundation commented 5 months ago

Flake detection is meant to catch tests that fail sometimes, not one-off failures. This test fails about 2% of the time. @serathius Do you think a 2% threshold is reasonable?

serathius commented 5 months ago

Hmm, not sure. The 16% flakiness on the main branch doesn't seem great: https://github.com/etcd-io/etcd/actions/runs/8584643093/job/23525115058.

siyuanfoundation commented 5 months ago

I think the 16% flakiness on the main branch includes all the workflows on a PR. I am seeing a lot of flakiness with respect to arm64.

jmhbnz commented 5 months ago

> I think the 16 % flakiness on main branch includes all the workflows on a PR. I am seeing a lot of flakiness wrt arm64.

Raised a flake issue for TestMemberAdd e2e on arm64 and amd64. I have seen it fail a few times in GitHub Actions for arm64, and there are also instances in TestGrid for amd64 on Prow.

https://github.com/etcd-io/etcd/issues/17778

serathius commented 5 months ago

> The flaky detection is meant to detect tests that fails sometimes, not one-off failures. this test fails about 2% of the time. @serathius Do you think 2% threshold is reasonable?

Maybe we could improve visibility. What surprised me was the fact that the tool didn't mention any flakes at all. Could we log flakes below 2%, with a note that the rate is too low to file an issue?

serathius commented 4 months ago

The reports are very nice.

My suggestions:

To go into more detail, let's define a measure of bad contributor experience due to CI, something like time wasted on CI to merge a PR. I would call this TTM (time to merge), a reflection of both how long it takes to test a PR and the flakiness of those tests. I would expect TTM to equal something like `max(TSDi^(1+TSFi) for each i)`, where `TSDi` is the duration of test suite i and `TSFi` is the flakiness of test suite i. Because retries are done at the test suite level, we need to count this per suite. If we set a target for TTM, different suites might have different acceptable flakiness, as it's easier and faster to retry a 1-minute test than a 30-minute one. Of course this assumes the time to notice a failure and retry is zero, which is a simplification. Still, this is my high-level mental model of the problem. If we include TTR (time to retry), then `TTM = max(TSDi^TSFi + (TSDi + TTR)^TSFi for each i)`.