etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0
47.65k stars 9.75k forks source link

Flaky test `TestRevisionPause ` #17548

Open ahrtr opened 7 months ago

ahrtr commented 7 months ago

Which Github Action / Prow Jobs are flaking?

pull-etcd-unit-test

Which tests are flaking?

TestRevisionPause

Github Action / Prow Job link

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/etcd-io_etcd/17546/pull-etcd-unit-test/1765815515617431552

Reason for failure (if possible)

{Failed  === RUN   TestRevisionPause
    revision_test.go:102: len(actions) = 0, expected >= 1
--- FAIL: TestRevisionPause (10.04s)
}

Anything else we need to know?

No response

ah8ad3 commented 7 months ago

I might be able to fix this if it's ok i'll take a look at this.

ahrtr commented 7 months ago

/assign @ah8ad3

Thanks

ah8ad3 commented 7 months ago

This issue can be produced by making this https://github.com/etcd-io/etcd/blob/ddf54715bf47b8c7ec2f1257b25cd0981d081ae9/client/pkg/testutil/recorder.go#L90 timeout 0 in local. We can increase this timeout and hope it got fixed, or we can rewrite some parts of code (idk how much of work maybe rewriting test or even change the revision code to make it less dependable on timeout). What do you think? @ahrtr

qincwang commented 7 months ago

This might be a small chance flakiness problem, as I run the tests 10k+ times in local

image

we can bump up timeout in this test's recorder to be 10 secs, this would further lower the chance of flakiness, as for test performance concerns, since this is rarely flaky and successful test runs will not exceed the timeout, overall test performance will not be affected

created this PR to increase the timeout https://github.com/etcd-io/etcd/pull/17621 @ah8ad3 @ahrtr

ivanvc commented 7 months ago

Hi @qincwang, could you run the tests with stress? The instructions on how to do it are in the etcd community meeting document. If you left it running for a while, it would give a failure percentage.

qincwang commented 7 months ago

Hi @ivanvc running the command for 1h40mins & 1million+ runs for this orignal test and get first error , referring this PR https://github.com/etcd-io/etcd/pull/17586, will try to remove timemout dependency on this and keep stress run for more time to validate, could I be assigned this issue, thx image

jmhbnz commented 7 months ago

Hi team - Thank you for your progress on this issue.

Speaking as a sig-etcd co-chair I want to please ask that we all be mindful of issue assignments in future. This issue was assigned to @ah8ad3.

In the interest of growing our collaborative community I would have preferred the pull request resolving this issue to have come from the initial assignee who volunteered to work on this first. Normally if an issue is already assigned we would expect other contributors to ask permission before also raising any pr for the same issue.

Given work has already been happening in this instance, I will also assign @qincwang in the hope you can both work together to finalise any testing or pull requests. Additionally please consider co-authoring commits in github to ensure all contributors are recognised.

Thank you for your kind consideration 🙏🏻

/assign @qincwang

ah8ad3 commented 7 months ago

Thanks @jmhbnz for clarification. Since i'm new to this community i was waiting for the @ahrtr response but anyway i'll try to stress it.

qincwang commented 7 months ago

Hi Team, sorry for any confusion or concerns. I am also new to the community and initialized the PR in the hope of some more concrete discussions around the code. Will make sure all work and contributors are regonized in the PR.

ahrtr commented 7 months ago

Sorry for the late response. Will take a closer look at this issue sometime next week.

ahrtr commented 7 months ago

It seems that we just need to follow the same way as https://github.com/etcd-io/etcd/pull/17513 ? Let's get #17513 merged firstly.

ah8ad3 commented 7 months ago

Sorry i was away for few days, i run tests with stress and race couldn't reproduce the error. Code here is pretty old (It got refactored https://github.com/etcd-io/etcd/pull/8123 and it may be older than 7 years). i think we need more of https://github.com/etcd-io/etcd/pull/17513 ,thanks for pointing that out @ahrtr, for these parts to refactor and make them less dependence on timeouts.