Flaky timeout in github Python unit test action StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer

apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.

https://beam.apache.org/

Apache License 2.0

Flaky timeout in github Python unit test action StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer #21706

Closed: damccorm closed this 2 weeks ago

damccorm commented 2 years ago

Causing unit test failures in nightly

https://github.com/apache/beam/runs/6150980228?check_suite_focus=true#step:6:112

https://github.com/apache/beam/runs/6143954692?check_suite_focus=true#step:6:308

https://github.com/apache/beam/runs/6137479897?check_suite_focus=true#step:6:112

    @pytest.mark.timeout(3)
    def test_dynamic_timer_clear_then_set_timer(self):
    E   Failed: Timeout >3.0s

Imported from Jira BEAM-14367. Original Jira may contain additional context. Reported by: yihu.

damccorm commented 2 years ago

Unable to assign user @AnandInguva. If able, self-assign, otherwise tag @damccorm so that he can assign you. Because of GitHub's spam prevention system, your activity is required to enable assignment in this repo.

Abacn commented 2 years ago

Still observed after #17569 though less frequently: https://github.com/apache/beam/runs/6760704625?check_suite_focus=true

aaltay commented 2 years ago

Who would be a good owner for this issue? @tvalentyn @AnandInguva ?

> Unable to assign user @AnandInguva.

@damccorm - What is limitation for doing that?

damccorm commented 2 years ago

> @damccorm - What is limitation for doing that?

It requires a user to interact with the issue or have some level of permission (e.g. committer) in the repo before assignment.

aaltay commented 2 years ago

> > @damccorm - What is limitation for doing that?

> It requires a user to interact with the issue or have some level of permission (e.g. committer) in the repo before assignment.

Thank you. Could we clarify this somewhere, and also provide guidance for people what should they do if they cannot assign an issue. (Example a similar issue Sachin faced here https://github.com/apache/beam/issues/21741 - created the issue but could not assign to Reza. I was able to assign it. I guess because both I and Reza are committers?)

damccorm commented 2 years ago

> Thank you. Could we clarify this somewhere, and also provide guidance for people what should they do if they cannot assign an issue. (Example a similar issue Sachin faced here https://github.com/apache/beam/issues/21741 - created the issue but could not assign to Reza. I was able to assign it. I guess because both I and Reza are committers?)

Sure! I already had a pr to add some automation to make this easier with other doc changes, so I bundled this guidance into that change: https://github.com/apache/beam/pull/21719

aaltay commented 2 years ago

> > Thank you. Could we clarify this somewhere, and also provide guidance for people what should they do if they cannot assign an issue. (Example a similar issue Sachin faced here #21741 - created the issue but could not assign to Reza. I was able to assign it. I guess because both I and Reza are committers?)

> Sure! I already had a pr to add some automation to make this easier with other doc changes, so I bundled this guidance into that change: #21719

Thank you. This is an improvement. I understand why this is a limitation in general, but it is a bit unfortunate. I feel like we will end up with the workflow of file an issue, and tag someone to comment so that they can assign the bug to themselves, but until they do that the dashboards etc will have issues without owners. I think this is an acceptable trade off, but if you can think of any improvements I would take it :)

AnandInguva commented 2 years ago

> Who would be a good owner for this issue? @tvalentyn @AnandInguva ?

I will take a look and assign to the appropriate people. Thanks

AnandInguva commented 2 years ago

@pabloem Are you the owner of this test? If yes, could you take a look at why this is flaky? Thanks

kennknowles commented 2 years ago

Clearing the milestone field since this doesn't seem like it is a release blocker. Would this flaking test possibly indicate a problem that makes the release non-functional?

pabloem commented 2 years ago

agreed with Kenn that this is not a 2.40.0 blocker

pabloem commented 2 years ago

I've run this many, many times on my laptop without getting it to fail. I'll close this for now.
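
A minimal sketch of one way to rerun a test body repeatedly under a watchdog, in the spirit of the 3-second `@pytest.mark.timeout(3)` shown in the report. `count_timeouts` and `sometimes_slow` are hypothetical names, not Beam or pytest APIs, and the stand-in callable only simulates an occasional slow run; it says nothing about the real cause of the flake.

```python
import itertools
import threading
import time


def count_timeouts(fn, runs, timeout_s):
    """Call fn `runs` times, each in its own thread, and count the runs
    that are still executing after timeout_s seconds."""
    timeouts = 0
    for _ in range(runs):
        # Daemon threads so a hung run cannot keep the process alive.
        worker = threading.Thread(target=fn, daemon=True)
        worker.start()
        worker.join(timeout_s)
        if worker.is_alive():
            timeouts += 1
    return timeouts


# Hypothetical stand-in for a flaky test body: every fifth call is slow.
_calls = itertools.count()


def sometimes_slow():
    if next(_calls) % 5 == 0:
        time.sleep(0.5)


if __name__ == "__main__":
    print(count_timeouts(sometimes_slow, runs=10, timeout_s=0.1))
```

A loop like this on an idle laptop may never hit the slow path that a heavily loaded CI worker does, which would be consistent with not reproducing the failure locally.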

Abacn commented 2 years ago

This only flakes on Jenkins, e.g. https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/6234/

Abacn commented 2 years ago

I think if it is specific to our (poor) test infrastructure, we can downgrade the priority and leave it open for tracking.

pabloem commented 2 years ago

thanks @Abacn !

kennknowles commented 2 years ago

There is no sign of an infrastructure failure. This appears to be an actually-possible execution that fails. This suggests a race condition to me, and the heavy load on the Jenkins workers causes it.

kennknowles commented 2 years ago

Based on it being a timeout I would guess deadlock (assuming 3 seconds is many orders of magnitude more than a successful run takes). Something to do with a concurrent map used in the dynamic timers? I don't know what this looks like in the Python codebase.
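
For illustration only, and not based on the Beam codebase: a sketch of how a lock-ordering deadlock shows up as nothing more than a timeout. Two threads take two locks in opposite order; the barriers force the bad interleaving that heavy load would otherwise only produce occasionally. All names here (`opposite_order_deadlocks`, `lock_a`, `lock_b`) are hypothetical, and the timed `acquire` is only there so the demo terminates instead of hanging.

```python
import threading


def opposite_order_deadlocks(wait_s=0.5):
    """Return True if neither thread could take its second lock."""
    lock_a, lock_b = threading.Lock(), threading.Lock()
    holding_first = threading.Barrier(2)  # both threads hold their first lock
    attempts_done = threading.Barrier(2)  # both threads finished their attempt
    got_second = []

    def worker(first, second):
        with first:
            holding_first.wait()
            # A real deadlock would block here forever; the timeout
            # lets the demonstration finish and report the outcome.
            acquired = second.acquire(timeout=wait_s)
            if acquired:
                second.release()
            got_second.append(acquired)
            # Keep holding `first` until the other thread has also given up,
            # so neither attempt can succeed late.
            attempts_done.wait()

    threads = [
        threading.Thread(target=worker, args=(lock_a, lock_b)),
        threading.Thread(target=worker, args=(lock_b, lock_a)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return not any(got_second)


if __name__ == "__main__":
    print(opposite_order_deadlocks())
```

Without the barriers, the same code would usually pass, because one thread tends to finish before the other starts; that is the sense in which load can turn a latent ordering bug into an intermittent timeout.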

chamikaramj commented 2 years ago

Hi folks,

Is this an actual blocker for the 2.43.0 release? It seems like it was added to the milestone automatically.

kennknowles commented 1 year ago

@pabloem Any update? Are you working on this right now?

kennknowles commented 1 year ago

I'm assuming lack of response means that this is on the back burner and could be unassigned?

damccorm commented 2 weeks ago

I think this is fixed. If not, it should get auto-flagged by our tooling anyway, so this should be safe to close.