dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License
1.57k stars 718 forks source link

test_worker_heartbeat_after_cancel hangs on MacOS 3.7/3.8 #5056

Closed fjetter closed 3 years ago

fjetter commented 3 years ago

With the latest CI improvements, I receive failures which look like this. I assume this is a GH actions script this is pointing to but I have no idea what this particular error is about.

@crusaderky have you encountered anything like this in your tests?

https://github.com/fjetter/distributed/runs/3064775791?check_suite_focus=true

/Users/runner/work/_temp/eec694e8-9cbc-41e2-9505-f187841b5aa3.sh: line 9:  6519 Abort trap: 6           pytest distributed -m "not avoid_ci" --runslow --junitxml reports/pytest.xml -o junit_suite_name=$TEST_ID
distributed/tests/test_scheduler.py::test_worker_heartbeat_after_cancel 
Error: Process completed with exit code 134.
fjetter commented 3 years ago

I found https://github.com/actions/virtual-environments/issues/1296 but can't really make any sense of it

fjetter commented 3 years ago

Seems to consistently affect OSX py3.7 + py3.8

crusaderky commented 3 years ago

That's pytest-timeout that just sent SIGABRT. I was hoping it would kill off the individual test on MacOS like it does on Linux. Clearly it's not like that. I will revert to timeout_method = thread for MacOS.

fjetter commented 3 years ago

fwiw, the signal method works on my machine w/ OSX

crusaderky commented 3 years ago

I'm getting it too on my own branches. Can't figure out why having a SIGALRM can possibly cause the test to time out, so I suspect it's a red herring - e.g. it will continue to fall over once we disable SIGALARM, just with the old error message.

That test was not touched in #5022. Also the fact that pytest-timeout is kicking in, much after the 60s timeout of gen_cluster, strongly hints at the fact that it's gen_cluster, not the test itself, that's getting stuck. I suspect it has to do with having 10 workers, except that there are other tests with 8-10 workers and they aren't glitching. Also can't figure out why it never cropped up during my last iteration of stress tests.

crusaderky commented 3 years ago

And in fact, without SIGABRT we get the exact same behaviour: always fails on MacOS 3.7 and 3.8 https://github.com/dask/distributed/pull/5057/checks?check_run_id=3067949864

crusaderky commented 3 years ago

Wait. The test was added yesterday in #5053. I think that the CI changes in #5022 are completely coincidental.

crusaderky commented 3 years ago

5057 takes this off the critical path. I suggest renaming this issue to "test_worker_heartbeat_after_cancel hangs on MacOS 3.7/3.8"

fjetter commented 3 years ago

I couldn't reproduce the hanging test locally. However, what's interesting, is that I could see a lot of "loop unresponsive" messages when running the test under a lot of load. Still, idk why it hung. The new test logic is less prone to instabilities