fjetter closed this 3 years ago
I found https://github.com/actions/virtual-environments/issues/1296 but can't really make any sense of it
Seems to consistently affect OSX py3.7 + py3.8
That's pytest-timeout that just sent SIGABRT. I was hoping it would kill off the individual test on MacOS like it does on Linux. Clearly it's not like that. I will revert to timeout_method = thread for MacOS.
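For anyone unfamiliar with the signal approach, here is a minimal sketch of the general mechanism: arm a SIGALRM timer, and let the handler raise inside the main thread so only the running function is interrupted. This is not pytest-timeout's actual code; `AlarmTimeout`, `run_with_alarm` and `slow` are hypothetical names made up for illustration, and it only works on Unix-like systems when called from the main thread.

```python
import signal
import time


class AlarmTimeout(Exception):
    """Hypothetical exception raised when the alarm fires."""


def _on_alarm(signum, frame):
    # Runs in the main thread when SIGALRM is delivered.
    raise AlarmTimeout("function exceeded its time budget")


def run_with_alarm(func, seconds):
    """Call func(), raising AlarmTimeout if it runs longer than `seconds`."""
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return func()
    finally:
        # Disarm the timer and restore whatever handler was installed before.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)


def slow():
    time.sleep(5)
    return "finished"


try:
    outcome = run_with_alarm(slow, 0.1)
except AlarmTimeout:
    outcome = "timed out"
```

The point of the sketch is that a signal-based timeout unwinds just the offending call and lets the process carry on, which is the behaviour hoped for above.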
fwiw, the signal method works on my machine w/ OSX
I'm getting it too on my own branches. I can't figure out how SIGALRM could possibly cause the test to time out, so I suspect it's a red herring, i.e. it will continue to fall over once we disable SIGALRM, just with the old error message.
That test was not touched in #5022. Also the fact that pytest-timeout is kicking in, much after the 60s timeout of gen_cluster, strongly hints at the fact that it's gen_cluster, not the test itself, that's getting stuck. I suspect it has to do with having 10 workers, except that there are other tests with 8-10 workers and they aren't glitching. Also can't figure out why it never cropped up during my last iteration of stress tests.
And in fact, without SIGABRT we get the exact same behaviour: always fails on MacOS 3.7 and 3.8 https://github.com/dask/distributed/pull/5057/checks?check_run_id=3067949864
Wait. The test was added yesterday in #5053. I think that the CI changes in #5022 are completely coincidental.
I couldn't reproduce the hanging test locally. However, what's interesting is that I could see a lot of "loop unresponsive" messages when running the test under a lot of load. Still, idk why it hung. The new test logic is less prone to instabilities.
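As a generic illustration only (not a diagnosis of this hang): an asyncio-level timeout needs a responsive event loop to fire, so a blocked loop can sail past an inner timeout and leave only an outer watchdog to notice. The sketch below uses plain `asyncio.wait_for`; whether gen_cluster's 60s timeout works this way is an assumption on my part, and `blocking_step` and `main` are hypothetical names.

```python
import asyncio
import time


async def blocking_step():
    # Synchronous sleep: the event loop cannot run anything else meanwhile,
    # including the callback that would enforce the timeout.
    time.sleep(0.3)


async def main():
    start = time.monotonic()
    try:
        await asyncio.wait_for(blocking_step(), timeout=0.05)
    except asyncio.TimeoutError:
        pass
    # Either way, control only comes back once the blocking call has finished.
    return time.monotonic() - start


elapsed = asyncio.run(main())
print(f"requested timeout 0.05s, actually waited {elapsed:.2f}s")
```

The wait lasts the full blocking duration rather than the requested timeout, which is why a loop that is unresponsive for long stretches can defeat timeouts that live inside it.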
With the latest CI improvements, I get failures that look like this. I assume the error points to a GH Actions script, but I have no idea what this particular error is about.
@crusaderky have you encountered anything like this in your tests?
https://github.com/fjetter/distributed/runs/3064775791?check_suite_focus=true