fjetter opened this issue 2 years ago
Does this test include #6591?
I see that you ran on fjetter/distributed. I have a feeling that whenever I run on crusaderky/distributed I get a lot more random failures than on dask/distributed - likely because, if you're in the free tier, you have a higher chance of getting older/cheaper hardware?
I strongly suspect a lot of failures may be related to https://github.com/dask/distributed/pull/6271. How many of the failed tests were using nannies? How many were using more than 2 nannies? CI hosts mount 2 CPUs each.
I think we could figure out a straightforward way to mass-skip all tests decorated with `@gen_cluster(Worker=Nanny)` and see how many failures remain.
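One way to sketch such a mass-skip is a `conftest.py` collection hook. This is a hypothetical sketch: how the worker class would be discovered is an assumption for illustration (here we pretend the decorator stores its kwargs on the test function as `_cluster_kwargs`; `gen_cluster`'s real internals may differ).

```python
# Hypothetical conftest.py sketch: mass-skip tests that request Nanny workers.
# The `_cluster_kwargs` attribute is assumed for illustration only.
import pytest


def uses_nanny(cluster_kwargs):
    """True if the (assumed) gen_cluster kwargs request Nanny workers."""
    worker = cluster_kwargs.get("Worker")
    return getattr(worker, "__name__", None) == "Nanny"


def pytest_collection_modifyitems(config, items):
    marker = pytest.mark.skip(reason="uses Nanny workers; skipped in stress run")
    for item in items:
        if uses_nanny(getattr(item.function, "_cluster_kwargs", {})):
            item.add_marker(marker)
```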
> Does this test include https://github.com/dask/distributed/pull/6591?

yes
> likely because if you're in the free tier you have a higher chance of getting older/cheaper hardware?

the dask project is also on the free tier, isn't it?
Most of the windows tests are failing because of a disk permission problem during cleanup. @graingert suggested that using the pytest fixtures instead of tempfile would help with this.
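A hedged sketch of the suggested change (the test name and file contents are made up, not from the actual suite): instead of tearing down a `tempfile.TemporaryDirectory` inside the test, which can race with still-open file handles on Windows and raise `PermissionError`, the `tmp_path` fixture lets pytest own the directory lifecycle and clean up lazily.

```python
# Before (illustrative; prone to PermissionError on Windows teardown):
#
# def test_worker_spills_to_disk():
#     with tempfile.TemporaryDirectory() as local_dir:
#         ...  # cleanup here can fail while a handle is still held
#
# After: pytest provides tmp_path and removes it lazily after the session.
def test_worker_spills_to_disk(tmp_path):
    local_dir = tmp_path / "dask-scratch"
    local_dir.mkdir()
    (local_dir / "spill.bin").write_bytes(b"\x00" * 16)
    assert (local_dir / "spill.bin").stat().st_size == 16
```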
New update based on a8eb3b23b8fe91f52758db155e7151e3d516cbdc
using branch https://github.com/fjetter/distributed/tree/stress_ci
again running 10 iterations, see https://github.com/fjetter/distributed/commit/3fe45b401b9d31c50cb646f06421e067a6657fab
Test reports and summary were generated with the code available here: https://github.com/fjetter/distributed/commit/6be77905ce4e880e084b8095657a026fda99d409
The test report is again available at https://gistpreview.github.io/?ecc2cdddf651df9ee0c7966e210c9093/a8eb3b23b8fe91f52758db155e7151e3d516cbdc.html
| (OS, Python) | cancelled | failure | success | total | success_rate |
|---|---|---|---|---|---|
| ('macos-latest', '3.10') | 0 | 7 | 3 | 10 | 0.3 |
| ('macos-latest', '3.8') | 0 | 10 | 0 | 10 | 0 |
| ('ubuntu-latest', '3.10') | 0 | 0 | 10 | 10 | 1 |
| ('ubuntu-latest', '3.8') | 0 | 2 | 8 | 10 | 0.8 |
| ('ubuntu-latest', '3.9') | 0 | 0 | 10 | 10 | 1 |
| ('windows-latest', '3.10') | 0 | 5 | 5 | 10 | 0.5 |
| ('windows-latest', '3.8') | 1 | 1 | 8 | 10 | 0.8 |
| ('windows-latest', '3.9') | 3 | 2 | 5 | 10 | 0.5 |
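A minimal sketch of how such a per-(OS, Python) summary can be derived from a flat list of job outcomes with pandas; the job records below are made up for illustration, not the actual CI data.

```python
import pandas as pd

# Made-up job outcomes: one row per CI job.
jobs = pd.DataFrame(
    {
        "os": ["ubuntu-latest"] * 3 + ["windows-latest"] * 3,
        "python": ["3.10"] * 6,
        "status": ["success", "success", "failure", "success", "cancelled", "failure"],
    }
)

# Count statuses per (os, python) cell and derive totals and success rate.
counts = (
    jobs.groupby(["os", "python"])["status"]
    .value_counts()
    .unstack(fill_value=0)
    .reindex(columns=["cancelled", "failure", "success"], fill_value=0)
)
counts["total"] = counts.sum(axis=1)
counts["success_rate"] = counts["success"] / counts["total"]
print(counts)
```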
On average, every full test matrix included 2.7 failures. Not a single full test matrix was successful.
We can observe, again, a couple of systematic problems:

- `distributed.deploy.tests.test_local.test_close_twice` with ... (missing in the individual tests)
- `test_chaos_rechunk` (https://github.com/dask/distributed/issues/6641)

New update based on https://github.com/dask/distributed/commit/e1b9e20fde946194858165a8b91cb94703c715aa
using branch https://github.com/fjetter/distributed/tree/stress_ci
| (OS, Python) | failure | success | cancelled | total | success_rate |
|---|---|---|---|---|---|
| ('macos-latest', '3.10') | 0 | 10 | 0 | 10 | 1 |
| ('macos-latest', '3.8') | 10 | 0 | 0 | 10 | 0 |
| ('ubuntu-latest', '3.10') | 3 | 7 | 0 | 10 | 0.7 |
| ('ubuntu-latest', '3.8') | 4 | 6 | 0 | 10 | 0.6 |
| ('ubuntu-latest', '3.9') | 2 | 8 | 0 | 10 | 0.8 |
| ('windows-latest', '3.10') | 0 | 10 | 0 | 10 | 1 |
| ('windows-latest', '3.8') | 0 | 10 | 0 | 10 | 1 |
| ('windows-latest', '3.9') | 2 | 8 | 0 | 10 | 0.8 |
We're already performing much better, with the exception of macOS py3.8, where not a single run was successful. Below are a few detailed reports about the kinds of errors encountered.
This is a groupby on the truncated error message, used as a proxy for a fuzzy match:
| message_trunc | test | PR with possible fix |
|---|---|---|
| AssertionError: assert 3 == 1 | test_avoid_churn | |
| AssertionError: assert False | test_restart_waits_for_new_workers | |
| assert not b"Future excep | test_quiet_close_process[False] | https://github.com/dask/distributed/pull/6857 |
| asyncio.exceptions.TimeoutErro | test_reconnect, test_shutdown_localcluster, test_wait_for_scheduler | |
| failed on teardown with " | test_broken_worker, test_local_cluster_redundant_kwarg[True] | https://github.com/dask/distributed/pull/6865 or https://github.com/dask/distributed/pull/6863 |
| pytest.PytestUnraisableExcepti | test_local_client_warning, test_release_retry, test_client_cluster_synchronous, test_run_spec, test_dont_select_closed_worker | https://github.com/dask/distributed/pull/6865 or https://github.com/dask/distributed/pull/6863 |
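The fuzzy grouping behind the table above can be sketched as follows: truncate each error message to a fixed-length prefix (the 30-character cut matches the `message_trunc` values shown) and group tests by that prefix. The input records here are illustrative, not the actual report schema.

```python
import pandas as pd

# Made-up failure records: one row per failed test with its error message.
failures = pd.DataFrame(
    {
        "test": ["test_reconnect", "test_wait_for_scheduler", "test_avoid_churn"],
        "message": [
            "asyncio.exceptions.TimeoutError: ...",
            "asyncio.exceptions.TimeoutError: ...",
            "AssertionError: assert 3 == 1",
        ],
    }
)

# Truncated prefix as a cheap proxy for fuzzy-matching similar errors.
failures["message_trunc"] = failures["message"].str.slice(0, 30)
summary = failures.groupby("message_trunc")["test"].unique()
print(summary)
```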
The timeout errors in row three appear to throw a couple of `distributed.core.AsyncTaskGroupClosedError: Cannot schedule a new coroutine function as the group is already closed.` errors.
With a couple of recent merges in, yesterday I triggered another "CI stress test" that runs our suite a couple of times in a row (this time 10).
see https://github.com/fjetter/distributed/tree/stress_ci which is based on https://github.com/dask/distributed/commit/dc019ed2398411549cdf738be4327a601ec7dfca with https://github.com/fjetter/distributed/commit/68689f0c9c41f0b1faf91888eea8d0be7745338a on top
The results of this test run can be seen at https://github.com/fjetter/distributed/runs/7029246894?check_suite_focus=true
Summary
We had 80 total jobs spread across the different OSs and Python versions, of which 32 failed.
If we look at an entire test run, i.e. a full test matrix for a given run number, not a single one would have been successful.
Looking at the kinds of test failures, we see that three jobs on Windows failed due to a GH Actions test timeout of 120s, and two test runs were cancelled by GitHub without further information, also on Windows. The timed-out test runs do not have anything obvious in common. In fact, one of the three timed-out runs appears to have finished running the pytest suite but still timed out.
A modified test report is available here: https://gistpreview.github.io/?ecc2cdddf651df9ee0c7966e210c9093