fjetter opened this issue 2 years ago
Does this test include #6591?
I see that you ran on fjetter/distributed. I have a feeling that whenever I run on crusaderky/distributed I get a lot more random failures than on dask/distributed - likely because, if you're in the free tier, you have a higher chance of getting older/cheaper hardware?
I strongly suspect a lot of failures may be related to https://github.com/dask/distributed/pull/6271. How many of the failed tests were using nannies? How many were using more than 2 nannies? CI hosts mount 2 CPUs each.
I think we could figure out a straightforward way to mass-skip all tests decorated with `@gen_cluster(Worker=Nanny)` and see how many failures remain.
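One way to sketch such a mass-skip is a `conftest.py` collection hook. This is a hypothetical sketch: how the worker class would be discovered is an assumption for illustration (here we pretend the decorator stores its kwargs on the test function as `_cluster_kwargs`; `gen_cluster`'s real internals may differ).

```python
# Hypothetical conftest.py sketch: mass-skip tests that request Nanny workers.
# The `_cluster_kwargs` attribute is assumed for illustration only.
import pytest


def uses_nanny(cluster_kwargs):
    """True if the (assumed) gen_cluster kwargs request Nanny workers."""
    worker = cluster_kwargs.get("Worker")
    return getattr(worker, "__name__", None) == "Nanny"


def pytest_collection_modifyitems(config, items):
    marker = pytest.mark.skip(reason="uses Nanny workers; skipped in stress run")
    for item in items:
        if uses_nanny(getattr(item.function, "_cluster_kwargs", {})):
            item.add_marker(marker)
```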
> Does this test include https://github.com/dask/distributed/pull/6591?

yes
> likely because if you're in the free tier you have a higher chance of getting older/cheaper hardware?

the dask project is also on the free tier, isn't it?
Most of the windows tests are failing because of a disk permission problem during cleanup. @graingert suggested that using the pytest fixtures instead of tempfile would help with this.
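A hedged sketch of the suggested change (the test name and file contents are made up, not from the actual suite): instead of tearing down a `tempfile.TemporaryDirectory` inside the test, which can race with still-open file handles on Windows and raise `PermissionError`, the `tmp_path` fixture lets pytest own the directory lifecycle and clean up lazily.

```python
# Before (illustrative; prone to PermissionError on Windows teardown):
#
# def test_worker_spills_to_disk():
#     with tempfile.TemporaryDirectory() as local_dir:
#         ...  # cleanup here can fail while a handle is still held
#
# After: pytest provides tmp_path and removes it lazily after the session.
def test_worker_spills_to_disk(tmp_path):
    local_dir = tmp_path / "dask-scratch"
    local_dir.mkdir()
    (local_dir / "spill.bin").write_bytes(b"\x00" * 16)
    assert (local_dir / "spill.bin").stat().st_size == 16
```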
New update based on a8eb3b23b8fe91f52758db155e7151e3d516cbdc
using branch https://github.com/fjetter/distributed/tree/stress_ci
again running 10 iterations, see https://github.com/fjetter/distributed/commit/3fe45b401b9d31c50cb646f06421e067a6657fab
Test reports and summary were generated with the code available here: https://github.com/fjetter/distributed/commit/6be77905ce4e880e084b8095657a026fda99d409
The test report is again available at https://gistpreview.github.io/?ecc2cdddf651df9ee0c7966e210c9093/a8eb3b23b8fe91f52758db155e7151e3d516cbdc.html
| (OS, Python) | cancelled | failure | success | total | success_rate |
|---|---|---|---|---|---|
| ('macos-latest', '3.10') | 0 | 7 | 3 | 10 | 0.3 |
| ('macos-latest', '3.8') | 0 | 10 | 0 | 10 | 0 |
| ('ubuntu-latest', '3.10') | 0 | 0 | 10 | 10 | 1 |
| ('ubuntu-latest', '3.8') | 0 | 2 | 8 | 10 | 0.8 |
| ('ubuntu-latest', '3.9') | 0 | 0 | 10 | 10 | 1 |
| ('windows-latest', '3.10') | 0 | 5 | 5 | 10 | 0.5 |
| ('windows-latest', '3.8') | 1 | 1 | 8 | 10 | 0.8 |
| ('windows-latest', '3.9') | 3 | 2 | 5 | 10 | 0.5 |
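A minimal sketch of how such a per-(OS, Python) summary can be derived from a flat list of job outcomes with pandas; the job records below are made up for illustration, not the actual CI data.

```python
import pandas as pd

# Made-up job outcomes: one row per CI job.
jobs = pd.DataFrame(
    {
        "os": ["ubuntu-latest"] * 3 + ["windows-latest"] * 3,
        "python": ["3.10"] * 6,
        "status": ["success", "success", "failure", "success", "cancelled", "failure"],
    }
)

# Count statuses per (os, python) cell and derive totals and success rate.
counts = (
    jobs.groupby(["os", "python"])["status"]
    .value_counts()
    .unstack(fill_value=0)
    .reindex(columns=["cancelled", "failure", "success"], fill_value=0)
)
counts["total"] = counts.sum(axis=1)
counts["success_rate"] = counts["success"] / counts["total"]
print(counts)
```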
On average, every full test matrix included 2.7 failures. Not a single full test matrix was successful.
We can observe, again, a couple of systematic problems:

- `distributed.deploy.tests.test_local.test_close_twice` with ... (missing in the individual tests)
- `test_chaos_rechunk` (https://github.com/dask/distributed/issues/6641)

New update based on https://github.com/dask/distributed/commit/e1b9e20fde946194858165a8b91cb94703c715aa
using branch https://github.com/fjetter/distributed/tree/stress_ci
| (OS, Python) | failure | success | cancelled | total | success_rate |
|---|---|---|---|---|---|
| ('macos-latest', '3.10') | 0 | 10 | 0 | 10 | 1 |
| ('macos-latest', '3.8') | 10 | 0 | 0 | 10 | 0 |
| ('ubuntu-latest', '3.10') | 3 | 7 | 0 | 10 | 0.7 |
| ('ubuntu-latest', '3.8') | 4 | 6 | 0 | 10 | 0.6 |
| ('ubuntu-latest', '3.9') | 2 | 8 | 0 | 10 | 0.8 |
| ('windows-latest', '3.10') | 0 | 10 | 0 | 10 | 1 |
| ('windows-latest', '3.8') | 0 | 10 | 0 | 10 | 1 |
| ('windows-latest', '3.9') | 2 | 8 | 0 | 10 | 0.8 |
We're already performing much better, with the exception of macOS py3.8, where not a single run was successful. Below are a few detailed reports about the kinds of errors encountered.
This is a groupby on the truncated error message, used as a proxy for a fuzzy match:
| message_trunc | test | PR with possible fix |
|---|---|---|
| AssertionError: assert 3 == 1 | test_avoid_churn | |
| AssertionError: assert False | test_restart_waits_for_new_workers | |
| assert not b"Future excep | test_quiet_close_process[False] | https://github.com/dask/distributed/pull/6857 |
| asyncio.exceptions.TimeoutErro | test_reconnect, test_shutdown_localcluster, test_wait_for_scheduler | |
| failed on teardown with " | test_broken_worker, test_local_cluster_redundant_kwarg[True] | https://github.com/dask/distributed/pull/6865 or https://github.com/dask/distributed/pull/6863 |
| pytest.PytestUnraisableExcepti | test_local_client_warning, test_release_retry, test_client_cluster_synchronous, test_run_spec, test_dont_select_closed_worker | https://github.com/dask/distributed/pull/6865 or https://github.com/dask/distributed/pull/6863 |
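The fuzzy grouping behind the table above can be sketched as follows: truncate each error message to a fixed-length prefix (the 30-character cut matches the `message_trunc` values shown) and group tests by that prefix. The input records here are illustrative, not the actual report schema.

```python
import pandas as pd

# Made-up failure records: one row per failed test with its error message.
failures = pd.DataFrame(
    {
        "test": ["test_reconnect", "test_wait_for_scheduler", "test_avoid_churn"],
        "message": [
            "asyncio.exceptions.TimeoutError: ...",
            "asyncio.exceptions.TimeoutError: ...",
            "AssertionError: assert 3 == 1",
        ],
    }
)

# Truncated prefix as a cheap proxy for fuzzy-matching similar errors.
failures["message_trunc"] = failures["message"].str.slice(0, 30)
summary = failures.groupby("message_trunc")["test"].unique()
print(summary)
```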
The timeout errors in row three appear to throw a couple of `distributed.core.AsyncTaskGroupClosedError: Cannot schedule a new coroutine function as the group is already closed.` errors.
With a couple of recent merges in, yesterday I triggered another "CI stress test" that runs our suite a couple of times in a row (this time 10).
see https://github.com/fjetter/distributed/tree/stress_ci which is based on https://github.com/dask/distributed/commit/dc019ed2398411549cdf738be4327a601ec7dfca with https://github.com/fjetter/distributed/commit/68689f0c9c41f0b1faf91888eea8d0be7745338a on top
The results of this test run can be seen at https://github.com/fjetter/distributed/runs/7029246894?check_suite_focus=true
Summary
We had 80 total jobs spread across the different OSs and Python versions, of which 32 failed.
If we look at an entire test run, i.e. a full test matrix for a given run number, not a single one would have been successful.
Looking at the kinds of test failures, we see that three jobs on Windows failed due to a GH Actions test timeout of 120s, and two test runs were cancelled by GitHub without further information, also on Windows. The timed-out test runs do not have anything obvious in common. In fact, one of the three timed-out runs appears to have finished running the pytest suite but still timed out.
A modified test report is available here: https://gistpreview.github.io/?ecc2cdddf651df9ee0c7966e210c9093