dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Skip collecting coverage for CLI tests #8930

Closed fjetter closed 1 week ago

fjetter commented 2 weeks ago

We're frequently seeing weird deadlocks / stuck tests when running CLI tests. I've tuned the timeouts a couple of times, and in some cases this did help.

I've been looking into this again and found an interesting traceback in https://github.com/dask/distributed/actions/runs/11736945020/job/32696964203:

E           subprocess.TimeoutExpired: Command '['/home/runner/miniconda3/envs/dask-distributed/bin/dask', 'worker', 'tcp://127.0.0.1:34103', '--nworkers=2', '--no-nanny']' timed out after 10 seconds

../../../miniconda3/envs/dask-distributed/lib/python3.10/subprocess.py:1198: TimeoutExpired
----------------------------- Captured stdout call -----------------------------
b'2024-11-08 06:03:36,069 - distributed.dask_worker - ERROR - Failed to launch worker.  You cannot use the --no-nanny argument when n_workers > 1.\n'
------ stdout: returncode -9, ['/home/runner/miniconda3/envs/dask-distributed/bin/dask', 'worker', 'tcp://127.0.0.1:34103', '--nworkers=2', '--no-nanny'] ------
Exception ignored in atexit callback: <function _python_exit at 0x7fb6909fde10>
Traceback (most recent call last):
  File "/home/runner/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/coverage/collector.py", line 252, in lock_data
    self.data_lock.acquire()
KeyboardInterrupt: 

This suggests that coverage has a Python atexit hook that is locking up for some reason, possibly because it cannot write the coverage data out quickly enough.
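To make the suspected failure mode concrete, here is a minimal sketch of a child process whose exit hook blocks on a lock, so the parent's `subprocess` call times out and kills it. This is not coverage's actual code: the `CHILD` script and the `run_child` helper are hypothetical stand-ins that only reproduce the general shape of "the process finished its work but hangs during interpreter shutdown".

```python
import subprocess
import sys
import textwrap

# Hypothetical child script: its atexit callback waits on a lock that is
# never released, so the interpreter hangs during shutdown.
CHILD = textwrap.dedent(
    """
    import atexit
    import threading

    lock = threading.Lock()
    lock.acquire()  # held and never released

    # Stand-in for an exit hook that needs a lock someone else holds.
    atexit.register(lock.acquire)
    print("main finished", flush=True)
    """
)


def run_child(timeout=2):
    """Run the child and return the TimeoutExpired exception, or None if it exited."""
    try:
        subprocess.run(
            [sys.executable, "-c", CHILD],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before re-raising the timeout.
        return exc
    return None
```

The child reaches the end of its main code and only then gets stuck, which would be consistent with a CLI test where the command logs its error promptly but the test still fails on the timeout.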

I hope that a quick fix is to just not collect coverage for the CLI tests.
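One way this could be done, sketched under assumptions rather than taken from this PR: launch the CLI subprocesses with coverage-related environment variables stripped. The helpers `env_without_coverage` and `run_cli` are hypothetical names. The sketch assumes subprocess measurement is triggered through environment variables, which as far as I know is the case for `COVERAGE_PROCESS_START` (read by `coverage.process_startup()`) and the `COV_CORE_*` variables set by older pytest-cov releases.

```python
import os
import subprocess
import sys

# Assumption: coverage in child processes is enabled via these variables.
# COVERAGE_* covers COVERAGE_PROCESS_START and related settings;
# COV_CORE_* is what older pytest-cov versions export for subprocesses.
COVERAGE_ENV_PREFIXES = ("COVERAGE_", "COV_CORE_")


def env_without_coverage(environ=None):
    """Return a copy of the environment with coverage-related variables removed."""
    source = os.environ if environ is None else environ
    return {
        key: value
        for key, value in source.items()
        if not key.startswith(COVERAGE_ENV_PREFIXES)
    }


def run_cli(args, timeout=10):
    """Run a command in a subprocess that should not collect coverage."""
    return subprocess.run(
        args,
        env=env_without_coverage(),
        capture_output=True,
        timeout=timeout,
    )
```

With this approach the in-process tests still report coverage and only the spawned commands run unmeasured. The trade-off is that code reached solely through the CLI entry points no longer shows up in the coverage report.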

fjetter commented 2 weeks ago

Same thing here: https://github.com/dask/distributed/actions/runs/11728636755/job/32672527284

I so hope this is it... those failures have been driving me mad

github-actions[bot] commented 2 weeks ago

Unit Test Results

_See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests._

    25 files ±0, 25 suites ±0, 10h 17m 11s ⏱️ -38s
    4 129 tests ±0: 4 017 ✅ ±0, 110 💤 ±0, 1 ❌ -1, 1 🔥 +1
    47 681 runs +1: 45 559 ✅ +2, 2 120 💤 -1, 1 ❌ -1, 1 🔥 +1

For more details on these failures and errors, see this check.

Results for commit d9cbd6dc. ± Comparison against base commit 26b10617.

fjetter commented 1 week ago

Well, none of the CLI tests crashed.