dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

`test_AllProgress_lost_key` flaky #4915

Open · jrbourbeau opened this issue 3 years ago

jrbourbeau commented 3 years ago

This has been showing up in several CI builds lately (for example https://github.com/dask/distributed/runs/2814505063).

Traceback: ```python 2021-06-13T17:50:12.3994425Z =================================== FAILURES =================================== 2021-06-13T17:50:12.3995912Z __________________________ test_AllProgress_lost_key ___________________________ 2021-06-13T17:50:12.3996481Z 2021-06-13T17:50:12.3997052Z def test_func(): 2021-06-13T17:50:12.3998014Z result = None 2021-06-13T17:50:12.3998648Z workers = [] 2021-06-13T17:50:12.4000042Z with clean(timeout=active_rpc_timeout, **clean_kwargs) as loop: 2021-06-13T17:50:12.4001589Z 2021-06-13T17:50:12.4002225Z async def coro(): 2021-06-13T17:50:12.4003000Z with dask.config.set(config): 2021-06-13T17:50:12.4003897Z s = False 2021-06-13T17:50:12.4004505Z for _ in range(60): 2021-06-13T17:50:12.4005265Z try: 2021-06-13T17:50:12.4005893Z s, ws = await start_cluster( 2021-06-13T17:50:12.4006594Z nthreads, 2021-06-13T17:50:12.4008058Z scheduler, 2021-06-13T17:50:12.4008686Z loop, 2021-06-13T17:50:12.4009563Z security=security, 2021-06-13T17:50:12.4010208Z Worker=Worker, 2021-06-13T17:50:12.4010913Z scheduler_kwargs=scheduler_kwargs, 2021-06-13T17:50:12.4011916Z worker_kwargs=worker_kwargs, 2021-06-13T17:50:12.4012713Z ) 2021-06-13T17:50:12.4013471Z except Exception as e: 2021-06-13T17:50:12.4014154Z logger.error( 2021-06-13T17:50:12.4014868Z "Failed to start gen_cluster: " 2021-06-13T17:50:12.4015594Z f"{e.__class__.__name__}: {e}; retrying", 2021-06-13T17:50:12.4016275Z exc_info=True, 2021-06-13T17:50:12.4016809Z ) 2021-06-13T17:50:12.4017416Z await asyncio.sleep(1) 2021-06-13T17:50:12.4018030Z else: 2021-06-13T17:50:12.4018584Z workers[:] = ws 2021-06-13T17:50:12.4019185Z args = [s] + workers 2021-06-13T17:50:12.4019739Z break 2021-06-13T17:50:12.4020435Z if s is False: 2021-06-13T17:50:12.4021139Z raise Exception("Could not start cluster") 2021-06-13T17:50:12.4021947Z if client: 2021-06-13T17:50:12.4022511Z c = await Client( 2021-06-13T17:50:12.4023072Z s.address, 2021-06-13T17:50:12.4023821Z loop=loop, 2021-06-13T17:50:12.4024447Z security=security, 2021-06-13T17:50:12.4025096Z asynchronous=True, 2021-06-13T17:50:12.4025737Z **client_kwargs, 2021-06-13T17:50:12.4026274Z ) 2021-06-13T17:50:12.4026801Z args = [c] + args 2021-06-13T17:50:12.4027521Z try: 2021-06-13T17:50:12.4028102Z future = func(*args) 2021-06-13T17:50:12.4028698Z if timeout: 2021-06-13T17:50:12.4029407Z future = asyncio.wait_for(future, timeout) 2021-06-13T17:50:12.4030155Z result = await future 2021-06-13T17:50:12.4030758Z if s.validate: 2021-06-13T17:50:12.4031458Z s.validate_state() 2021-06-13T17:50:12.4036496Z finally: 2021-06-13T17:50:12.4037089Z if client and c.status not in ("closing", "closed"): 2021-06-13T17:50:12.4038195Z await c._close(fast=s.status == Status.closed) 2021-06-13T17:50:12.4038927Z await end_cluster(s, workers) 2021-06-13T17:50:12.4039597Z await asyncio.wait_for(cleanup_global_workers(), 1) 2021-06-13T17:50:12.4040148Z 2021-06-13T17:50:12.4040798Z try: 2021-06-13T17:50:12.4041284Z c = await default_client() 2021-06-13T17:50:12.4041817Z except ValueError: 2021-06-13T17:50:12.4042447Z pass 2021-06-13T17:50:12.4042823Z else: 2021-06-13T17:50:12.4043280Z await c._close(fast=True) 2021-06-13T17:50:12.4043681Z 2021-06-13T17:50:12.4044075Z def get_unclosed(): 2021-06-13T17:50:12.4044666Z return [c for c in Comm._instances if not c.closed()] + [ 2021-06-13T17:50:12.4045196Z c 2021-06-13T17:50:12.4045898Z for c in _global_clients.values() 2021-06-13T17:50:12.4046473Z if c.status != "closed" 2021-06-13T17:50:12.4046909Z ] 2021-06-13T17:50:12.4047238Z 
2021-06-13T17:50:12.4047868Z try: 2021-06-13T17:50:12.4048300Z start = time() 2021-06-13T17:50:12.4048922Z while time() < start + 60: 2021-06-13T17:50:12.4050039Z gc.collect() 2021-06-13T17:50:12.4050573Z if not get_unclosed(): 2021-06-13T17:50:12.4051026Z break 2021-06-13T17:50:12.4051538Z await asyncio.sleep(0.05) 2021-06-13T17:50:12.4052039Z else: 2021-06-13T17:50:12.4052485Z if allow_unclosed: 2021-06-13T17:50:12.4053105Z print(f"Unclosed Comms: {get_unclosed()}") 2021-06-13T17:50:12.4053636Z else: 2021-06-13T17:50:12.4054261Z raise RuntimeError("Unclosed Comms", get_unclosed()) 2021-06-13T17:50:12.4054865Z finally: 2021-06-13T17:50:12.4055385Z Comm._instances.clear() 2021-06-13T17:50:12.4055967Z _global_clients.clear() 2021-06-13T17:50:12.4056427Z 2021-06-13T17:50:12.4056805Z return result 2021-06-13T17:50:12.4057201Z 2021-06-13T17:50:12.4057612Z > result = loop.run_sync( 2021-06-13T17:50:12.4058223Z coro, timeout=timeout * 2 if timeout else timeout 2021-06-13T17:50:12.4058750Z ) 2021-06-13T17:50:12.4058975Z 2021-06-13T17:50:12.4059423Z distributed/utils_test.py:966: 2021-06-13T17:50:12.4059938Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2021-06-13T17:50:12.4061606Z /usr/share/miniconda3/envs/dask-distributed/lib/python3.8/contextlib.py:120: in __exit__ 2021-06-13T17:50:12.4062376Z next(self.gen) 2021-06-13T17:50:12.4062882Z distributed/utils_test.py:1557: in clean 2021-06-13T17:50:12.4063500Z del thread_state.on_event_loop_thread 2021-06-13T17:50:12.4064566Z /usr/share/miniconda3/envs/dask-distributed/lib/python3.8/contextlib.py:120: in __exit__ 2021-06-13T17:50:12.4065324Z next(self.gen) 2021-06-13T17:50:12.4066722Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2021-06-13T17:50:12.4067066Z 2021-06-13T17:50:12.4067470Z @contextmanager 2021-06-13T17:50:12.4067950Z def check_instances(): 2021-06-13T17:50:12.4068503Z Client._instances.clear() 2021-06-13T17:50:12.4069092Z Worker._instances.clear() 2021-06-13T17:50:12.4069864Z Scheduler._instances.clear() 2021-06-13T17:50:12.4070502Z SpecCluster._instances.clear() 2021-06-13T17:50:12.4071174Z Worker._initialized_clients.clear() 2021-06-13T17:50:12.4072034Z # assert all(n.status == "closed" for n in Nanny._instances), { 2021-06-13T17:50:12.4072726Z # n: n.status for n in Nanny._instances 2021-06-13T17:50:12.4073179Z # } 2021-06-13T17:50:12.4073610Z Nanny._instances.clear() 2021-06-13T17:50:12.4074164Z _global_clients.clear() 2021-06-13T17:50:12.4074690Z Comm._instances.clear() 2021-06-13T17:50:12.4075120Z 2021-06-13T17:50:12.4075435Z yield 2021-06-13T17:50:12.4075772Z 2021-06-13T17:50:12.4076108Z start = time() 2021-06-13T17:50:12.4076723Z while set(_global_clients): 2021-06-13T17:50:12.4077139Z sleep(0.1) 2021-06-13T17:50:12.4077550Z assert time() < start + 10 2021-06-13T17:50:12.4077911Z 2021-06-13T17:50:12.4078313Z _global_clients.clear() 2021-06-13T17:50:12.4078706Z 2021-06-13T17:50:12.4079109Z for w in Worker._instances: 2021-06-13T17:50:12.4079708Z with suppress(RuntimeError): # closed IOLoop 2021-06-13T17:50:12.4081371Z w.loop.add_callback(w.close, report=False, executor_wait=False) 2021-06-13T17:50:12.4090143Z if w.status == Status.running: 2021-06-13T17:50:12.4090730Z w.loop.add_callback(w.close) 2021-06-13T17:50:12.4091313Z Worker._instances.clear() 2021-06-13T17:50:12.4093411Z 2021-06-13T17:50:12.4093778Z start = time() 2021-06-13T17:50:12.4094383Z while any(c.status != "closed" for c in Worker._initialized_clients): 2021-06-13T17:50:12.4096722Z sleep(0.1) 
2021-06-13T17:50:12.4097128Z assert time() < start + 10 2021-06-13T17:50:12.4097688Z Worker._initialized_clients.clear() 2021-06-13T17:50:12.4099499Z 2021-06-13T17:50:12.4099838Z for i in range(5): 2021-06-13T17:50:12.4100344Z if all(c.closed() for c in Comm._instances): 2021-06-13T17:50:12.4100811Z break 2021-06-13T17:50:12.4103249Z else: 2021-06-13T17:50:12.4103799Z sleep(0.1) 2021-06-13T17:50:12.4104180Z else: 2021-06-13T17:50:12.4104673Z L = [c for c in Comm._instances if not c.closed()] 2021-06-13T17:50:12.4106740Z Comm._instances.clear() 2021-06-13T17:50:12.4107252Z print("Unclosed Comms", L) 2021-06-13T17:50:12.4107984Z # raise ValueError("Unclosed Comms", L) 2021-06-13T17:50:12.4109982Z 2021-06-13T17:50:12.4110349Z > assert all( 2021-06-13T17:50:12.4111055Z n.status == Status.closed or n.status == Status.init for n in Nanny._instances 2021-06-13T17:50:12.4113249Z ), {n: n.status for n in Nanny._instances} 2021-06-13T17:50:12.4116294Z E AssertionError: {: , : , : } 2021-06-13T17:50:12.4118819Z E assert False 2021-06-13T17:50:12.4119740Z E + where False = all(. at 0x7f9f62c9a4a0>) 2021-06-13T17:50:12.4120826Z 2021-06-13T17:50:12.4122764Z distributed/utils_test.py:1525: AssertionError ```
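For anyone trying to reproduce this locally, a hedged sketch of a retry loop (the test module path below is an assumption, not taken from the CI logs):

```python
# Hedged sketch: rerun the suspected flaky test in fresh processes until it
# fails. The test module path is an assumption; adjust if the test lives
# elsewhere in the repo.
import subprocess
import sys

TEST = "distributed/diagnostics/tests/test_progress.py::test_AllProgress_lost_key"

for attempt in range(20):
    proc = subprocess.run([sys.executable, "-m", "pytest", "-x", "-q", TEST])
    if proc.returncode != 0:
        print(f"failed on attempt {attempt + 1}")
        sys.exit(proc.returncode)
print("no failure in 20 attempts")
```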
gjoseph92 commented 3 years ago

Got this too: https://github.com/dask/distributed/pull/4937/checks?check_run_id=2888140312#step:10:2213

Traceback + output ``` =================================== FAILURES =================================== __________________________ test_AllProgress_lost_key ___________________________ def test_func(): result = None workers = [] with clean(timeout=active_rpc_timeout, **clean_kwargs) as loop: async def coro(): with dask.config.set(config): s = False for _ in range(60): try: s, ws = await start_cluster( nthreads, scheduler, loop, security=security, Worker=Worker, scheduler_kwargs=scheduler_kwargs, worker_kwargs=worker_kwargs, ) except Exception as e: logger.error( "Failed to start gen_cluster: " f"{e.__class__.__name__}: {e}; retrying", exc_info=True, ) await asyncio.sleep(1) else: workers[:] = ws args = [s] + workers break if s is False: raise Exception("Could not start cluster") if client: c = await Client( s.address, loop=loop, security=security, asynchronous=True, **client_kwargs, ) args = [c] + args try: future = func(*args) if timeout: future = asyncio.wait_for(future, timeout) result = await future if s.validate: s.validate_state() finally: if client and c.status not in ("closing", "closed"): await c._close(fast=s.status == Status.closed) await end_cluster(s, workers) await asyncio.wait_for(cleanup_global_workers(), 1) try: c = await default_client() except ValueError: pass else: await c._close(fast=True) def get_unclosed(): return [c for c in Comm._instances if not c.closed()] + [ c for c in _global_clients.values() if c.status != "closed" ] try: start = time() while time() < start + 60: gc.collect() if not get_unclosed(): break await asyncio.sleep(0.05) else: if allow_unclosed: print(f"Unclosed Comms: {get_unclosed()}") else: raise RuntimeError("Unclosed Comms", get_unclosed()) finally: Comm._instances.clear() _global_clients.clear() return result > result = loop.run_sync( coro, timeout=timeout * 2 if timeout else timeout ) distributed/utils_test.py:966: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ self = func = ._..test_func..coro at 0x7ffbc0091ee0> timeout = 60 def run_sync(self, func: Callable, timeout: Optional[float] = None) -> Any: """Starts the `IOLoop`, runs the given function, and stops the loop. The function must return either an awaitable object or ``None``. If the function returns an awaitable object, the `IOLoop` will run until the awaitable is resolved (and `run_sync()` will return the awaitable's result). If it raises an exception, the `IOLoop` will stop and the exception will be re-raised to the caller. The keyword-only argument ``timeout`` may be used to set a maximum duration for the function. If the timeout expires, a `tornado.util.TimeoutError` is raised. This method is useful to allow asynchronous calls in a ``main()`` function:: async def main(): # do stuff... if __name__ == '__main__': IOLoop.current().run_sync(main) .. versionchanged:: 4.3 Returning a non-``None``, non-awaitable value is now an error. .. versionchanged:: 5.0 If a timeout occurs, the ``func`` coroutine will be cancelled. 
""" future_cell = [None] # type: List[Optional[Future]] def run() -> None: try: result = func() if result is not None: from tornado.gen import convert_yielded result = convert_yielded(result) except Exception: fut = Future() # type: Future[Any] future_cell[0] = fut future_set_exc_info(fut, sys.exc_info()) else: if is_future(result): future_cell[0] = result else: fut = Future() future_cell[0] = fut fut.set_result(result) assert future_cell[0] is not None self.add_future(future_cell[0], lambda future: self.stop()) self.add_callback(run) if timeout is not None: def timeout_callback() -> None: # If we can cancel the future, do so and wait on it. If not, # Just stop the loop and return with the task still pending. # (If we neither cancel nor wait for the task, a warning # will be logged). assert future_cell[0] is not None if not future_cell[0].cancel(): self.stop() timeout_handle = self.add_timeout(self.time() + timeout, timeout_callback) self.start() if timeout is not None: self.remove_timeout(timeout_handle) assert future_cell[0] is not None if future_cell[0].cancelled() or not future_cell[0].done(): > raise TimeoutError("Operation timed out after %s seconds" % timeout) E tornado.util.TimeoutError: Operation timed out after 60 seconds /usr/share/miniconda3/envs/dask-distributed/lib/python3.8/site-packages/tornado/ioloop.py:529: TimeoutError ----------------------------- Captured stderr call ----------------------------- Traceback (most recent call last): File "", line 1, in File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main exitcode = _main(fd, parent_sentinel) File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/spawn.py", line 126, in _main self = reduction.pickle.load(from_parent) File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/synchronize.py", line 110, in __setstate__ self._semlock = _multiprocessing.SemLock._rebuild(*state) FileNotFoundError: [Errno 2] No such file or directory Traceback (most recent call last): File "", line 1, in File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main exitcode = _main(fd, parent_sentinel) File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/spawn.py", line 126, in _main self = reduction.pickle.load(from_parent) File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/synchronize.py", line 110, in __setstate__ self._semlock = _multiprocessing.SemLock._rebuild(*state) FileNotFoundError: [Errno 2] No such file or directory distributed.nanny - WARNING - Restarting worker distributed.nanny - WARNING - Restarting worker distributed.worker - INFO - Start worker at: tcp://127.0.0.1:42907 distributed.worker - INFO - Listening to: tcp://127.0.0.1:42907 distributed.worker - INFO - dashboard at: 127.0.0.1:36249 distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:33233 distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Threads: 1 distributed.worker - INFO - Memory: 6.79 GiB distributed.worker - INFO - Local Directory: /home/runner/work/distributed/distributed/dask-worker-space/worker-qzucux6g distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Registered to: tcp://127.0.0.1:33233 distributed.worker - INFO - ------------------------------------------------- distributed.core - INFO - Starting established connection 
distributed.worker - INFO - Start worker at: tcp://127.0.0.1:35647 distributed.worker - INFO - Listening to: tcp://127.0.0.1:35647 distributed.worker - INFO - dashboard at: 127.0.0.1:45463 distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:33233 distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Threads: 2 distributed.worker - INFO - Memory: 6.79 GiB distributed.worker - INFO - Local Directory: /home/runner/work/distributed/distributed/dask-worker-space/worker-24jmyssw distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Registered to: tcp://127.0.0.1:33233 distributed.worker - INFO - ------------------------------------------------- distributed.core - INFO - Starting established connection distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:35647 distributed.utils_test - ERROR - Failed to start gen_cluster: TimeoutError: Nanny failed to start in 10 seconds; retrying Traceback (most recent call last): File "/home/runner/work/distributed/distributed/distributed/core.py", line 272, in _ await asyncio.wait_for(self.start(), timeout=timeout) File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/asyncio/tasks.py", line 501, in wait_for raise exceptions.TimeoutError() asyncio.exceptions.TimeoutError During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/runner/work/distributed/distributed/distributed/utils_test.py", line 890, in coro s, ws = await start_cluster( File "/home/runner/work/distributed/distributed/distributed/utils_test.py", line 814, in start_cluster await asyncio.gather(*workers) File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/asyncio/tasks.py", line 695, in _wrap_awaitable return (yield from awaitable.__await__()) File "/home/runner/work/distributed/distributed/distributed/core.py", line 276, in _ raise TimeoutError( asyncio.exceptions.TimeoutError: Nanny failed to start in 10 seconds distributed.worker - INFO - Start worker at: tcp://127.0.0.1:42805 distributed.worker - INFO - Listening to: tcp://127.0.0.1:42805 distributed.worker - INFO - dashboard at: 127.0.0.1:41279 distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:35561 distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Threads: 1 distributed.worker - INFO - Memory: 6.79 GiB distributed.worker - INFO - Local Directory: /home/runner/work/distributed/distributed/dask-worker-space/worker-au2uvenv distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Registered to: tcp://127.0.0.1:35561 distributed.worker - INFO - ------------------------------------------------- distributed.core - INFO - Starting established connection distributed.worker - INFO - Start worker at: tcp://127.0.0.1:37355 distributed.worker - INFO - Listening to: tcp://127.0.0.1:37355 distributed.worker - INFO - dashboard at: 127.0.0.1:33053 distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:35561 distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Threads: 2 distributed.worker - INFO - Memory: 6.79 GiB distributed.worker - INFO - Local Directory: /home/runner/work/distributed/distributed/dask-worker-space/worker-ficp2l0q distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Registered to: 
tcp://127.0.0.1:35561 distributed.worker - INFO - ------------------------------------------------- distributed.core - INFO - Starting established connection distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:42805 distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:37355 distributed.utils_perf - WARNING - full garbage collections took 34% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 37% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 37% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 38% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 39% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 39% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 40% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 41% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 41% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 42% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 42% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 43% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 53% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 59% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 83% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 84% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 85% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 86% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 87% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 88% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 89% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 90% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 91% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 92% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 93% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 94% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 94% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 94% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 95% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - 
full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage 
collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 
99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 99% CPU time recently (threshold: 10%) distributed.worker - INFO - Connection to scheduler broken. Reconnecting... ```

The traceback

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory

seemingly leading to a Nanny failure and timeout (and the subsequent 99% of time spent in GC?!) is interesting.
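For context, here is a minimal sketch (an illustration of the code path, not a reproducer of this failure) of the multiprocessing machinery that traceback goes through: under the spawn start method a `Lock` handed to a child process travels as pickled `SemLock` state, and the child reattaches to the parent's named semaphore in `SemLock.__setstate__` via `_multiprocessing.SemLock._rebuild`. If that semaphore is already gone when the child unpickles it, this is exactly where a `FileNotFoundError` would surface.

```python
# Minimal illustration (not a reproducer) of the code path in the traceback
# above: with the "spawn" start method, a multiprocessing Lock passed to a
# child is pickled as SemLock state; the child unpickles it in
# multiprocessing.spawn._main, and SemLock.__setstate__ reattaches to the
# parent's named semaphore via _multiprocessing.SemLock._rebuild(*state).
# If that named semaphore no longer exists when the child starts (one
# plausible cause here: teardown racing with Nanny process startup),
# _rebuild raises FileNotFoundError, the child never comes up, and the
# Nanny start would time out.
import multiprocessing as mp


def use_lock(lock):
    with lock:
        print("child reattached to the semaphore and acquired the lock")


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    lock = ctx.Lock()  # on POSIX with spawn, backed by a named semaphore
    p = ctx.Process(target=use_lock, args=(lock,))
    p.start()          # child unpickles `lock` -> SemLock._rebuild
    p.join()
```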

fjetter commented 3 years ago

I believe I've seen the GC message in some other "flaky" tests as well.

fjetter commented 3 years ago

What's also interesting is that we only see one error log of `Failed to start gen_cluster:`, although this section should be retried 60 times. Considering that the nanny times out after about 10s, I would've expected to see at least two cluster start failures (the `gen_cluster` timeout is 30s), so the GC might actually be blocking here?
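Back-of-the-envelope version of that expectation (the numbers are the approximations mentioned above, not measurements):

```python
# Rough timeline arithmetic behind the comment above: each failed attempt
# costs roughly the ~10s Nanny start timeout plus the 1s sleep in the
# retry loop, so a 30s gen_cluster budget should fit 2-3 attempts.
# Seeing only a single "Failed to start gen_cluster" log therefore hints
# that the loop wasn't actually cycling, e.g. because GC blocked the
# event loop.
nanny_start_timeout = 10  # seconds, approximate
retry_sleep = 1           # seconds, from the retry loop in utils_test.py
gen_cluster_budget = 30   # seconds, default gen_cluster timeout

attempts = gen_cluster_budget // (nanny_start_timeout + retry_sleep)
print(attempts)  # -> 2
```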

Both linked failures also happen only on py3.8, but that may be a coincidence.