dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Restart can kill a worker #7312

Open fjetter opened 2 years ago

fjetter commented 2 years ago

When restarting a cluster of Nannies, we expect every worker to be restarted and to reconnect to the scheduler.

There appears to be a race condition in which the worker instead closes for good and tears down the Nanny with it.

This has been frequently observed in the coiled-runtime benchmarks, see https://github.com/coiled/coiled-runtime/issues/468

Some sample logs (I added line numbers so I can reference the logs below):

1. 2022-11-09 22:16:04,681 - distributed.nanny - INFO -         Start Nanny at: 'tls://10.0.6.215:45015'
2. 2022-11-09 22:16:05,304 - distributed.worker - INFO -       Start worker at:     tls://10.0.6.215:41071
3. 2022-11-09 22:16:05,305 - distributed.worker - INFO -          Listening to:     tls://10.0.6.215:41071
4. 2022-11-09 22:16:05,305 - distributed.worker - INFO -           Worker name: test_dataframe-86e8569e-worker-012275ba16
5. 2022-11-09 22:16:05,306 - distributed.worker - INFO -          dashboard at:            10.0.6.215:8787
6. 2022-11-09 22:16:05,306 - distributed.worker - INFO - Waiting to connect to:       tls://10.0.5.46:8786
7. 2022-11-09 22:16:05,306 - distributed.worker - INFO - -------------------------------------------------
8. 2022-11-09 22:16:05,306 - distributed.worker - INFO -               Threads:                          2
9. 2022-11-09 22:16:05,306 - distributed.worker - INFO -                Memory:                   7.16 GiB
10. 2022-11-09 22:16:05,306 - distributed.worker - INFO -       Local Directory: /scratch/dask-worker-space/worker-p5w95n2u
11. 2022-11-09 22:16:05,306 - distributed.worker - INFO - -------------------------------------------------
12. 2022-11-09 22:16:05,607 - distributed.worker - INFO - Starting Worker plugin coiled-aws-spot
13. 2022-11-09 22:16:05,607 - distributed.worker - INFO - Starting Worker plugin coiled-aws-env
14. 2022-11-09 22:16:05,607 - distributed.worker - INFO - coiled updating aws sts token environment variables
15. 2022-11-09 22:16:05,933 - distributed.worker - INFO - coiled cleared s3fs/boto session
16. 2022-11-09 22:16:05,933 - distributed.worker - INFO -         Registered to:       tls://10.0.5.46:8786
17. 2022-11-09 22:16:05,933 - distributed.worker - INFO - -------------------------------------------------
18. 2022-11-09 22:16:05,934 - distributed.core - INFO - Starting established connection to tls://10.0.5.46:8786
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  First Worker up and in Worker.handle_scheduler ^^^^^^^^^^^^^^^^^^^^^^^^^^
19. 2022-11-09 22:16:09,480 - distributed.dask_worker - INFO - Not all prometheus metrics available are exported. Digest-based metrics require crick to be installed
20. 2022-11-09 22:16:41,493 - distributed.utils_perf - INFO - full garbage collection released 55.36 MiB from 403 reference cycles (threshold: 9.54 MiB)
21. 2022-11-09 22:17:07,596 - distributed.nanny - INFO - Nanny asking worker to close. Reason: scheduler-restart
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  First Restart ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
22. 2022-11-09 22:17:07,597 - distributed.worker - INFO - Stopping worker at tls://10.0.6.215:41071. Reason: scheduler-restart
23. 2022-11-09 22:17:07,598 - distributed.core - INFO - Connection to tls://10.0.5.46:8786 has been closed.
24. 2022-11-09 22:17:07,601 - distributed.nanny - INFO - Worker closed
# ^^^^^^^^^^^^^^^^ First Worker Process closed triggering worker_on_exit callback to restart ^^^^^^^^^^^^^^
25. 2022-11-09 22:17:07,857 - distributed.nanny - WARNING - Restarting worker
26. 2022-11-09 22:17:08,549 - distributed.worker - INFO -       Start worker at:     tls://10.0.6.215:41295
27. 2022-11-09 22:17:08,549 - distributed.worker - INFO -          Listening to:     tls://10.0.6.215:41295
28. 2022-11-09 22:17:08,549 - distributed.worker - INFO -           Worker name: test_dataframe-86e8569e-worker-012275ba16
29. 2022-11-09 22:17:08,549 - distributed.worker - INFO -          dashboard at:            10.0.6.215:8787
30. 2022-11-09 22:17:08,549 - distributed.worker - INFO - Waiting to connect to:       tls://10.0.5.46:8786
31. 2022-11-09 22:17:08,550 - distributed.worker - INFO - -------------------------------------------------
32. 2022-11-09 22:17:08,550 - distributed.worker - INFO -               Threads:                          2
33. 2022-11-09 22:17:08,550 - distributed.worker - INFO -                Memory:                   7.16 GiB
34. 2022-11-09 22:17:08,550 - distributed.worker - INFO -       Local Directory: /scratch/dask-worker-space/worker-k7maxf3u
35. 2022-11-09 22:17:08,551 - distributed.worker - INFO - -------------------------------------------------
36. 2022-11-09 22:17:09,076 - distributed.worker - INFO - Starting Worker plugin coiled-aws-spot
37. 2022-11-09 22:17:09,076 - distributed.worker - INFO - Starting Worker plugin coiled-aws-env
38. 2022-11-09 22:17:09,076 - distributed.worker - INFO - coiled updating aws sts token environment variables
39. 2022-11-09 22:17:09,545 - distributed.worker - INFO - coiled cleared s3fs/boto session
40. 2022-11-09 22:17:09,546 - distributed.worker - INFO -         Registered to:       tls://10.0.5.46:8786
# ^^^^^^^^^^^^^ Second worker established connection to Scheduler but not in handle_scheduler yet ^^^^^^^^^^^
41. 2022-11-09 22:17:09,550 - distributed.worker - INFO - -------------------------------------------------
43. 2022-11-09 22:17:09,551 - distributed.nanny - INFO - Nanny asking worker to close. Reason: scheduler-restart
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  Second Restart ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
44. 2022-11-09 22:17:09,597 - distributed.dask_worker - INFO - Not all prometheus metrics available are exported. Digest-based metrics require crick to be installed
45. 2022-11-09 22:17:09,604 - distributed.core - INFO - Starting established connection to tls://10.0.5.46:8786
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Second worker in Worker.handle_scheduler ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
46. 2022-11-09 22:17:09,604 - distributed.worker - INFO - Stopping worker at tls://10.0.6.215:41295. Reason: scheduler-restart
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Second worker in Worker.close ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
47. 2022-11-09 22:17:09,606 - distributed.core - INFO - Connection to tls://10.0.5.46:8786 has been closed.
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Second worker in Worker.handle_scheduler encounters CommClosed ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
48. 2022-11-09 22:17:09,607 - distributed.worker - INFO - Stopping worker at tls://10.0.6.215:41295. Reason: worker-handle-scheduler-connection-broken
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ CommClosed triggers another close with nanny=True which is not waiting for the first one ^^^^^^^^^^^^^^^^^^^^^^^
49. 2022-11-09 22:17:09,614 - distributed.nanny - INFO - Closing Nanny gracefully at 'tls://10.0.6.215:45015'. Reason: worker-handle-scheduler-connection-broken
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Nanny is closed as well, such that nothing is restarted ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
50. 2022-11-09 22:17:09,615 - distributed.nanny - INFO - Worker closed
51. 2022-11-09 22:17:09,679 - distributed.worker - ERROR -
52. Traceback (most recent call last):
53.   File "/opt/coiled/env/lib/python3.9/site-packages/distributed/utils.py", line 742, in wrapper
54.     return await func(*args, **kwargs)
55.   File "/opt/coiled/env/lib/python3.9/site-packages/distributed/worker.py", line 1618, in close
56.     await to_thread(_close, executor=executor, wait=executor_wait)
57.   File "/opt/coiled/env/lib/python3.9/asyncio/threads.py", line 25, in to_thread
58.     return await loop.run_in_executor(None, func_call)
59. asyncio.exceptions.CancelledError
# ^^^^^^^^^^^^^^^^^^ One of the two close requests is trying to shut down the threadpool but this hangs for some reason. Related? Who is cancelling it? ^^^^^^^^^^^^^^
60. 2022-11-09 22:17:09,814 - distributed.nanny - INFO - Closing Nanny at 'tls://10.0.6.215:45015'. Reason: nanny-close-gracefully
61. 2022-11-09 22:17:09,814 - distributed.preloading - INFO - Run preload teardown: https://cloud.coiled.io/api/v2/cluster_facing/preload/worker

What this translates to: the second restart request arrives while the freshly restarted worker is still starting up, and the close it triggers ends up shutting down the Nanny instead of causing another restart.

What I do not understand is why the second close is not waiting for Server.finished in https://github.com/dask/distributed/blob/ef13425ff77a0ae3ec14e0cb3bc6c3a87c363dd6/distributed/worker.py#L1491-L1498. This can only happen if the status is altered after the first close (log line 46). As far as I can tell, all status mutations trigger a log message (either on the scheduler or the worker side), but I couldn't find anything.
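For reference, the guard being discussed is roughly of this shape (a paraphrase of the linked lines, not verbatim source): a close call that finds the worker already closing is expected to simply wait for the in-flight close via Server.finished.

    # Paraphrase of the early-exit guard in Worker.close (see link above):
    if self.status in (Status.closing, Status.closed, Status.failed):
        await self.finished()  # wait for the close that is already in progress
        return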

The entire thing is over in slightly under 2s.

fjetter commented 2 years ago

The issue is a race condition in a narrow time window after the worker has established the scheduler connection but before the Server base class has finished starting.

Specifically, the Worker sets the status attribute to Status.running in https://github.com/dask/distributed/blob/ef13425ff77a0ae3ec14e0cb3bc6c3a87c363dd6/distributed/worker.py#L1179 and the Server base class sets it again in https://github.com/dask/distributed/blob/ef13425ff77a0ae3ec14e0cb3bc6c3a87c363dd6/distributed/core.py#L491

During this window, which is more or less one event loop tick, a closing status set in between can be reset back to running by the server start, such that subsequent Worker.close calls are not caught by the guard.
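To make the interleaving concrete, below is a small self-contained toy model of the pattern using plain asyncio; MiniServer and its Status enum are made up for illustration and are not distributed code. A close that starts inside the window sets closing, the second status assignment after the await clobbers it back to running, and the next close sails past the guard, so the teardown path runs twice:

import asyncio
from enum import Enum


class Status(Enum):
    init = "init"
    running = "running"
    closing = "closing"
    closed = "closed"


class MiniServer:
    """Toy stand-in for the Worker/Server pair; not distributed code."""

    def __init__(self):
        self.status = Status.init
        self._finished = asyncio.Event()

    async def start(self):
        self.status = Status.running   # "Worker.start_unsafe" flips to running early
        await asyncio.sleep(0.005)     # slow plugins / extra event-loop ticks
        self.status = Status.running   # "Server.start" sets it again, clobbering "closing"

    async def close(self, reason):
        if self.status in (Status.closing, Status.closed):
            await self._finished.wait()  # guard: a second close should only wait
            return
        self.status = Status.closing
        await asyncio.sleep(0.01)        # teardown work
        self.status = Status.closed
        self._finished.set()
        print(f"closed for good ({reason})")  # printed twice => guard was bypassed


async def main():
    s = MiniServer()
    start = asyncio.create_task(s.start())
    await asyncio.sleep(0.001)  # let start() set running once
    close1 = asyncio.create_task(s.close("scheduler-restart"))
    await start                 # second status assignment overwrites "closing"
    await s.close("worker-handle-scheduler-connection-broken")  # guard misses
    await close1


asyncio.run(main())

In the real worker, the second close runs with nanny=True (see the annotation at log line 48), which is why the Nanny goes down with it instead of restarting the worker.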

fjetter commented 2 years ago

This issue can be exacerbated if there are slow plugins around that need a few ticks to start up.
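For illustration only, a hypothetical plugin along these lines (SlowStartPlugin is made up and not one of the plugins from the logs above) would lengthen worker startup, since its async setup is awaited while the worker is starting:

import asyncio

from distributed.diagnostics.plugin import WorkerPlugin


class SlowStartPlugin(WorkerPlugin):
    """Hypothetical plugin, only to illustrate the point above."""

    name = "slow-start"

    async def setup(self, worker):
        # Awaited during worker startup; every extra tick here gives the
        # restart race more room to hit the in-between state.
        await asyncio.sleep(1)

Registered via client.register_worker_plugin(SlowStartPlugin()), such a plugin adds ticks to the startup sequence and so makes the window easier to hit.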

fjetter commented 2 years ago

Below is a reproducer. Interestingly, the Nanny.restart API does not suffer from this problem since it does not rely on a done callback but instantiates the worker explicitly.

import asyncio

import pytest

from distributed import Nanny
from distributed.core import ConnectionPool, Status
from distributed.utils_test import gen_cluster


@pytest.mark.parametrize("api", ["restart", "kill"])
@gen_cluster(client=True, nthreads=[("", 1)], Worker=Nanny)
async def test_restart_stress(c, s, a, api):
    async def keep_killing():
        # Hammer the nanny with restart/kill requests, mimicking repeated
        # scheduler-initiated restarts.
        pool = await ConnectionPool()
        try:
            rpc = pool(a.address)
            for _ in range(10):
                try:
                    meth = getattr(rpc, api)
                    await meth(reason="scheduler-restart")
                except OSError:
                    # The nanny went away entirely; stop hammering
                    break
                await asyncio.sleep(0.1)
        finally:
            await pool.close()

    kill_task = asyncio.create_task(keep_killing())
    await kill_task
    # After the dust settles, the nanny should still be up, i.e. it must not
    # have been torn down by the race described above.
    assert a.status == Status.running
FWIW, spawning multiple keep_killing tasks is also fine because Nanny.instantiate, which runs behind Nanny.restart, is idempotent.
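That multi-killer variation would look roughly like this (same fixtures and keep_killing helper as the test above; the task count is arbitrary):

    # Several concurrent killers; Nanny.instantiate is idempotent,
    # so the nanny still comes back up.
    kill_tasks = [asyncio.create_task(keep_killing()) for _ in range(5)]
    await asyncio.gather(*kill_tasks)
    assert a.status == Status.running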