dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Two (three?) failing tests on Debian with Python 3.12: `test_steal_twice` and `test_client_worker` #8437

Open · juliangilbey opened this issue 10 months ago

juliangilbey commented 10 months ago

Describe the issue: I am in the process of updating the Debian packaging of dask, and I have just run the test suite on Debian unstable with version 2023.12.1 and Python 3.12. There are three test failures that I don't understand, and I don't know whether they are significant. Any thoughts would be appreciated!

These are being run in a clean chroot environment, and I can provide the full log if that would be useful.
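
For reference, a minimal reproduction sketch (my own addition, not part of the original report), assuming a source checkout of distributed 2023.12.1 with pytest available under Python 3.12; the test node IDs are taken from the tracebacks quoted below:

```python
# Hypothetical helper: run only the two failing tests in verbose mode.
# The paths match the tracebacks below; adjust them to the actual build tree
# (e.g. the .pybuild directory used by the Debian package build).
import sys

import pytest

if __name__ == "__main__":
    sys.exit(
        pytest.main(
            [
                "-v",
                "distributed/tests/test_pubsub.py::test_client_worker",
                "distributed/tests/test_steal.py::test_steal_twice",
            ]
        )
    )
```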

(1) `test_process_executor_kills_process`: this just hangs until the pytest process is manually killed. (Only tested on Python 3.12; see the timeout sketch after item (3) below.)

(2) `test_client_worker` fails with Python 3.12 but succeeds with Python 3.11; the output is below.

(3) `test_steal_twice` also fails, with both Python 3.11 and 3.12. The Python 3.12 output is below.
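
For failure (1), which hangs rather than fails, one way to turn the hang into a reportable failure is to run the test under a per-test time limit. This is a sketch of my own, assuming the `pytest-timeout` plugin is installed; in the report itself the pytest process was simply killed by hand:

```python
# Sketch only: requires the pytest-timeout plugin.
# Select the hanging test by name (-k) and abort it after a fixed budget so the
# suite fails with a traceback instead of blocking until killed.
import sys

import pytest

if __name__ == "__main__":
    sys.exit(
        pytest.main(
            [
                "-k", "test_process_executor_kills_process",
                "--timeout=300",            # hard limit per test, in seconds
                "--timeout-method=thread",  # dump stacks of all threads on timeout
                "distributed/tests",
            ]
        )
    )
```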

Failure output for (2) and (3) under Python 3.12:

______________________________ test_client_worker ______________________________

c = <Client: No scheduler connected>
s = <Scheduler 'tcp://127.0.0.1:44761', workers: 0, cores: 0, tasks: 0>
a = <Worker 'tcp://127.0.0.1:36803', name: 0, status: closed, stored: 0, running: 0/1, ready: 0, comm: 0, waiting: 0>
b = <Worker 'tcp://127.0.0.1:41311', name: 1, status: closed, stored: 0, running: 0/2, ready: 0, comm: 0, waiting: 0>

    @gen_cluster(client=True)
    async def test_client_worker(c, s, a, b):
        sub = Sub("a", client=c, worker=None)

        def f(x):
            pub = Pub("a")
            pub.put(x)

        futures = c.map(f, range(10))
        await wait(futures)

        L = []
        for _ in range(10):
            result = await sub.get()
            L.append(result)

        assert set(L) == set(range(10))

        sps = s.extensions["pubsub"]
        aps = a.extensions["pubsub"]
        bps = b.extensions["pubsub"]

        start = time()
        while (
            sps.publishers["a"]
            or sps.subscribers["a"]
            or aps.publishers["a"]
            or bps.publishers["a"]
            or len(sps.client_subscribers["a"]) != 1
        ):
            await asyncio.sleep(0.01)
            assert time() < start + 3

        del sub

        start = time()
        while (
            sps.client_subscribers
            or any(aps.publish_to_scheduler.values())
            or any(bps.publish_to_scheduler.values())
        ):
            await asyncio.sleep(0.01)
>           assert time() < start + 3
E           assert 1704108934.5217936 < (1704108931.5140293 + 3)
E            +  where 1704108934.5217936 = time()

../../../distributed/tests/test_pubsub.py:118: AssertionError
----------------------------- Captured stdout call -----------------------------
Dumped cluster state to test_cluster_dump/test_client_worker.yaml
----------------------------- Captured stderr call -----------------------------
2024-01-01 11:35:31,377 - distributed.scheduler - INFO - State start
2024-01-01 11:35:31,390 - distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:44761
2024-01-01 11:35:31,391 - distributed.scheduler - INFO -   dashboard at:  http://127.0.0.1:38493/status
2024-01-01 11:35:31,391 - distributed.scheduler - INFO - Registering Worker plugin shuffle
2024-01-01 11:35:31,415 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:36803
2024-01-01 11:35:31,415 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:36803
2024-01-01 11:35:31,415 - distributed.worker - INFO -           Worker name:                          0
2024-01-01 11:35:31,415 - distributed.worker - INFO -          dashboard at:            127.0.0.1:36319
2024-01-01 11:35:31,415 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:44761
2024-01-01 11:35:31,415 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:35:31,415 - distributed.worker - INFO -               Threads:                          1
2024-01-01 11:35:31,415 - distributed.worker - INFO -                Memory:                  62.64 GiB
2024-01-01 11:35:31,415 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-q6kk2f3p
2024-01-01 11:35:31,415 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:35:31,416 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:41311
2024-01-01 11:35:31,416 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:41311
2024-01-01 11:35:31,416 - distributed.worker - INFO -           Worker name:                          1
2024-01-01 11:35:31,416 - distributed.worker - INFO -          dashboard at:            127.0.0.1:34555
2024-01-01 11:35:31,416 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:44761
2024-01-01 11:35:31,416 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:35:31,416 - distributed.worker - INFO -               Threads:                          2
2024-01-01 11:35:31,416 - distributed.worker - INFO -                Memory:                  62.64 GiB
2024-01-01 11:35:31,416 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-a6nx6vma
2024-01-01 11:35:31,416 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:35:31,435 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:36803', name: 0, status: init, memory: 0, processing: 0>
2024-01-01 11:35:31,444 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:36803
2024-01-01 11:35:31,444 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:47390
2024-01-01 11:35:31,444 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:41311', name: 1, status: init, memory: 0, processing: 0>
2024-01-01 11:35:31,452 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:41311
2024-01-01 11:35:31,452 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:47394
2024-01-01 11:35:31,453 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-01-01 11:35:31,453 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-01-01 11:35:31,453 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:44761
2024-01-01 11:35:31,453 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:35:31,454 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:44761
2024-01-01 11:35:31,454 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:35:31,454 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:44761
2024-01-01 11:35:31,454 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:44761
2024-01-01 11:35:31,464 - distributed.scheduler - INFO - Receive client connection: Client-deaf1f3d-a899-11ee-ba23-049226da1b49
2024-01-01 11:35:31,473 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:47398
2024-01-01 11:35:34,652 - distributed.scheduler - INFO - Remove client Client-deaf1f3d-a899-11ee-ba23-049226da1b49
2024-01-01 11:35:34,652 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:47398; closing.
2024-01-01 11:35:34,652 - distributed.scheduler - INFO - Remove client Client-deaf1f3d-a899-11ee-ba23-049226da1b49
2024-01-01 11:35:34,653 - distributed.scheduler - INFO - Close client connection: Client-deaf1f3d-a899-11ee-ba23-049226da1b49
2024-01-01 11:35:34,654 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:36803. Reason: worker-close
2024-01-01 11:35:34,654 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:41311. Reason: worker-close
2024-01-01 11:35:34,656 - distributed.core - INFO - Connection to tcp://127.0.0.1:44761 has been closed.
2024-01-01 11:35:34,656 - distributed.core - INFO - Connection to tcp://127.0.0.1:44761 has been closed.
2024-01-01 11:35:34,656 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:47390; closing.
2024-01-01 11:35:34,657 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:47394; closing.
2024-01-01 11:35:34,657 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:36803', name: 0, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1704108934.6572363')
2024-01-01 11:35:34,657 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:41311', name: 1, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1704108934.6574624')
2024-01-01 11:35:34,657 - distributed.scheduler - INFO - Lost all workers
2024-01-01 11:35:34,659 - distributed.scheduler - INFO - Scheduler closing due to unknown reason...
2024-01-01 11:35:34,659 - distributed.scheduler - INFO - Scheduler closing all comms
_______________________________ test_steal_twice _______________________________

c = <Client: No scheduler connected>
s = <Scheduler 'tcp://127.0.0.1:38197', workers: 0, cores: 0, tasks: 0>
a = <Worker 'tcp://127.0.0.1:37193', name: 0, status: closed, stored: 0, running: 1/1, ready: 0, comm: 0, waiting: 0>
b = <Worker 'tcp://127.0.0.1:46287', name: 1, status: closed, stored: 0, running: 1/2, ready: 0, comm: 0, waiting: 0>

    @gen_cluster(client=True)
    async def test_steal_twice(c, s, a, b):
        x = c.submit(inc, 1, workers=a.address)
        await wait(x)

        futures = [c.submit(slowadd, x, i, delay=0.2) for i in range(100)]

        while len(s.tasks) < 100:  # tasks are all allocated
            await asyncio.sleep(0.01)
        if math.isinf(s.WORKER_SATURATION):
            # Wait for b to start stealing tasks
            while len(b.state.tasks) < 30:
                await asyncio.sleep(0.01)
        else:
            # Wait for b to complete some tasks
            while len(b.data) < 8:
                await asyncio.sleep(0.01)

        # Army of new workers arrives to help
        async with contextlib.AsyncExitStack() as stack:
            # This is pretty timing sensitive
            workers = [stack.enter_async_context(Worker(s.address)) for _ in range(10)]
            workers = await asyncio.gather(*workers)

            await wait(futures)

            # Note: this includes a and b
            empty_workers = [ws for ws in s.workers.values() if not ws.has_what]
>           assert (
                len(empty_workers) < 3
            ), f"Too many workers without keys ({len(empty_workers)} out of {len(s.workers)})"
E           AssertionError: Too many workers without keys (4 out of 12)
E           assert 4 < 3
E            +  where 4 = len([<WorkerState 'tcp://127.0.0.1:34201', status: running, memory: 0, processing: 0>, <WorkerState 'tcp://127.0.0.1:35719', status: running, memory: 0, processing: 0>, <WorkerState 'tcp://127.0.0.1:40373', status: running, memory: 0, processing: 0>, <WorkerState 'tcp://127.0.0.1:45623', status: running, memory: 0, processing: 0>])

../../../distributed/tests/test_steal.py:852: AssertionError
----------------------------- Captured stdout call -----------------------------
Dumped cluster state to test_cluster_dump/test_steal_twice.yaml
----------------------------- Captured stderr call -----------------------------
2024-01-01 11:37:35,055 - distributed.scheduler - INFO - State start
2024-01-01 11:37:35,072 - distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:38197
2024-01-01 11:37:35,072 - distributed.scheduler - INFO -   dashboard at:  http://127.0.0.1:35155/status
2024-01-01 11:37:35,072 - distributed.scheduler - INFO - Registering Worker plugin shuffle
2024-01-01 11:37:35,104 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:37193
2024-01-01 11:37:35,104 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:37193
2024-01-01 11:37:35,104 - distributed.worker - INFO -           Worker name:                          0
2024-01-01 11:37:35,104 - distributed.worker - INFO -          dashboard at:            127.0.0.1:43199
2024-01-01 11:37:35,104 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:38197
2024-01-01 11:37:35,104 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:35,104 - distributed.worker - INFO -               Threads:                          1
2024-01-01 11:37:35,104 - distributed.worker - INFO -                Memory:                  62.64 GiB
2024-01-01 11:37:35,105 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-07ig2csl
2024-01-01 11:37:35,105 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:35,105 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:46287
2024-01-01 11:37:35,105 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:46287
2024-01-01 11:37:35,105 - distributed.worker - INFO -           Worker name:                          1
2024-01-01 11:37:35,105 - distributed.worker - INFO -          dashboard at:            127.0.0.1:39487
2024-01-01 11:37:35,105 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:38197
2024-01-01 11:37:35,105 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:35,105 - distributed.worker - INFO -               Threads:                          2
2024-01-01 11:37:35,105 - distributed.worker - INFO -                Memory:                  62.64 GiB
2024-01-01 11:37:35,105 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-dnv7hb76
2024-01-01 11:37:35,106 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:35,130 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:37193', name: 0, status: init, memory: 0, processing: 0>
2024-01-01 11:37:35,141 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:37193
2024-01-01 11:37:35,142 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:44500
2024-01-01 11:37:35,142 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:46287', name: 1, status: init, memory: 0, processing: 0>
2024-01-01 11:37:35,153 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:46287
2024-01-01 11:37:35,153 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:44514
2024-01-01 11:37:35,153 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-01-01 11:37:35,154 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-01-01 11:37:35,154 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:38197
2024-01-01 11:37:35,154 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:35,154 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:38197
2024-01-01 11:37:35,154 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:35,155 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:38197
2024-01-01 11:37:35,155 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:38197
2024-01-01 11:37:35,168 - distributed.scheduler - INFO - Receive client connection: Client-286a55d3-a89a-11ee-ba23-049226da1b49
2024-01-01 11:37:35,180 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:44530
2024-01-01 11:37:36,184 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:44363
2024-01-01 11:37:36,184 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:44363
2024-01-01 11:37:36,184 - distributed.worker - INFO -          dashboard at:            127.0.0.1:33317
2024-01-01 11:37:36,184 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:38197
2024-01-01 11:37:36,184 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,184 - distributed.worker - INFO -               Threads:                         32
2024-01-01 11:37:36,184 - distributed.worker - INFO -                Memory:                  62.64 GiB
2024-01-01 11:37:36,184 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-zecwpx2x
2024-01-01 11:37:36,184 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,185 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:34909
2024-01-01 11:37:36,185 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:34909
2024-01-01 11:37:36,185 - distributed.worker - INFO -          dashboard at:            127.0.0.1:39159
2024-01-01 11:37:36,185 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:38197
2024-01-01 11:37:36,185 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,185 - distributed.worker - INFO -               Threads:                         32
2024-01-01 11:37:36,185 - distributed.worker - INFO -                Memory:                  62.64 GiB
2024-01-01 11:37:36,185 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-52o88w_9
2024-01-01 11:37:36,185 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,186 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:36777
2024-01-01 11:37:36,186 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:36777
2024-01-01 11:37:36,186 - distributed.worker - INFO -          dashboard at:            127.0.0.1:35567
2024-01-01 11:37:36,186 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:38197
2024-01-01 11:37:36,186 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,186 - distributed.worker - INFO -               Threads:                         32
2024-01-01 11:37:36,186 - distributed.worker - INFO -                Memory:                  62.64 GiB
2024-01-01 11:37:36,186 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-t8qerswb
2024-01-01 11:37:36,186 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,187 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:40373
2024-01-01 11:37:36,187 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:40373
2024-01-01 11:37:36,187 - distributed.worker - INFO -          dashboard at:            127.0.0.1:37427
2024-01-01 11:37:36,187 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:38197
2024-01-01 11:37:36,187 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,187 - distributed.worker - INFO -               Threads:                         32
2024-01-01 11:37:36,187 - distributed.worker - INFO -                Memory:                  62.64 GiB
2024-01-01 11:37:36,187 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-3607u4hm
2024-01-01 11:37:36,187 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,188 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:34201
2024-01-01 11:37:36,188 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:34201
2024-01-01 11:37:36,188 - distributed.worker - INFO -          dashboard at:            127.0.0.1:41503
2024-01-01 11:37:36,188 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:38197
2024-01-01 11:37:36,188 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,188 - distributed.worker - INFO -               Threads:                         32
2024-01-01 11:37:36,188 - distributed.worker - INFO -                Memory:                  62.64 GiB
2024-01-01 11:37:36,188 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-mru3ab1o
2024-01-01 11:37:36,188 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,189 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:35719
2024-01-01 11:37:36,189 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:35719
2024-01-01 11:37:36,189 - distributed.worker - INFO -          dashboard at:            127.0.0.1:33141
2024-01-01 11:37:36,189 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:38197
2024-01-01 11:37:36,189 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,189 - distributed.worker - INFO -               Threads:                         32
2024-01-01 11:37:36,189 - distributed.worker - INFO -                Memory:                  62.64 GiB
2024-01-01 11:37:36,189 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-y4py4qv6
2024-01-01 11:37:36,189 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,190 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:45731
2024-01-01 11:37:36,190 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:45731
2024-01-01 11:37:36,190 - distributed.worker - INFO -          dashboard at:            127.0.0.1:32859
2024-01-01 11:37:36,190 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:38197
2024-01-01 11:37:36,190 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,190 - distributed.worker - INFO -               Threads:                         32
2024-01-01 11:37:36,190 - distributed.worker - INFO -                Memory:                  62.64 GiB
2024-01-01 11:37:36,190 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-inal24jm
2024-01-01 11:37:36,190 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,190 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:41249
2024-01-01 11:37:36,190 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:41249
2024-01-01 11:37:36,191 - distributed.worker - INFO -          dashboard at:            127.0.0.1:45131
2024-01-01 11:37:36,191 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:38197
2024-01-01 11:37:36,191 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,191 - distributed.worker - INFO -               Threads:                         32
2024-01-01 11:37:36,191 - distributed.worker - INFO -                Memory:                  62.64 GiB
2024-01-01 11:37:36,191 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-6ribf305
2024-01-01 11:37:36,191 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,191 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:45623
2024-01-01 11:37:36,191 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:45623
2024-01-01 11:37:36,191 - distributed.worker - INFO -          dashboard at:            127.0.0.1:39211
2024-01-01 11:37:36,192 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:38197
2024-01-01 11:37:36,192 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,192 - distributed.worker - INFO -               Threads:                         32
2024-01-01 11:37:36,192 - distributed.worker - INFO -                Memory:                  62.64 GiB
2024-01-01 11:37:36,192 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-a0qy1nxt
2024-01-01 11:37:36,192 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,192 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:41985
2024-01-01 11:37:36,192 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:41985
2024-01-01 11:37:36,192 - distributed.worker - INFO -          dashboard at:            127.0.0.1:33809
2024-01-01 11:37:36,192 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:38197
2024-01-01 11:37:36,192 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,193 - distributed.worker - INFO -               Threads:                         32
2024-01-01 11:37:36,193 - distributed.worker - INFO -                Memory:                  62.64 GiB
2024-01-01 11:37:36,193 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-0m2s0v9t
2024-01-01 11:37:36,193 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,316 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:44363', status: init, memory: 0, processing: 0>
2024-01-01 11:37:36,328 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:44363
2024-01-01 11:37:36,328 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:38954
2024-01-01 11:37:36,328 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:34909', status: init, memory: 0, processing: 0>
2024-01-01 11:37:36,340 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:34909
2024-01-01 11:37:36,340 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:38970
2024-01-01 11:37:36,340 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:36777', status: init, memory: 0, processing: 0>
2024-01-01 11:37:36,351 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:36777
2024-01-01 11:37:36,351 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:38986
2024-01-01 11:37:36,351 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:40373', status: init, memory: 0, processing: 0>
2024-01-01 11:37:36,362 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:40373
2024-01-01 11:37:36,362 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:38996
2024-01-01 11:37:36,363 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:34201', status: init, memory: 0, processing: 0>
2024-01-01 11:37:36,374 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:34201
2024-01-01 11:37:36,374 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:39008
2024-01-01 11:37:36,374 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:35719', status: init, memory: 0, processing: 0>
2024-01-01 11:37:36,385 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:35719
2024-01-01 11:37:36,385 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:39016
2024-01-01 11:37:36,385 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:45731', status: init, memory: 0, processing: 0>
2024-01-01 11:37:36,396 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:45731
2024-01-01 11:37:36,396 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:39018
2024-01-01 11:37:36,396 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:41249', status: init, memory: 0, processing: 0>
2024-01-01 11:37:36,407 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:41249
2024-01-01 11:37:36,407 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:39026
2024-01-01 11:37:36,408 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:45623', status: init, memory: 0, processing: 0>
2024-01-01 11:37:36,419 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:45623
2024-01-01 11:37:36,419 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:39036
2024-01-01 11:37:36,419 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:41985', status: init, memory: 0, processing: 0>
2024-01-01 11:37:36,430 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:41985
2024-01-01 11:37:36,430 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:39042
2024-01-01 11:37:36,433 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-01-01 11:37:36,433 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-01-01 11:37:36,433 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-01-01 11:37:36,433 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-01-01 11:37:36,433 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-01-01 11:37:36,434 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-01-01 11:37:36,434 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-01-01 11:37:36,434 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-01-01 11:37:36,434 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-01-01 11:37:36,434 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-01-01 11:37:36,435 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:38197
2024-01-01 11:37:36,435 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,436 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:38197
2024-01-01 11:37:36,436 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,436 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:38197
2024-01-01 11:37:36,436 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,436 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:38197
2024-01-01 11:37:36,436 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,436 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:38197
2024-01-01 11:37:36,436 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,436 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:38197
2024-01-01 11:37:36,436 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,437 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:38197
2024-01-01 11:37:36,437 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,437 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:38197
2024-01-01 11:37:36,437 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,437 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:38197
2024-01-01 11:37:36,437 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,437 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:38197
2024-01-01 11:37:36,437 - distributed.worker - INFO - -------------------------------------------------
2024-01-01 11:37:36,441 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:38197
2024-01-01 11:37:36,441 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:38197
2024-01-01 11:37:36,441 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:38197
2024-01-01 11:37:36,441 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:38197
2024-01-01 11:37:36,442 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:38197
2024-01-01 11:37:36,442 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:38197
2024-01-01 11:37:36,442 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:38197
2024-01-01 11:37:36,442 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:38197
2024-01-01 11:37:36,442 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:38197
2024-01-01 11:37:36,442 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:38197
2024-01-01 11:37:36,989 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:41985. Reason: worker-close
2024-01-01 11:37:36,991 - distributed.core - INFO - Connection to tcp://127.0.0.1:38197 has been closed.
2024-01-01 11:37:36,992 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:39042; closing.
2024-01-01 11:37:36,992 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:41985', status: closing, memory: 4, processing: 0> (stimulus_id='handle-worker-cleanup-1704109056.9923167')
2024-01-01 11:37:36,995 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:45623. Reason: worker-close
2024-01-01 11:37:37,001 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:39036; closing.
2024-01-01 11:37:37,001 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:45623', status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1704109057.0015428')
2024-01-01 11:37:37,001 - distributed.core - INFO - Connection to tcp://127.0.0.1:38197 has been closed.
2024-01-01 11:37:37,003 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:41249. Reason: worker-close
2024-01-01 11:37:37,004 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-90bd7d0ecdc0a3291b408b77b85a1bf8')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,004 - distributed.core - INFO - Connection to tcp://127.0.0.1:38197 has been closed.
2024-01-01 11:37:37,005 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:39026; closing.
2024-01-01 11:37:37,005 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:41249', status: closing, memory: 3, processing: 1> (stimulus_id='handle-worker-cleanup-1704109057.0052762')
2024-01-01 11:37:37,200 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:45731. Reason: worker-close
2024-01-01 11:37:37,202 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:39018; closing.
2024-01-01 11:37:37,202 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:45731', status: closing, memory: 4, processing: 0> (stimulus_id='handle-worker-cleanup-1704109057.202909')
2024-01-01 11:37:37,203 - distributed.core - INFO - Connection to tcp://127.0.0.1:38197 has been closed.
2024-01-01 11:37:37,210 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:35719. Reason: worker-close
2024-01-01 11:37:37,213 - distributed.core - INFO - Connection to tcp://127.0.0.1:38197 has been closed.
2024-01-01 11:37:37,213 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:39016; closing.
2024-01-01 11:37:37,213 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:35719', status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1704109057.2136083')
2024-01-01 11:37:37,215 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:34201. Reason: worker-close
2024-01-01 11:37:37,216 - distributed.core - INFO - Connection to tcp://127.0.0.1:38197 has been closed.
2024-01-01 11:37:37,216 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:39008; closing.
2024-01-01 11:37:37,216 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:34201', status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1704109057.2167583')
2024-01-01 11:37:37,217 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:40373. Reason: worker-close
2024-01-01 11:37:37,217 - distributed.core - INFO - Connection to tcp://127.0.0.1:38197 has been closed.
2024-01-01 11:37:37,218 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:38996; closing.
2024-01-01 11:37:37,218 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:40373', status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1704109057.2182257')
2024-01-01 11:37:37,218 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:36777. Reason: worker-close
2024-01-01 11:37:37,218 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-ea6f558eafb3efba20128b762433ae4a')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,219 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-32d0de083bf7ac2d5b17c7a44a480cac')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,220 - distributed.core - INFO - Connection to tcp://127.0.0.1:38197 has been closed.
2024-01-01 11:37:37,220 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:38986; closing.
2024-01-01 11:37:37,221 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:36777', status: closing, memory: 10, processing: 2> (stimulus_id='handle-worker-cleanup-1704109057.2209876')
2024-01-01 11:37:37,409 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:34909. Reason: worker-close
2024-01-01 11:37:37,410 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-1b51c7e05501ebc19a86890ba54b1177')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,410 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-32d0de083bf7ac2d5b17c7a44a480cac')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,411 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-ea6f558eafb3efba20128b762433ae4a')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,411 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-6b81e4b4b95f6d45edaa9e2c9083644a')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,411 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-ab54dafaaf8ae0696a554631c34478d3')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,411 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-4e1a09bb713778a3b2104ce734505348')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,411 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-4b9c5bfaffed8b06dda4e11a06946d01')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,411 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-3e6a1b18ad9cdbb9c5c6dbb36794429b')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,411 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-4e9d580b82d477e32c92757b17d71a3d')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,411 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-6641fcdf6ebfa9c26d8eab30ee67fa07')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,413 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:38970; closing.
2024-01-01 11:37:37,413 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:34909', status: closing, memory: 34, processing: 10> (stimulus_id='handle-worker-cleanup-1704109057.413482')
2024-01-01 11:37:37,417 - distributed.core - INFO - Connection to tcp://127.0.0.1:38197 has been closed.
2024-01-01 11:37:37,436 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:44363. Reason: worker-close
2024-01-01 11:37:37,440 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-ec408b75d4e792350b064b773d4be426')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,440 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-004549f38e9e9a7ccc1466b766035fc6')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,440 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-f4832d2941c493814b65df33ca1154d6')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,440 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-1f5880ad16867a13e955ccd88624706e')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,440 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-82aa843de72899f0aaa4309311f736b9')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,440 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-79c2710d4716459f88ddcd510d2e3658')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,440 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-bf1bc166c2fa5333f02351a42eb325ca')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,441 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-0ac36fa4d901702159e66786e085aaf3')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,441 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-4e9d580b82d477e32c92757b17d71a3d')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,441 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-816e97f00d926d07f07c4c1e0f31830e')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,441 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-535de2b3bd7827ec0dfdbd5771890406')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,441 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-58f8aee9644eda6f698d1cbddbb63ad0')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,441 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-5ed8209a11bfaf8402fbf7c69e39aeae')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,441 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-d3d96cf0ff10bedb44b999590c81513e')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,441 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-210d453fb13148a4958e3a169192ac5e')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,441 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-48ff5589204a65d00eacb1a54365d032')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,441 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-f1aafa250b81a765e6c11668fd50e550')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,441 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-2e7f87e7c7494561e7895262fece6cf3')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,441 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-1fe123e0106fd45e8d22f31ae6e93718')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,441 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-0e8c03ba7e1f05ef611b29a64d9759e0')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,441 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-e921f6d7fa6e2f510519ab0e2b47bbdc')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,442 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-ab54dafaaf8ae0696a554631c34478d3')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,442 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-e8b189adb9b2658a0289ae1cad4229d5')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,442 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-4e33da58de5bc25833710e1327bbeb54')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,442 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-ee2b2cb878fc33521c30d0743a6cfc70')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,442 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-3018e8d5325396c9f713ac86fa76f8de')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,442 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-fef0f0a1df94c01f1c3e0890b7ba8a65')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,442 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-1b578d9311cdd4d28130038dba093c0c')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,442 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-0cae36d03fb89499ed4f2e4220422b0a')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,442 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-ea6f558eafb3efba20128b762433ae4a')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,442 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-e8a2df442b0671d2c8c50c6c29b44dac')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,442 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-90bd7d0ecdc0a3291b408b77b85a1bf8')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:37,445 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/comm/tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker.py", line 1252, in heartbeat
    response = await retry_operation(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/utils_comm.py", line 455, in retry_operation
    return await retry(
           ^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/utils_comm.py", line 434, in retry
    return await coro()
           ^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/core.py", line 1394, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/core.py", line 1153, in send_recv
    response = await comm.read(deserializers=deserializers)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/comm/tcp.py", line 237, in read
    convert_stream_closed_error(self, e)
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:39070 remote=tcp://127.0.0.1:38197>: Stream is closed
2024-01-01 11:37:37,447 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:38954; closing.
2024-01-01 11:37:37,447 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:44363', status: closing, memory: 34, processing: 36> (stimulus_id='handle-worker-cleanup-1704109057.4473193')
2024-01-01 11:37:37,449 - distributed.core - INFO - Connection to tcp://127.0.0.1:38197 has been closed.
2024-01-01 11:37:39,022 - distributed.scheduler - INFO - Remove client Client-286a55d3-a89a-11ee-ba23-049226da1b49
2024-01-01 11:37:39,024 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:44530; closing.
2024-01-01 11:37:39,025 - distributed.scheduler - INFO - Remove client Client-286a55d3-a89a-11ee-ba23-049226da1b49
2024-01-01 11:37:39,026 - distributed.scheduler - INFO - Close client connection: Client-286a55d3-a89a-11ee-ba23-049226da1b49
2024-01-01 11:37:39,028 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:37193. Reason: worker-close
2024-01-01 11:37:39,028 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:46287. Reason: worker-close
2024-01-01 11:37:39,029 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-4b9c5bfaffed8b06dda4e11a06946d01')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:39,029 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('slowadd-4e1a09bb713778a3b2104ce734505348')" coro=<Worker.execute() done, defined at /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12_distributed/build/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-01-01 11:37:39,031 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:44500; closing.
2024-01-01 11:37:39,031 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:44514; closing.
2024-01-01 11:37:39,031 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:37193', name: 0, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1704109059.0319133')
2024-01-01 11:37:39,032 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:46287', name: 1, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1704109059.0321436')
2024-01-01 11:37:39,032 - distributed.scheduler - INFO - Lost all workers
2024-01-01 11:37:39,032 - distributed.core - INFO - Connection to tcp://127.0.0.1:38197 has been closed.
2024-01-01 11:37:39,032 - distributed.core - INFO - Connection to tcp://127.0.0.1:38197 has been closed.
2024-01-01 11:37:39,223 - distributed.scheduler - INFO - Scheduler closing due to unknown reason...
2024-01-01 11:37:39,224 - distributed.scheduler - INFO - Scheduler closing all comms

Environment:

- Dask version: 2023.12.1
- Python version: 3.11 / 3.12
- Operating System: Debian unstable
- Install method (conda, pip, source): built from source (Debian package build in a clean chroot)

QuLogic commented 4 months ago

I think (1) should be fixed by #8715. (2) seems like a possibly over-aggressive timeout? I also see `test_steal_twice` as flaky, failing on some systems but not others.
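
For context on why (2) can trip on a slow or loaded build host: the failing assertions in `test_client_worker` follow a poll-until-deadline pattern with a roughly three-second budget, so pubsub cleanup after `del sub` only has to lag slightly for the test to fail. A stripped-down sketch of that pattern (my paraphrase of the test code quoted above, not new distributed API):

```python
# Simplified form of the waiting loops in test_client_worker: poll a cleanup
# condition every 10 ms and assert that it becomes true within ~3 seconds.
import asyncio
import time


async def assert_eventually(predicate, deadline=3.0, interval=0.01):
    """Fail if predicate() does not become true within `deadline` seconds."""
    start = time.time()
    while not predicate():
        await asyncio.sleep(interval)
        assert time.time() < start + deadline, "condition not met within deadline"
```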