dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

KeyError when running P2P rechunking #7599

Open d-v-b opened 1 year ago

d-v-b commented 1 year ago

(edit: I'm using python 3.10.8 and both dask and distributed are version 2022.3.1) I have several issues running this code:

Downsampling + rechunking

```python
import dask.array as da
import numpy as np
import dask
import time
from distributed import Client
from dask_jobqueue import LSFCluster

shape = (2000,) * 3
source_chunks = (1024,) * 3
dest_chunks = (64,) * 3
data = da.random.randint(0, 255, shape, dtype='uint8', chunks=source_chunks)

levels = 8
multi = [data]
for level in range(levels):
    multi.append(da.coarsen(np.mean, multi[-1], {0: 2, 1: 2, 2: 2}, trim_excess=True))

rechunked_tasks = []
for m in multi:
    # only rechunk if the chunks are too small
    if any(c1 < c2 for c1, c2 in zip(m.chunksize, dest_chunks)):
        rechunked_tasks.append(m.rechunk(dest_chunks, algorithm='tasks'))
    else:
        rechunked_tasks.append(m)
mean_tasks = [m.mean() for m in rechunked_tasks]

with dask.config.set({'optimization.fuse.active': False}):
    rechunked_p2p = []
    for m in multi:
        # only rechunk if the chunks are too small
        if any(c1 < c2 for c1, c2 in zip(m.chunksize, dest_chunks)):
            rechunked_p2p.append(m.rechunk(dest_chunks, algorithm='p2p'))
        else:
            rechunked_p2p.append(m)
    mean_p2p = [m.mean() for m in rechunked_p2p]

if __name__ == '__main__':
    num_cores = 8
    cluster = LSFCluster(
        cores=num_cores,
        processes=1,
        memory=f"{15 * num_cores}GB",
        ncpus=num_cores,
        mem=15 * num_cores,
        walltime="72:00",
    )
    cluster.scale(10)
    cl = Client(cluster)
    print(f"Begin distributed operations. Dask dashboard url: {cl.dashboard_link}")
    start = time.time()
    cl.compute(mean_p2p, sync=True)
    print(f"Completed p2p rechunking -> mean after {time.time() - start} s")
    start = time.time()
    cl.compute(mean_tasks, sync=True)
    print(f"Completed tasks rechunking -> mean after {time.time() - start} s")
```

first, I got an import error due to pyarrow missing from my python environment when I ran cl.compute. The error should probably happen earlier, e.g. as soon as I select algorithm=p2p in da.rechunk.

After installing pyarrow, I get a new error:

Traceback ```bash distributed.core - ERROR - Exception while handling op shuffle_get_or_create Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 139, in get_or_create return self.get(id, worker) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 127, in get state = self.states[id] KeyError: '290a0eef76b7ab6e389634324025588a' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 818, in _handle_comm result = handler(**msg) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 145, in get_or_create state = self._create_array_rechunk_state(id, spec) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 204, in _create_array_rechunk_state for ts in self.scheduler.tasks[name].dependents: KeyError: 'shuffle-barrier-290a0eef76b7ab6e389634324025588a' distributed.core - ERROR - Exception while handling op shuffle_get_or_create Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 139, in get_or_create return self.get(id, worker) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 127, in get state = self.states[id] KeyError: '290a0eef76b7ab6e389634324025588a' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 818, in _handle_comm result = handler(**msg) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 145, in get_or_create state = self._create_array_rechunk_state(id, spec) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 204, in _create_array_rechunk_state for ts in self.scheduler.tasks[name].dependents: KeyError: 'shuffle-barrier-290a0eef76b7ab6e389634324025588a' distributed.core - ERROR - Exception while handling op shuffle_get_or_create Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 139, in get_or_create return self.get(id, worker) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 127, in get state = self.states[id] KeyError: '290a0eef76b7ab6e389634324025588a' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 818, in _handle_comm result = handler(**msg) File 
"/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 145, in get_or_create state = self._create_array_rechunk_state(id, spec) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 204, in _create_array_rechunk_state for ts in self.scheduler.tasks[name].dependents: KeyError: 'shuffle-barrier-290a0eef76b7ab6e389634324025588a' distributed.core - ERROR - Exception while handling op shuffle_get_or_create Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 139, in get_or_create return self.get(id, worker) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 127, in get state = self.states[id] KeyError: '290a0eef76b7ab6e389634324025588a' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 818, in _handle_comm result = handler(**msg) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 145, in get_or_create state = self._create_array_rechunk_state(id, spec) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 204, in _create_array_rechunk_state for ts in self.scheduler.tasks[name].dependents: KeyError: 'shuffle-barrier-290a0eef76b7ab6e389634324025588a' Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_rechunk.py", line 41, in rechunk_transfer return _get_worker_extension().add_partition( File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_worker_extension.py", line 628, in add_partition shuffle = self.get_or_create_shuffle(shuffle_id, type=type, **kwargs) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_worker_extension.py", line 879, in get_or_create_shuffle return sync( File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/utils.py", line 405, in sync distributed.core - ERROR - Exception while handling op shuffle_get_or_create Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 139, in get_or_create return self.get(id, worker) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 127, in get state = self.states[id] KeyError: '290a0eef76b7ab6e389634324025588a' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 818, in _handle_comm result = handler(**msg) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 145, in 
get_or_create state = self._create_array_rechunk_state(id, spec) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 204, in _create_array_rechunk_state for ts in self.scheduler.tasks[name].dependents: KeyError: 'shuffle-barrier-290a0eef76b7ab6e389634324025588a' raise exc.with_traceback(tb) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/utils.py", line 378, in f result = yield future File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/tornado/gen.py", line 769, in run value = future.result() File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_worker_extension.py", line 709, in _get_or_create_shuffle shuffle = await self._refresh_shuffle( File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_worker_extension.py", line 768, in _refresh_shuffle result = await self.worker.scheduler.shuffle_get_or_create( File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 1227, in send_recv_from_rpc return await send_recv(comm=comm, op=key, **kwargs) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 1011, in send_recv raise exc.with_traceback(tb) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 818, in _handle_comm result = handler(**msg) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 145, in get_or_create state = self._create_array_rechunk_state(id, spec) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 204, in _create_array_rechunk_state for ts in self.scheduler.tasks[name].dependents: KeyError: 'shuffle-barrier-290a0eef76b7ab6e389634324025588a' The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/groups/cellmap/home/bennettd/dev/cosem-flows/test_rechunking.py", line 54, in cl.compute(mean_p2p, sync=True) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/client.py", line 3351, in compute result = self.gather(futures) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/client.py", line 2305, in gather return self.sync( File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/utils.py", line 338, in sync return sync( File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/utils.py", line 405, in sync raise exc.with_traceback(tb) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/utils.py", line 378, in f result = yield future File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/tornado/gen.py", line 769, in run value = future.result() File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/client.py", line 2168, in _gather raise exception.with_traceback(traceback) File 
"/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_rechunk.py", line 50, in rechunk_transfer raise RuntimeError("rechunk_transfer failed during shuffle {id}") from e RuntimeError: rechunk_transfer failed during shuffle {id} /groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/dask_jobqueue/core.py:237: FutureWarning: extra has been renamed to worker_extra_args. You are still using it (even if only set to []; please also check config files). If you did not set worker_extra_args yet, extra will be respected for now, but it will be removed in a future release. If you already set worker_extra_args, extra is ignored and you can remove it. warnings.warn(warn, FutureWarning) /groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/dask_jobqueue/core.py:255: FutureWarning: job_extra has been renamed to job_extra_directives. You are still using it (even if only set to []; please also check config files). If you did not set job_extra_directives yet, job_extra will be respected for now, but it will be removed in a future release. If you already set job_extra_directives, job_extra is ignored and you can remove it. warnings.warn(warn, FutureWarning) distributed.core - ERROR - Exception while handling op shuffle_get_or_create Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 139, in get_or_create return self.get(id, worker) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 127, in get state = self.states[id] KeyError: '290a0eef76b7ab6e389634324025588a' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 818, in _handle_comm result = handler(**msg) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 145, in get_or_create state = self._create_array_rechunk_state(id, spec) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 204, in _create_array_rechunk_state for ts in self.scheduler.tasks[name].dependents: KeyError: 'shuffle-barrier-290a0eef76b7ab6e389634324025588a' ```

Any ideas what could be going wrong here?

jrbourbeau commented 1 year ago

Thanks for the feedback @d-v-b!

I got an import error due to pyarrow missing from my python environment when I ran cl.compute. The error should probably happen earlier, e.g. as soon as I select algorithm=p2p in da.rechunk.

Yeah, I'm able to reproduce this behavior. I agree it'd be nice if we could raise at graph construction time.
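(For illustration, a graph-construction-time check might look roughly like the sketch below; `validate_p2p_dependencies` is a hypothetical helper, not dask's actual code.)

```python
import importlib.util


def validate_p2p_dependencies() -> None:
    """Hypothetical helper: fail while the graph is being built, not inside compute()."""
    if importlib.util.find_spec("pyarrow") is None:
        raise ImportError(
            "rechunk(..., algorithm='p2p') currently needs pyarrow; "
            "install it or use algorithm='tasks'"
        )


# e.g. called at the top of rechunk() when algorithm='p2p' is requested,
# so users see the error at graph construction time instead of inside cl.compute()
validate_p2p_dependencies()
```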

Any ideas what could be going wrong here?

Hrm, I'm not sure. cc @hendrikmakait @fjetter for visibility

fjetter commented 1 year ago

Thanks for trying this and reporting back.

The pyarrow dependency for rechunking is accidental; pyarrow is actually not required for rechunking, which is why we're not raising early.

I can also reproduce the other error. This one looks like some kind of fusion is happening even though dask.config.set({'optimization.fuse.active': False}) is set:

Original graph (image: fused_rechunk_no_opt)

Optimized graph (image: fused_rechunk)

This is a bit unfortunate but possible to avoid. The problem is that the optimization is actually triggered at compute time, not at graph construction time, i.e. you need to include the client.compute() call in the context manager of dask.config.set({'optimization.fuse.active': False}). We'll need to update our docs for this (and it should only be temporary).
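Concretely, using the names from the reproducer above (multi, dest_chunks, cl are assumed to be defined as in that script), the working pattern is roughly:

```python
import dask

# keep both graph construction *and* the compute call inside the no-fusion context,
# so the optimization pass triggered by cl.compute() also sees fusion disabled
with dask.config.set({'optimization.fuse.active': False}):
    rechunked_p2p = []
    for m in multi:
        # only rechunk if the chunks are too small
        if any(c1 < c2 for c1, c2 in zip(m.chunksize, dest_chunks)):
            rechunked_p2p.append(m.rechunk(dest_chunks, algorithm='p2p'))
        else:
            rechunked_p2p.append(m)
    mean_p2p = [m.mean() for m in rechunked_p2p]
    cl.compute(mean_p2p, sync=True)
```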

fjetter commented 1 year ago

I opened https://github.com/dask/distributed/issues/7602 for the pyarrow import, which we can close quickly. Getting rid of the fusion is a bit trickier.

d-v-b commented 1 year ago

OK, I got things working on my end by putting the compute call under the no-fusion context manager. I'm happy to close this unless there's value in keeping it open for tracking purposes.

d-v-b commented 1 year ago

I have new broken code for you :)

code

```python
import dask.array as da
import numpy as np
import dask
import time
from distributed import Client
from dask_jobqueue import LSFCluster
import random
import zarr

if __name__ == '__main__':
    num_cores = 8
    cluster = LSFCluster(
        cores=num_cores,
        processes=1,
        memory=f"{15 * num_cores}GB",
        ncpus=num_cores,
        mem=15 * num_cores,
        walltime="72:00",
    )
    cluster.scale(10)
    cl = Client(cluster)
    print(f"Begin distributed operations. Dask dashboard url: {cl.dashboard_link}")

    shape = (533, 1946, 2625)
    read_chunks = (512,) * 3
    dest_chunks = (64,) * 3
    source = da.random.randint(0, 255, shape, chunks=read_chunks, dtype='uint8')
    # modify this
    dest = '/nrs/cellmap/bennettd/scratch/test.zarr/'

    multi = [source]
    num_levels = 8
    for level in range(num_levels):
        multi.append(da.coarsen(np.mean, multi[-1], {d: 2 for d in range(source.ndim)}, trim_excess=True))

    store = zarr.NestedDirectoryStore(dest)
    group = zarr.open(store, mode='a')

    start = time.time()
    with dask.config.set({'optimization.fuse.active': False}):
        targets = [
            group.create(shape=m.shape, name=str(idx), dtype=m.dtype, chunks=dest_chunks, overwrite=True)
            for idx, m in enumerate(multi)
        ]
        to_store = []
        for m in multi:
            # only rechunk if the chunks are too small
            if any(c1 < c2 for c1, c2 in zip(m.chunksize, dest_chunks)):
                to_store.append(m.rechunk(dest_chunks, algorithm='p2p'))
            else:
                to_store.append(m)
        cl.compute(da.store(to_store, targets, compute=False, lock=None), sync=True)
    print(f"Completed after {time.time() - start} s")
```
Traceback ```bash distributed.system - DEBUG - Setting system memory limit based on cgroup value defined in /sys/fs/cgroup/memory/memory.soft_limit_in_bytes Begin distributed operations. Dask dashboard url: http://10.37.7.134:8787/status distributed.core - ERROR - Exception while handling op shuffle_get_or_create Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 139, in get_or_create return self.get(id, worker) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 127, in get state = self.states[id] KeyError: 'b27398dfd5514a812abb4e92a5bfe200' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 818, in _handle_comm result = handler(**msg) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 145, in get_or_create state = self._create_array_rechunk_state(id, spec) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 205, in _create_array_rechunk_state part = ts.annotations["shuffle"] KeyError: 'shuffle' distributed.core - ERROR - Exception while handling op shuffle_get_or_create Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 139, in get_or_create return self.get(id, worker) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 127, in get state = self.states[id] KeyError: 'b27398dfd5514a812abb4e92a5bfe200' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 818, in _handle_comm result = handler(**msg) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 145, in get_or_create state = self._create_array_rechunk_state(id, spec) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 205, in _create_array_rechunk_state part = ts.annotations["shuffle"] KeyError: 'shuffle' distributed.core - ERROR - Exception while handling op shuffle_get_or_create Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 139, in get_or_create return self.get(id, worker) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 127, in get state = self.states[id] KeyError: 'b27398dfd5514a812abb4e92a5bfe200' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 818, in _handle_comm result = handler(**msg) File 
"/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 145, in get_or_create state = self._create_array_rechunk_state(id, spec) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 205, in _create_array_rechunk_state part = ts.annotations["shuffle"] KeyError: 'shuffle' distributed.core - ERROR - Exception while handling op shuffle_get_or_create Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 139, in get_or_create return self.get(id, worker) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 127, in get state = self.states[id] KeyError: 'b27398dfd5514a812abb4e92a5bfe200' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 818, in _handle_comm result = handler(**msg) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 145, in get_or_create state = self._create_array_rechunk_state(id, spec) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 205, in _create_array_rechunk_state part = ts.annotations["shuffle"] KeyError: 'shuffle' distributed.core - ERROR - Exception while handling op shuffle_get_or_create Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 139, in get_or_create return self.get(id, worker) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 127, in get state = self.states[id] KeyError: 'b27398dfd5514a812abb4e92a5bfe200' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 818, in _handle_comm result = handler(**msg) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 145, in get_or_create state = self._create_array_rechunk_state(id, spec) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 205, in _create_array_rechunk_state part = ts.annotations["shuffle"] KeyError: 'shuffle' distributed.core - ERROR - Exception while handling op shuffle_get_or_create Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 139, in get_or_create return self.get(id, worker) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 127, in get state = self.states[id] KeyError: '72434a667066535379a6086491d28387' During handling of the above exception, another exception occurred: Traceback 
(most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 818, in _handle_comm result = handler(**msg) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 145, in get_or_create state = self._create_array_rechunk_state(id, spec) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 205, in _create_array_rechunk_state part = ts.annotations["shuffle"] KeyError: 'shuffle' Traceback (most recent call last): File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_rechunk.py", line 41, in rechunk_transfer return _get_worker_extension().add_partition( File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_worker_extension.py", line 628, in add_partition shuffle = self.get_or_create_shuffle(shuffle_id, type=type, **kwargs) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_worker_extension.py", line 879, in get_or_create_shuffle return sync( File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/utils.py", line 405, in sync raise exc.with_traceback(tb) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/utils.py", line 378, in f result = yield future File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/tornado/gen.py", line 769, in run value = future.result() File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_worker_extension.py", line 709, in _get_or_create_shuffle shuffle = await self._refresh_shuffle( File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_worker_extension.py", line 768, in _refresh_shuffle result = await self.worker.scheduler.shuffle_get_or_create( File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 1227, in send_recv_from_rpc return await send_recv(comm=comm, op=key, **kwargs) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 1011, in send_recv raise exc.with_traceback(tb) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 818, in _handle_comm result = handler(**msg) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 145, in get_or_create state = self._create_array_rechunk_state(id, spec) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 205, in _create_array_rechunk_state part = ts.annotations["shuffle"] KeyError: 'shuffle' The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/groups/cellmap/home/bennettd/dev/cosem-flows/test_rechunking_2.py", line 53, in cl.compute(da.store(to_store, targets, compute=False, lock=None), sync=True) File 
"/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/client.py", line 3351, in compute result = self.gather(futures) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/client.py", line 2305, in gather return self.sync( File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/utils.py", line 338, in sync return sync( File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/utils.py", line 405, in sync raise exc.with_traceback(tb) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/utils.py", line 378, in f result = yield future File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/tornado/gen.py", line 769, in run value = future.result() File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/client.py", line 2168, in _gather raise exception.with_traceback(traceback) File "/groups/cellmap/home/bennettd/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_rechunk.py", line 50, in rechunk_transfer raise RuntimeError("rechunk_transfer failed during shuffle {id}") from e RuntimeError: rechunk_transfer failed during shuffle {id} ```
d-v-b commented 1 year ago

beyond the RuntimeError, it looks like there's an f-string that isn't getting its f prefix: raise RuntimeError("rechunk_transfer failed during shuffle {id}") from e
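For illustration (shuffle_id below is just a placeholder value), the missing prefix leaves the braces in the message literally instead of interpolating the id:

```python
shuffle_id = "290a0eef76b7ab6e389634324025588a"

# without the f prefix, the message contains the literal text "{id}"
print("rechunk_transfer failed during shuffle {id}")
# with the f prefix, the actual id is interpolated into the message
print(f"rechunk_transfer failed during shuffle {shuffle_id}")
```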

jrbourbeau commented 1 year ago

beyond the RuntimeError

Thanks @d-v-b. That should be fixed by https://github.com/dask/distributed/pull/7600 (included in the patch release pushed out today https://github.com/dask/community/issues/309)

fjetter commented 1 year ago

@d-v-b do you still encounter this error after upgrading? I can no longer reproduce this on the latest version.

If you are still running into this, can you please try to slim down your code example? For instance, does it still fail without the LSFCluster (e.g. on a LocalCluster) or without zarr?

The smaller the example, the easier it is for us to help.

d-v-b commented 1 year ago

@fjetter after upgrading to the latest version of distributed I still get the KeyError: 'shuffle' when running my second example. I switched out the LSF cluster for LocalCluster, but somehow the call to da.store seems crucial to generating the error -- I can't elicit an error by simply calling .compute() or .mean() on all the multiscale arrays directly.

Edit: I can remove zarr and just use a fake target for da.store, and I still get the error.

code (basically identical to the earlier example, sans LSFCluster and zarr)

```python
import dask.array as da
import numpy as np
import dask
import time
import distributed
from distributed import Client, LocalCluster

class VoidStore():
    """
    A class that implements setitem as a no-op
    """
    def __setitem__(*args):
        pass

if __name__ == '__main__':
    print(f'dask version: {dask.__version__}')
    print(f'distributed version: {distributed.__version__}')
    num_cores = 8
    cluster = LocalCluster(n_workers=0)
    cluster.scale(10)
    cl = Client(cluster)
    print(f"Begin distributed operations. Dask dashboard url: {cl.dashboard_link}")

    shape = (1024, 1024, 1024)
    read_chunks = (512,) * 3
    dest_chunks = (64,) * 3
    source = da.random.randint(0, 255, shape, chunks=read_chunks, dtype='uint8')

    multi = [source]
    num_levels = 8
    for level in range(num_levels):
        multi.append(da.coarsen(np.mean, multi[-1], {d: 2 for d in range(source.ndim)}, trim_excess=True))

    start = time.time()
    with dask.config.set({'optimization.fuse.active': False}):
        to_store = []
        for m in multi:
            # only rechunk if the chunks are too small
            if any(c1 < c2 for c1, c2 in zip(m.chunksize, dest_chunks)):
                to_store.append(m.rechunk(dest_chunks, algorithm='p2p'))
            else:
                to_store.append(m)
        cl.compute(da.store(to_store, [VoidStore() for t in to_store], compute=False, lock=None), sync=True)
    print(f"Completed after {time.time() - start} s")
```
tail of the traceback ```bash Traceback (most recent call last): File "/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_rechunk.py", line 41, in rechunk_transfer return _get_worker_extension().add_partition( File "/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_worker_extension.py", line 628, in add_partition shuffle = self.get_or_create_shuffle(shuffle_id, type=type, **kwargs) File "/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_worker_extension.py", line 879, in get_or_create_shuffle return sync( File "/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/utils.py", line 405, in sync raise exc.with_traceback(tb) File "/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/utils.py", line 378, in f result = yield future File "/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/tornado/gen.py", line 769, in run value = future.result() File "/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_worker_extension.py", line 709, in _get_or_create_shuffle shuffle = await self._refresh_shuffle( File "/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_worker_extension.py", line 768, in _refresh_shuffle result = await self.worker.scheduler.shuffle_get_or_create( File "/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 1227, in send_recv_from_rpc return await send_recv(comm=comm, op=key, **kwargs) File "/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 1011, in send_recv raise exc.with_traceback(tb) File "/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/core.py", line 818, in _handle_comm result = handler(**msg) File "/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 145, in get_or_create state = self._create_array_rechunk_state(id, spec) File "/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_scheduler_extension.py", line 205, in _create_array_rechunk_state part = ts.annotations["shuffle"] KeyError: 'shuffle' The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/groups/cellmap/home/bennettd/dev/cosem-flows/test_rechunking_2.py", line 45, in cl.compute(da.store(to_store, [VoidStore() for t in to_store], compute=False, lock=None), sync=True) File "/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/client.py", line 3351, in compute result = self.gather(futures) File "/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/client.py", line 2305, in gather return self.sync( File "/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/utils.py", line 338, in sync return sync( File "/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/utils.py", line 405, in sync raise exc.with_traceback(tb) File "/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/utils.py", line 378, in f result = yield future File 
"/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/tornado/gen.py", line 769, in run value = future.result() File "/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/client.py", line 2168, in _gather raise exception.with_traceback(traceback) File "/home/bennettd@hhmi.org/miniconda3/envs/cosem-flows/lib/python3.10/site-packages/distributed/shuffle/_rechunk.py", line 50, in rechunk_transfer raise RuntimeError(f"rechunk_transfer failed during shuffle {id}") from e RuntimeError: rechunk_transfer failed during shuffle 30ad498135b57ef5c819d783b368ebbd 2023-03-03 13:14:35,866 - distributed.worker - WARNING - Compute Failed Key: ('rechunk-transfer-8b587d0b25fc40ba4c13136f1caa6a2c', 0, 0, 1) Function: rechunk_transfer args: (array([[[126.99784064, 127.01553392], [127.01201648, 127.0053497 ]], [[126.96722287, 127.013165 ], [127.02017999, 126.96109843]]]), '8b587d0b25fc40ba4c13136f1caa6a2c', (0, 0, 1), ((4,), (4,), (4,)), ((2, 2), (2, 2), (2, 2))) kwargs: {} Exception: "RuntimeError('rechunk_transfer failed during shuffle 8b587d0b25fc40ba4c13136f1caa6a2c')" 2023-03-03 13:14:35,877 - distributed.worker - WARNING - Compute Failed Key: ('rechunk-transfer-d9516d6f326cacb3edf1564732397e12', 0, 0, 1) Function: rechunk_transfer args: (array([[[126.89551163, 127.16296005, 126.79346848, 127.07039642, 126.92192078, 127.01990128, 127.00488663, 127.10905075], [126.8924675 , 126.82146072, 127.05706024, 127.00632477, 126.97221756, 127.20618057, 126.89624405, 127.13584137], [127.0807991 , 126.96739197, 127.22413254, 126.77736282, 127.2110939 , 127.13186646, 126.81061935, 126.97898483], [126.72937393, 127.10009003, 127.10428619, 126.88643265, 126.91358185, 127.11713028, 126.94958115, 126.89729691], [126.84883118, 126.81511688, 127.35725784, 126.9691925 , 126.9767189 , 127.06669617, 127.04648209, 126.562397 ], [126.9746666 , 126.85813522, 126.88068771, 126.86870193, 126.92802811, 126.8413887 , 127.05707932, 126.95536423], [127.10054016, 126.96621704, 127.10403061, 126.96498108, 127.19545364, 126.94260406, 127.19434357, 126.94477463], [126.84261703, 127.3578186 , 126.94240189, 126.69282913, 126.9848 kwargs: {} Exception: "RuntimeError('rechunk_transfer failed during shuffle d9516d6f326cacb3edf1564732397e12')" ```
fjetter commented 1 year ago

Thanks @d-v-b, this is helpful. I can reproduce with this, but I can't offer a straightforward solution, and we'll need to investigate a bit further.

hendrikmakait commented 1 year ago

I've been able to reduce the example further:

import dask.array as da
import dask
from distributed import Client, LocalCluster

class VoidStore():
    """
    A class that implements setitem as a no-op
    """
    def __setitem__(*args):
        pass

cluster = LocalCluster()
client = Client(cluster)
source = da.random.randint(0, 255, (4, ), chunks=(2, 2))
with dask.config.set({'optimization.fuse.active': False}):
    rechunked = source.rechunk((3, 1), algorithm='p2p')
    stored = da.store(rechunked, VoidStore(), lock=None)

The call to da.store causes the error (likely through unwanted optimization). I'll keep investigating.
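One way to see this is to build the store graph without computing it and check whether the rechunk layers still carry their shuffle annotations. This is only a sketch: it reuses `rechunked` and `VoidStore` from the example above, `compute=False` is added purely so the graph can be inspected, and the exact layer names and structure vary by dask version.

```python
# build the store graph without triggering the compute
stored = da.store(rechunked, VoidStore(), lock=None, compute=False)
graph = stored.__dask_graph__()

# if the p2p transfer layers lost their {"shuffle": ...} annotations inside store(),
# the scheduler plugin can't look them up later, matching the KeyError: 'shuffle'
for name, layer in graph.layers.items():
    print(name, layer.annotations)
```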

fjetter commented 1 year ago

@hendrikmakait reading the store code I'm fairly certain we're losing annotations here https://github.com/dask/dask/blob/c9a9edef0c7e996ef72cb474bf030778189fd737/dask/array/core.py#L1202-L1211 because we're switching from a HighLevelGraph to a low-level graph without carrying over any annotations.
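To illustrate the mechanism with a minimal sketch (independent of the store code; the annotation key here is arbitrary): annotations live on HighLevelGraph layers, and materializing the graph into a plain low-level mapping leaves no place for them.

```python
import dask
import dask.array as da

x = da.ones((4,), chunks=(2,))
# annotations set via dask.annotate attach to the HLG layers created inside the block
with dask.annotate(example="p2p-style annotation"):
    y = x + 1

hlg = y.__dask_graph__()
# each layer carries its own annotations dict (or None)
print({name: layer.annotations for name, layer in hlg.layers.items()})

# materializing to a plain key -> task mapping keeps the tasks,
# but there is nowhere left to store per-layer annotations
flat = dict(hlg)
print(len(flat), "tasks, no annotations attached")
```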

maawoo commented 10 months ago

Dear all,

I'm getting a similar KeyError:

2024-01-25 09:41:18,926 - distributed.core - ERROR - Exception while handling op shuffle_get
Traceback (most recent call last):
  File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/core.py", line 967, in _handle_comm
    result = handler(**msg)
             ^^^^^^^^^^^^^^
  File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_scheduler_plugin.py", line 128, in get
    state = self.active_shuffles[id]
            ~~~~~~~~~~~~~~~~~~~~^^^^
KeyError: 'd6f8b2d9ed84fca0a899360d7ddf2b2b'
2024-01-25 09:41:18,937 - distributed.core - ERROR - Exception while handling op shuffle_get
Traceback (most recent call last):
  File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/core.py", line 967, in _handle_comm
    result = handler(**msg)
             ^^^^^^^^^^^^^^
  File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_scheduler_plugin.py", line 128, in get
    state = self.active_shuffles[id]
            ~~~~~~~~~~~~~~~~~~~~^^^^
KeyError: 'd6f8b2d9ed84fca0a899360d7ddf2b2b'
2024-01-25 09:45:08,520 - distributed.core - ERROR - Exception while handling op shuffle_get
Traceback (most recent call last):
  File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/core.py", line 967, in _handle_comm
    result = handler(**msg)
             ^^^^^^^^^^^^^^
  File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_scheduler_plugin.py", line 128, in get
    state = self.active_shuffles[id]
            ~~~~~~~~~~~~~~~~~~~~^^^^
KeyError: 'd6f8b2d9ed84fca0a899360d7ddf2b2b'

... which looks like this in the dashboard once it occurs:

https://github.com/dask/distributed/assets/56583917/68cdf886-910d-447f-9c5f-698ae8e6a172

(Partial) Worker log
``` 2024-01-25 13:53:05,381 - distributed.core - INFO - Event loop was unresponsive in Worker for 9.78s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability. 2024-01-25 13:53:05,666 - distributed.core - INFO - Event loop was unresponsive in Worker for 9.82s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability. 2024-01-25 13:53:06,992 - distributed.worker - ERROR - failed during get data with tcp://172.18.1.10:42971 -> tcp://172.18.1.10:37101 Traceback (most recent call last): File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/comm/tcp.py", line 298, in write raise StreamClosedError() tornado.iostream.StreamClosedError: Stream is closed The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/worker.py", line 1779, in get_data compressed = await comm.write(msg, serializers=serializers) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/comm/tcp.py", line 309, in write convert_stream_closed_error(self, e) File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc}") from exc distributed.comm.core.CommClosedError: in : Stream is closed 2024-01-25 13:53:06,994 - distributed.core - INFO - Lost connection to 'tcp://172.18.1.10:56702' Traceback (most recent call last): File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/comm/tcp.py", line 298, in write raise StreamClosedError() tornado.iostream.StreamClosedError: Stream is closed The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/core.py", line 969, in _handle_comm result = await result ^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/core.py", line 1116, in wrapper return await func(self, *args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/worker.py", line 1779, in get_data compressed = await comm.write(msg, serializers=serializers) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/comm/tcp.py", line 309, in write convert_stream_closed_error(self, e) File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc}") from exc distributed.comm.core.CommClosedError: in : Stream is closed 2024-01-25 13:53:07,047 - distributed.utils_comm - INFO - Retrying functools.partial(>, 'tcp://172.18.1.10:37101', [((256, 0, 0), [((0, 0, 6), ((256, 0, 0), array([[[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]]], dtype=float32))), ((0, 1, 6), ((256, 0, 0), array([[[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, 
nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]]], dtype=float32))), ((0, 2, 6), ((256, 0, 0), array([[[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]]], dtype=float32))), ((0, 3, 6), ((256, 0, 0), array([[[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]]], dtype=float32))), ((0, 4, 6), ((256, 0, 0), array([[[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]]], dtype=float32))), ((0, 5, 5), ((256, 0, 0), array([[[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]]], dtype=float32))), ((0, 5, 6), ((256, 0, 0), array([[[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]]], dtype=float32))), ((0, 6, 5), ((256, 0, 0), array([[[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]]], dtype=float32))), ((0, 39, 5), ((256, 0, 0), array([[[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]]], dtype=float32)))])]) after exception in attempt 0/10: in : BrokenPipeError: [Errno 32] Broken pipe 2024-01-25 13:53:07,132 - distributed.core - INFO - Event loop was unresponsive in Worker for 6.69s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability. 2024-01-25 13:53:07,137 - distributed.core - INFO - Event loop was unresponsive in Worker for 4.79s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability. 
2024-01-25 13:53:07,402 - distributed.worker - ERROR - Worker stream died during communication: tcp://172.18.1.10:37101 Traceback (most recent call last): File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/comm/tcp.py", line 227, in read frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/comm/tcp.py", line 368, in read_bytes_rw actual = await stream.read_into(chunk) # type: ignore[arg-type] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ tornado.iostream.StreamClosedError: Stream is closed The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/worker.py", line 2056, in gather_dep response = await get_data_from_worker( ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/worker.py", line 2860, in get_data_from_worker response = await send_recv( ^^^^^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/core.py", line 1153, in send_recv response = await comm.read(deserializers=deserializers) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/comm/tcp.py", line 237, in read convert_stream_closed_error(self, e) File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc}") from exc distributed.comm.core.CommClosedError: in Worker for gather local=tcp://172.18.1.10:34810 remote=tcp://172.18.1.10:37101>: Stream is closed 2024-01-25 13:53:07,658 - distributed.shuffle._comms - ERROR - run_id=1 stale, got ArrayRechunkRun on tcp://172.18.1.10:33731 Traceback (most recent call last): File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/utils.py", line 832, in wrapper return await func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_comms.py", line 72, in _process await self.send(address, shards) File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_core.py", line 215, in send return await retry( ^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/utils_comm.py", line 424, in retry return await coro() ^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_core.py", line 196, in _send return await self.rpc(address).shuffle_receive( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/core.py", line 1394, in send_recv_from_rpc return await send_recv(comm=comm, op=key, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/core.py", line 1178, in send_recv raise exc.with_traceback(tb) File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/core.py", line 969, in _handle_comm result = await result ^^^^^^^^^^^^^^^^^ File 
"/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_worker_plugin.py", line 314, in shuffle_receive shuffle_run = await self._get_shuffle_run(shuffle_id, run_id) ^^^^^^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_worker_plugin.py", line 368, in _get_shuffle_run return await self.shuffle_runs.get_with_run_id( ^^^^^^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_worker_plugin.py", line 121, in get_with_run_id raise RuntimeError(f"{run_id=} stale, got {shuffle_run}") ^^^^^^^^^^^^^^^^^ RuntimeError: run_id=1 stale, got ArrayRechunkRun on tcp://172.18.1.10:33731 2024-01-25 13:53:07,794 - distributed.worker.memory - WARNING - Worker is at 82% memory usage. Pausing worker. Process memory: 4.11 GiB -- Worker memory limit: 5.00 GiB 2024-01-25 13:53:08,620 - distributed.worker - ERROR - failed during get data with tcp://172.18.1.10:37585 -> tcp://172.18.1.10:37101 Traceback (most recent call last): File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/tornado/iostream.py", line 962, in _handle_write num_bytes = self.write_to_fd(self._write_buffer.peek(size)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/tornado/iostream.py", line 1124, in write_to_fd return self.socket.send(data) # type: ignore ^^^^^^^^^^^^^^^^^^^^^^ BrokenPipeError: [Errno 32] Broken pipe The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/worker.py", line 1780, in get_data response = await comm.read(deserializers=serializers) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/comm/tcp.py", line 237, in read convert_stream_closed_error(self, e) File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/comm/tcp.py", line 140, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc distributed.comm.core.CommClosedError: in : BrokenPipeError: [Errno 32] Broken pipe 2024-01-25 13:53:08,719 - distributed.shuffle._comms - ERROR - run_id=1 stale, got ArrayRechunkRun on tcp://172.18.1.10:38385 Traceback (most recent call last): File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/utils.py", line 832, in wrapper return await func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_comms.py", line 72, in _process await self.send(address, shards) File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_core.py", line 215, in send return await retry( ^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/utils_comm.py", line 424, in retry return await coro() ^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_core.py", line 196, in _send return await self.rpc(address).shuffle_receive( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/core.py", line 1394, in send_recv_from_rpc return await send_recv(comm=comm, op=key, **kwargs) 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/core.py", line 1178, in send_recv raise exc.with_traceback(tb) File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/core.py", line 969, in _handle_comm result = await result ^^^^^^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_worker_plugin.py", line 314, in shuffle_receive shuffle_run = await self._get_shuffle_run(shuffle_id, run_id) ^^^^^^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_worker_plugin.py", line 368, in _get_shuffle_run return await self.shuffle_runs.get_with_run_id( ^^^^^^^^^^^^^^^^^ File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_worker_plugin.py", line 121, in get_with_run_id raise RuntimeError(f"{run_id=} stale, got {shuffle_run}") ^^^^^^^^^^^^^^^^^ RuntimeError: run_id=1 stale, got ArrayRechunkRun on tcp://172.18.1.10:38385 ```

Unfortunately, I've not been able to create a minimal example that is reproducible on a LocalCluster. I'm working with satellite imagery that is stored on a fileserver connected to the compute nodes of my university's HPC cluster. The imagery is mirrored from this data product, and I'm using this code to start a dask_jobqueue.SLURMCluster.


I've been able to (kind of (*)) reproduce this using the following code, but still only on our HPC cluster and not locally. Instead of loading the satellite imagery from our fileserver, it streams a similar data product from a cloud resource using planetary-computer and odc-stac:

Example code
```python
import pystac_client
import planetary_computer
from odc.stac import load as odc_stac_load
import xarray as xr
import numpy as np

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)

time_range = "2020-01-01/2024-01-01"
bbox = (31.15, -25.55, 31.95, -24.15)

search = catalog.search(collections=["sentinel-2-l2a"], bbox=bbox, datetime=time_range)
items = search.item_collection()

params = {"crs": "EPSG:4326",
          "resolution": 0.0002,
          "resampling": "bilinear",
          "chunks": {'time': 1, 'latitude': 'auto', 'longitude': 'auto'}}
data = odc_stac_load(items=items, bands=['B12'], bbox=bbox, dtype='uint16', **params)
data_mask = odc_stac_load(items=items, bands=['SCL'], bbox=bbox, dtype='uint8', **params)

valid = ((data_mask.SCL == 2) |
         (data_mask.SCL > 3) & (data_mask.SCL <= 7) |
         (data_mask.SCL == 11))
data = xr.where(valid, data, np.nan).astype("float32")

data = data.chunk({'time': -1, 'latitude': 'auto', 'longitude': 'auto'})
data.coords['time'] = data.time.dt.round('1h')
data = data.groupby('time').mean(skipna=True)

n_valid_px = data.B12.notnull().sum(dim="time").compute()
```
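The rechunk in this example comes from the data.chunk(...) call. If it helps with debugging, the rechunk method can also be pinned explicitly; a minimal sketch, assuming the array.rechunk.method config option available in recent dask releases:

```python
import dask

# Assumption: recent dask versions select the rechunk algorithm via this config key;
# "p2p" forces peer-to-peer rechunking, "tasks" falls back to the task-based algorithm.
with dask.config.set({"array.rechunk.method": "p2p"}):
    data = data.chunk({'time': -1, 'latitude': 'auto', 'longitude': 'auto'})
```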

(*) kind of, because the error is slightly different and also mentions shuffle_restrict_task:

Slightly different error traceback
```
2024-01-25 13:40:54,629 - distributed.core - ERROR - Exception while handling op shuffle_restrict_task
Traceback (most recent call last):
  File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/core.py", line 967, in _handle_comm
    result = handler(**msg)
             ^^^^^^^^^^^^^^
  File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_scheduler_plugin.py", line 102, in restrict_task
    shuffle = self.active_shuffles[id]
              ~~~~~~~~~~~~~~~~~~~~^^^^
KeyError: '5971f6e58ff2f2162ebf83479df64dc8'
2024-01-25 13:40:57,274 - distributed.core - ERROR - Exception while handling op shuffle_get
Traceback (most recent call last):
  File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/core.py", line 967, in _handle_comm
    result = handler(**msg)
             ^^^^^^^^^^^^^^
  File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_scheduler_plugin.py", line 128, in get
    state = self.active_shuffles[id]
            ~~~~~~~~~~~~~~~~~~~~^^^^
KeyError: '5971f6e58ff2f2162ebf83479df64dc8'
2024-01-25 13:40:57,283 - distributed.core - ERROR - Exception while handling op shuffle_get
Traceback (most recent call last):
  File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/core.py", line 967, in _handle_comm
    result = handler(**msg)
             ^^^^^^^^^^^^^^
  File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_scheduler_plugin.py", line 128, in get
    state = self.active_shuffles[id]
...
  File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_scheduler_plugin.py", line 128, in get
    state = self.active_shuffles[id]
            ~~~~~~~~~~~~~~~~~~~~^^^^
KeyError: '5971f6e58ff2f2162ebf83479df64dc8'
```

The error also happens at a different point in the workflow (after the shuffle barrier), as you can see in this dashboard recording:

https://github.com/dask/distributed/assets/56583917/2131df59-442c-43bd-836f-1d2234b1c673

Here is also the log output of one of the workers, which includes an additional P2P-related warning:

Worker Log
``` 2024-01-25 13:36:44,533 - distributed.nanny - INFO - Start Nanny at: 'tcp://172.18.1.11:46523' 2024-01-25 13:36:44,574 - distributed.nanny - INFO - Start Nanny at: 'tcp://172.18.1.11:37847' 2024-01-25 13:36:44,593 - distributed.nanny - INFO - Start Nanny at: 'tcp://172.18.1.11:39251' 2024-01-25 13:36:44,614 - distributed.nanny - INFO - Start Nanny at: 'tcp://172.18.1.11:40303' 2024-01-25 13:36:44,641 - distributed.nanny - INFO - Start Nanny at: 'tcp://172.18.1.11:43475' 2024-01-25 13:36:44,649 - distributed.nanny - INFO - Start Nanny at: 'tcp://172.18.1.11:36103' 2024-01-25 13:36:44,683 - distributed.nanny - INFO - Start Nanny at: 'tcp://172.18.1.11:46453' 2024-01-25 13:36:44,735 - distributed.nanny - INFO - Start Nanny at: 'tcp://172.18.1.11:41633' 2024-01-25 13:36:47,181 - distributed.diskutils - INFO - Found stale lock file and directory '/scratch/du23yow/dask-scratch-space/worker-b6x5usvt', purging 2024-01-25 13:36:47,182 - distributed.diskutils - INFO - Found stale lock file and directory '/scratch/du23yow/dask-scratch-space/worker-fks2pkot', purging 2024-01-25 13:36:47,183 - distributed.diskutils - INFO - Found stale lock file and directory '/scratch/du23yow/dask-scratch-space/worker-fcn1c2js', purging 2024-01-25 13:36:47,184 - distributed.diskutils - INFO - Found stale lock file and directory '/scratch/du23yow/dask-scratch-space/worker-4k3arav2', purging 2024-01-25 13:36:47,185 - distributed.diskutils - INFO - Found stale lock file and directory '/scratch/du23yow/dask-scratch-space/worker-ck2f8n04', purging 2024-01-25 13:36:47,185 - distributed.diskutils - INFO - Found stale lock file and directory '/scratch/du23yow/dask-scratch-space/worker-ne4ntds2', purging 2024-01-25 13:36:47,186 - distributed.diskutils - INFO - Found stale lock file and directory '/scratch/du23yow/dask-scratch-space/worker-q12_ymo0', purging 2024-01-25 13:36:47,187 - distributed.diskutils - INFO - Found stale lock file and directory '/scratch/du23yow/dask-scratch-space/worker-ihfc7fv6', purging 2024-01-25 13:36:48,136 - distributed.worker - INFO - Start worker at: tcp://172.18.1.11:46241 2024-01-25 13:36:48,137 - distributed.worker - INFO - Listening to: tcp://172.18.1.11:46241 2024-01-25 13:36:48,138 - distributed.worker - INFO - Worker name: SLURMCluster-0-1 2024-01-25 13:36:48,139 - distributed.worker - INFO - dashboard at: 172.18.1.11:35053 2024-01-25 13:36:48,139 - distributed.worker - INFO - Start worker at: tcp://172.18.1.11:43707 2024-01-25 13:36:48,140 - distributed.worker - INFO - Waiting to connect to: tcp://172.18.10.2:37891 2024-01-25 13:36:48,141 - distributed.worker - INFO - Listening to: tcp://172.18.1.11:43707 2024-01-25 13:36:48,142 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:48,143 - distributed.worker - INFO - Worker name: SLURMCluster-0-7 2024-01-25 13:36:48,144 - distributed.worker - INFO - Threads: 4 2024-01-25 13:36:48,144 - distributed.worker - INFO - dashboard at: 172.18.1.11:39381 2024-01-25 13:36:48,145 - distributed.worker - INFO - Memory: 5.00 GiB 2024-01-25 13:36:48,146 - distributed.worker - INFO - Waiting to connect to: tcp://172.18.10.2:37891 2024-01-25 13:36:48,147 - distributed.worker - INFO - Local Directory: /scratch/du23yow/dask-scratch-space/worker-nsnp5s9k 2024-01-25 13:36:48,147 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:48,148 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:48,149 - distributed.worker - INFO - 
Threads: 4 2024-01-25 13:36:48,150 - distributed.worker - INFO - Memory: 5.00 GiB 2024-01-25 13:36:48,151 - distributed.worker - INFO - Local Directory: /scratch/du23yow/dask-scratch-space/worker-ph9a352n 2024-01-25 13:36:48,152 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:48,156 - distributed.worker - INFO - Start worker at: tcp://172.18.1.11:43661 2024-01-25 13:36:48,157 - distributed.worker - INFO - Listening to: tcp://172.18.1.11:43661 2024-01-25 13:36:48,157 - distributed.worker - INFO - Start worker at: tcp://172.18.1.11:38341 2024-01-25 13:36:48,157 - distributed.worker - INFO - Worker name: SLURMCluster-0-3 2024-01-25 13:36:48,158 - distributed.worker - INFO - Listening to: tcp://172.18.1.11:38341 2024-01-25 13:36:48,158 - distributed.worker - INFO - dashboard at: 172.18.1.11:43803 2024-01-25 13:36:48,159 - distributed.worker - INFO - Worker name: SLURMCluster-0-2 2024-01-25 13:36:48,160 - distributed.worker - INFO - Waiting to connect to: tcp://172.18.10.2:37891 2024-01-25 13:36:48,161 - distributed.worker - INFO - dashboard at: 172.18.1.11:34447 2024-01-25 13:36:48,162 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:48,163 - distributed.worker - INFO - Waiting to connect to: tcp://172.18.10.2:37891 2024-01-25 13:36:48,164 - distributed.worker - INFO - Threads: 4 2024-01-25 13:36:48,165 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:48,166 - distributed.worker - INFO - Memory: 5.00 GiB 2024-01-25 13:36:48,167 - distributed.worker - INFO - Threads: 4 2024-01-25 13:36:48,168 - distributed.worker - INFO - Local Directory: /scratch/du23yow/dask-scratch-space/worker-g85jdqbe 2024-01-25 13:36:48,169 - distributed.worker - INFO - Memory: 5.00 GiB 2024-01-25 13:36:48,171 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:48,172 - distributed.worker - INFO - Local Directory: /scratch/du23yow/dask-scratch-space/worker-dkgevxqu 2024-01-25 13:36:48,174 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:48,176 - distributed.worker - INFO - Start worker at: tcp://172.18.1.11:36993 2024-01-25 13:36:48,177 - distributed.worker - INFO - Listening to: tcp://172.18.1.11:36993 2024-01-25 13:36:48,178 - distributed.worker - INFO - Worker name: SLURMCluster-0-6 2024-01-25 13:36:48,179 - distributed.worker - INFO - dashboard at: 172.18.1.11:44655 2024-01-25 13:36:48,180 - distributed.worker - INFO - Waiting to connect to: tcp://172.18.10.2:37891 2024-01-25 13:36:48,181 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:48,182 - distributed.worker - INFO - Threads: 4 2024-01-25 13:36:48,182 - distributed.worker - INFO - Start worker at: tcp://172.18.1.11:33131 2024-01-25 13:36:48,183 - distributed.worker - INFO - Memory: 5.00 GiB 2024-01-25 13:36:48,184 - distributed.worker - INFO - Listening to: tcp://172.18.1.11:33131 2024-01-25 13:36:48,184 - distributed.worker - INFO - Local Directory: /scratch/du23yow/dask-scratch-space/worker-e7839uaz 2024-01-25 13:36:48,184 - distributed.worker - INFO - Worker name: SLURMCluster-0-4 2024-01-25 13:36:48,185 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:48,185 - distributed.worker - INFO - dashboard at: 172.18.1.11:37513 2024-01-25 13:36:48,186 - distributed.worker - INFO - Waiting to connect to: tcp://172.18.10.2:37891 2024-01-25 
13:36:48,187 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:48,187 - distributed.worker - INFO - Threads: 4 2024-01-25 13:36:48,188 - distributed.worker - INFO - Memory: 5.00 GiB 2024-01-25 13:36:48,188 - distributed.worker - INFO - Local Directory: /scratch/du23yow/dask-scratch-space/worker-bfxj9pii 2024-01-25 13:36:48,189 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:48,195 - distributed.worker - INFO - Start worker at: tcp://172.18.1.11:36325 2024-01-25 13:36:48,196 - distributed.worker - INFO - Listening to: tcp://172.18.1.11:36325 2024-01-25 13:36:48,197 - distributed.worker - INFO - Worker name: SLURMCluster-0-5 2024-01-25 13:36:48,198 - distributed.worker - INFO - dashboard at: 172.18.1.11:39791 2024-01-25 13:36:48,198 - distributed.worker - INFO - Waiting to connect to: tcp://172.18.10.2:37891 2024-01-25 13:36:48,199 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:48,200 - distributed.worker - INFO - Threads: 4 2024-01-25 13:36:48,201 - distributed.worker - INFO - Memory: 5.00 GiB 2024-01-25 13:36:48,202 - distributed.worker - INFO - Local Directory: /scratch/du23yow/dask-scratch-space/worker-9fky3uvv 2024-01-25 13:36:48,202 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:48,286 - distributed.worker - INFO - Start worker at: tcp://172.18.1.11:37743 2024-01-25 13:36:48,287 - distributed.worker - INFO - Listening to: tcp://172.18.1.11:37743 2024-01-25 13:36:48,287 - distributed.worker - INFO - Worker name: SLURMCluster-0-0 2024-01-25 13:36:48,288 - distributed.worker - INFO - dashboard at: 172.18.1.11:37609 2024-01-25 13:36:48,289 - distributed.worker - INFO - Waiting to connect to: tcp://172.18.10.2:37891 2024-01-25 13:36:48,290 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:48,291 - distributed.worker - INFO - Threads: 4 2024-01-25 13:36:48,292 - distributed.worker - INFO - Memory: 5.00 GiB 2024-01-25 13:36:48,293 - distributed.worker - INFO - Local Directory: /scratch/du23yow/dask-scratch-space/worker-p2y7l_ib 2024-01-25 13:36:48,293 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:49,979 - distributed.worker - INFO - Starting Worker plugin shuffle 2024-01-25 13:36:49,981 - distributed.worker - INFO - Registered to: tcp://172.18.10.2:37891 2024-01-25 13:36:49,982 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:49,985 - distributed.core - INFO - Starting established connection to tcp://172.18.10.2:37891 2024-01-25 13:36:50,011 - distributed.worker - INFO - Starting Worker plugin shuffle 2024-01-25 13:36:50,013 - distributed.worker - INFO - Registered to: tcp://172.18.10.2:37891 2024-01-25 13:36:50,014 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:50,016 - distributed.core - INFO - Starting established connection to tcp://172.18.10.2:37891 2024-01-25 13:36:50,016 - distributed.worker - INFO - Starting Worker plugin shuffle 2024-01-25 13:36:50,021 - distributed.worker - INFO - Registered to: tcp://172.18.10.2:37891 2024-01-25 13:36:50,024 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:50,028 - distributed.core - INFO - Starting established connection to tcp://172.18.10.2:37891 2024-01-25 13:36:50,073 - distributed.worker - INFO - Starting Worker 
plugin shuffle 2024-01-25 13:36:50,076 - distributed.worker - INFO - Registered to: tcp://172.18.10.2:37891 2024-01-25 13:36:50,078 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:50,081 - distributed.core - INFO - Starting established connection to tcp://172.18.10.2:37891 2024-01-25 13:36:50,085 - distributed.worker - INFO - Starting Worker plugin shuffle 2024-01-25 13:36:50,088 - distributed.worker - INFO - Registered to: tcp://172.18.10.2:37891 2024-01-25 13:36:50,089 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:50,091 - distributed.core - INFO - Starting established connection to tcp://172.18.10.2:37891 2024-01-25 13:36:50,097 - distributed.worker - INFO - Starting Worker plugin shuffle 2024-01-25 13:36:50,100 - distributed.worker - INFO - Registered to: tcp://172.18.10.2:37891 2024-01-25 13:36:50,102 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:50,104 - distributed.core - INFO - Starting established connection to tcp://172.18.10.2:37891 2024-01-25 13:36:50,119 - distributed.worker - INFO - Starting Worker plugin shuffle 2024-01-25 13:36:50,121 - distributed.worker - INFO - Registered to: tcp://172.18.10.2:37891 2024-01-25 13:36:50,123 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:50,126 - distributed.core - INFO - Starting established connection to tcp://172.18.10.2:37891 2024-01-25 13:36:50,128 - distributed.worker - INFO - Starting Worker plugin shuffle 2024-01-25 13:36:50,131 - distributed.worker - INFO - Registered to: tcp://172.18.10.2:37891 2024-01-25 13:36:50,133 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:36:50,136 - distributed.core - INFO - Starting established connection to tcp://172.18.10.2:37891 2024-01-25 13:37:18,674 - distributed.core - INFO - Event loop was unresponsive in Worker for 3.03s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability. 2024-01-25 13:37:18,674 - distributed.core - INFO - Event loop was unresponsive in Worker for 3.02s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability. 2024-01-25 13:37:18,678 - distributed.core - INFO - Event loop was unresponsive in Worker for 3.03s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability. 2024-01-25 13:37:18,692 - distributed.core - INFO - Event loop was unresponsive in Worker for 3.06s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability. 2024-01-25 13:37:18,693 - distributed.core - INFO - Event loop was unresponsive in Worker for 3.05s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability. 2024-01-25 13:37:18,696 - distributed.core - INFO - Event loop was unresponsive in Worker for 3.05s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability. 2024-01-25 13:37:18,698 - distributed.core - INFO - Event loop was unresponsive in Worker for 3.05s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability. 
2024-01-25 13:37:18,700 - distributed.core - INFO - Event loop was unresponsive in Worker for 3.05s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability. /home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/rasterio/warp.py:344: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned. _reproject( /home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/rasterio/warp.py:344: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned. _reproject( /home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/rasterio/warp.py:344: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned. _reproject( /home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/rasterio/warp.py:344: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned. _reproject( /home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/rasterio/warp.py:344: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned. _reproject( /home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/rasterio/warp.py:344: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned. _reproject( 2024-01-25 13:40:52,447 - distributed.utils_perf - INFO - full garbage collection released 29.36 MiB from 44704 reference cycles (threshold: 9.54 MiB) 2024-01-25 13:40:52,906 - distributed.utils_perf - INFO - full garbage collection released 822.72 MiB from 28083 reference cycles (threshold: 9.54 MiB) 2024-01-25 13:40:54,422 - distributed.nanny.memory - WARNING - Worker tcp://172.18.1.11:43661 (pid=3518498) exceeded 95% memory budget. Restarting... 2024-01-25 13:40:54,503 - distributed.nanny - INFO - Worker process 3518498 was killed by signal 15 2024-01-25 13:40:57,159 - distributed.worker - WARNING - Compute Failed Key: ('rechunk-p2p-68c9a09f2f9a75b30572b2b8a359a22f', 0, 8, 11) Function: rechunk_unpack args: ('5971f6e58ff2f2162ebf83479df64dc8', (0, 8, 11), 1) kwargs: {} Exception: "RuntimeError('P2P shuffling 5971f6e58ff2f2162ebf83479df64dc8 failed during unpack phase')" 2024-01-25 13:40:57,270 - distributed.worker - WARNING - Compute Failed Key: ('rechunk-p2p-68c9a09f2f9a75b30572b2b8a359a22f', 0, 2, 0) Function: rechunk_unpack args: ('5971f6e58ff2f2162ebf83479df64dc8', (0, 2, 0), 1) kwargs: {} Exception: "RuntimeError('P2P shuffling 5971f6e58ff2f2162ebf83479df64dc8 failed during unpack phase')" 2024-01-25 13:40:57,309 - distributed.worker - WARNING - Compute Failed Key: ('rechunk-p2p-68c9a09f2f9a75b30572b2b8a359a22f', 0, 6, 0) Function: rechunk_unpack args: ('5971f6e58ff2f2162ebf83479df64dc8', (0, 6, 0), 1) kwargs: {} Exception: "RuntimeError('P2P shuffling 5971f6e58ff2f2162ebf83479df64dc8 failed during unpack phase')" 2024-01-25 13:40:57,319 - distributed.nanny - WARNING - Restarting worker 2024-01-25 13:40:58,390 - distributed.worker.memory - WARNING - Worker is at 90% memory usage. Pausing worker. 
Process memory: 4.52 GiB -- Worker memory limit: 5.00 GiB 2024-01-25 13:40:58,610 - distributed.worker - WARNING - Compute Failed Key: ('rechunk-p2p-68c9a09f2f9a75b30572b2b8a359a22f', 0, 6, 11) Function: rechunk_unpack args: ('5971f6e58ff2f2162ebf83479df64dc8', (0, 6, 11), 1) kwargs: {} Exception: "RuntimeError('P2P shuffling 5971f6e58ff2f2162ebf83479df64dc8 failed during unpack phase')" 2024-01-25 13:40:58,611 - distributed.worker - WARNING - Compute Failed Key: ('rechunk-p2p-68c9a09f2f9a75b30572b2b8a359a22f', 0, 12, 11) Function: rechunk_unpack args: ('5971f6e58ff2f2162ebf83479df64dc8', (0, 12, 11), 1) kwargs: {} Exception: "RuntimeError('P2P shuffling 5971f6e58ff2f2162ebf83479df64dc8 failed during unpack phase')" 2024-01-25 13:40:58,674 - distributed.worker.memory - WARNING - Worker is at 43% memory usage. Resuming worker. Process memory: 2.17 GiB -- Worker memory limit: 5.00 GiB 2024-01-25 13:40:59,099 - distributed.worker.memory - WARNING - Worker is at 90% memory usage. Pausing worker. Process memory: 4.50 GiB -- Worker memory limit: 5.00 GiB 2024-01-25 13:40:59,808 - distributed.worker.memory - WARNING - Worker is at 63% memory usage. Resuming worker. Process memory: 3.19 GiB -- Worker memory limit: 5.00 GiB 2024-01-25 13:41:00,319 - distributed.utils_perf - INFO - full garbage collection released 39.99 MiB from 53831 reference cycles (threshold: 9.54 MiB) 2024-01-25 13:41:00,319 - distributed.worker.memory - WARNING - Worker is at 82% memory usage. Pausing worker. Process memory: 4.10 GiB -- Worker memory limit: 5.00 GiB 2024-01-25 13:41:00,325 - distributed.worker - WARNING - Compute Failed Key: ('rechunk-p2p-68c9a09f2f9a75b30572b2b8a359a22f', 0, 2, 13) Function: rechunk_unpack args: ('5971f6e58ff2f2162ebf83479df64dc8', (0, 2, 13), 1) kwargs: {} Exception: "RuntimeError('P2P shuffling 5971f6e58ff2f2162ebf83479df64dc8 failed during unpack phase')" 2024-01-25 13:41:00,394 - distributed.worker - INFO - Start worker at: tcp://172.18.1.11:38919 2024-01-25 13:41:00,394 - distributed.worker - INFO - Listening to: tcp://172.18.1.11:38919 2024-01-25 13:41:00,395 - distributed.worker - INFO - Worker name: SLURMCluster-0-3 2024-01-25 13:41:00,396 - distributed.worker - INFO - dashboard at: 172.18.1.11:38637 2024-01-25 13:41:00,396 - distributed.worker - INFO - Waiting to connect to: tcp://172.18.10.2:37891 2024-01-25 13:41:00,397 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:41:00,397 - distributed.worker - INFO - Threads: 4 2024-01-25 13:41:00,398 - distributed.worker - INFO - Memory: 5.00 GiB 2024-01-25 13:41:00,398 - distributed.worker - INFO - Local Directory: /scratch/du23yow/dask-scratch-space/worker-rqu_n7jl 2024-01-25 13:41:00,399 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:41:00,631 - distributed.worker.memory - WARNING - Worker is at 52% memory usage. Resuming worker. Process memory: 2.65 GiB -- Worker memory limit: 5.00 GiB 2024-01-25 13:41:01,852 - distributed.worker - INFO - Starting Worker plugin shuffle 2024-01-25 13:41:01,854 - distributed.worker - INFO - Registered to: tcp://172.18.10.2:37891 2024-01-25 13:41:01,857 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:41:01,859 - distributed.core - INFO - Starting established connection to tcp://172.18.10.2:37891 /home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/rasterio/warp.py:344: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. 
The identity matrix will be returned. _reproject( 2024-01-25 13:41:04,435 - distributed.worker.memory - WARNING - Worker is at 93% memory usage. Pausing worker. Process memory: 4.66 GiB -- Worker memory limit: 5.00 GiB 2024-01-25 13:41:04,624 - distributed.nanny.memory - WARNING - Worker tcp://172.18.1.11:37743 (pid=3518472) exceeded 95% memory budget. Restarting... 2024-01-25 13:41:04,678 - distributed.nanny - INFO - Worker process 3518472 was killed by signal 15 2024-01-25 13:41:04,917 - distributed.nanny - WARNING - Restarting worker 2024-01-25 13:41:08,432 - distributed.worker - INFO - Start worker at: tcp://172.18.1.11:33497 2024-01-25 13:41:08,434 - distributed.worker - INFO - Listening to: tcp://172.18.1.11:33497 2024-01-25 13:41:08,435 - distributed.worker - INFO - Worker name: SLURMCluster-0-0 2024-01-25 13:41:08,436 - distributed.worker - INFO - dashboard at: 172.18.1.11:36481 2024-01-25 13:41:08,436 - distributed.worker - INFO - Waiting to connect to: tcp://172.18.10.2:37891 2024-01-25 13:41:08,437 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:41:08,438 - distributed.worker - INFO - Threads: 4 2024-01-25 13:41:08,439 - distributed.worker - INFO - Memory: 5.00 GiB 2024-01-25 13:41:08,440 - distributed.worker - INFO - Local Directory: /scratch/du23yow/dask-scratch-space/worker-h6xqjqsj 2024-01-25 13:41:08,442 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:41:09,991 - distributed.worker - INFO - Starting Worker plugin shuffle 2024-01-25 13:41:09,993 - distributed.worker - INFO - Registered to: tcp://172.18.10.2:37891 2024-01-25 13:41:09,994 - distributed.worker - INFO - ------------------------------------------------- 2024-01-25 13:41:09,997 - distributed.core - INFO - Starting established connection to tcp://172.18.10.2:37891 slurmstepd: error: *** JOB 894509 ON node011 CANCELLED AT 2024-01-25T13:45:00 *** ```

I hope all of this makes sense and someone is able to connect the dots here.

hendrikmakait commented 10 months ago

@maawoo: Thanks for reporting your problem. What versions of dask and distributed are you running? The output of client.get_versions() should suffice here.
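For example, something along these lines (a sketch, assuming client is the Client connected to your SLURMCluster):

```python
# check=True raises an error if required package versions differ
# between client, scheduler, and workers.
versions = client.get_versions(check=True)
print(versions["scheduler"]["packages"])
```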

hendrikmakait commented 10 months ago

Just looking at the provided output, it seems like your workers are running out of memory. I'll have to look deeper into your workload characteristics to make good recommendations on how to avoid that scenario, though.
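In the meantime, a quick way to see how close each worker is to its limit (a rough sketch using the client, assuming recent distributed versions that report process memory in the worker metrics; the dashboard's memory plot shows the same information):

```python
# Per-worker memory usage vs. configured limit, as reported to the scheduler.
info = client.scheduler_info()
for addr, w in info["workers"].items():
    used = w["metrics"]["memory"] / 2**30
    limit = w["memory_limit"] / 2**30
    print(f"{addr}: {used:.2f} GiB / {limit:.2f} GiB")
```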

maawoo commented 10 months ago

Here is the output of client.get_versions() for the scheduler (the workers use the same versions):

{'python': '3.11.7.final.0', 'python-bits': 64, 'OS': 'Linux', 'OS-release': '4.18.0-425.13.1.el8_7.x86_64', 'machine': 'x86_64', 'processor': 'x86_64', 'byteorder': 'little', 'LC_ALL': 'None', 'LANG': 'en_US.UTF-8'}, 'packages': {'python': '3.11.7.final.0', 'dask': '2024.1.0', 'distributed': '2024.1.0', 'msgpack': '1.0.7', 'cloudpickle': '3.0.0', 'tornado': '6.3.3', 'toolz': '0.12.0', 'numpy': '1.26.3', 'pandas': '2.2.0', 'lz4': None}

fjetter commented 10 months ago

It looks like you're running on rather small workers. The dashboard suggests that every worker has only about 6 GB of memory, which is a little small and doesn't give us a lot of room to move. I totally agree that this KeyError is bad UX, but I also recommend running on slightly larger machines.
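With dask-jobqueue that is just a larger memory value per job, e.g. (illustrative numbers, keeping the 8-processes-per-job layout from your logs):

```python
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=32,
    processes=8,
    memory="80GiB",   # ~10 GiB per worker process instead of ~5 GiB
)
```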

maawoo commented 10 months ago

Thanks for your quick replies! You're right... I gave my workers more memory to work with (10 GiB each) and the process finished without issues. To me, the worker memory usage in the first dashboard recording looks like nothing compared to the default tasks rechunk method (thanks for p2p, btw!). That's why I hadn't considered it as the root cause of all of these errors and warnings.

maawoo commented 10 months ago

Even though it works better with the increased memory per worker, I still occasionally run into some form of this problem. The error message now looks like this:

2024-01-26 12:29:58,292 - distributed.core - ERROR - Exception while handling op shuffle_restrict_task
Traceback (most recent call last):
  File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/core.py", line 967, in _handle_comm
    result = handler(**msg)
             ^^^^^^^^^^^^^^
  File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_scheduler_plugin.py", line 102, in restrict_task
    shuffle = self.active_shuffles[id]
              ~~~~~~~~~~~~~~~~~~~~^^^^
KeyError: 'd6f8b2d9ed84fca0a899360d7ddf2b2b'
2024-01-26 12:30:02,425 - distributed.core - ERROR - Exception while handling op shuffle_restrict_task
Traceback (most recent call last):
  File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/core.py", line 967, in _handle_comm
    result = handler(**msg)
             ^^^^^^^^^^^^^^
  File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_scheduler_plugin.py", line 102, in restrict_task
    shuffle = self.active_shuffles[id]
              ~~~~~~~~~~~~~~~~~~~~^^^^
KeyError: 'd6f8b2d9ed84fca0a899360d7ddf2b2b'
2024-01-26 12:30:02,431 - distributed.core - ERROR - Exception while handling op shuffle_restrict_task
Traceback (most recent call last):
  File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/core.py", line 967, in _handle_comm
    result = handler(**msg)
             ^^^^^^^^^^^^^^
  File "/home/du23yow/micromamba/envs/dev_sdc_env/lib/python3.11/site-packages/distributed/shuffle/_scheduler_plugin.py", line 102, in restrict_task
    shuffle = self.active_shuffles[id]
              ~~~~~~~~~~~~~~~~~~~~^^^^
KeyError: 'd6f8b2d9ed84fca0a899360d7ddf2b2b'

In contrast to my initial report, this time I don't see any warnings regarding worker memory such as distributed.worker.memory - WARNING - Worker is at xx% memory usage. or distributed.nanny.memory - WARNING - Worker [...] exceeded xx% memory budget. Restarting.... The following information was logged by my workers at the time the above error occurred:

Worker 1 ``` 2024-01-26 12:29:58,054 - distributed.nanny - INFO - Worker process 1226259 was killed by signal 11 ```
Worker 2 ``` 2024-01-26 12:29:59,469 - distributed.worker - WARNING - Compute Failed Key: ('rechunk-p2p-92b67845722c7202cf413f1732afae90', 0, 34, 28) Function: rechunk_unpack args: ('d6f8b2d9ed84fca0a899360d7ddf2b2b', (0, 34, 28), 1) kwargs: {} Exception: "RuntimeError('P2P shuffling d6f8b2d9ed84fca0a899360d7ddf2b2b failed during unpack phase')" 2024-01-26 12:30:02,416 - distributed.worker - WARNING - Compute Failed Key: ('rechunk-p2p-92b67845722c7202cf413f1732afae90', 0, 22, 20) Function: rechunk_unpack args: ('d6f8b2d9ed84fca0a899360d7ddf2b2b', (0, 22, 20), 1) kwargs: {} Exception: "RuntimeError('P2P shuffling d6f8b2d9ed84fca0a899360d7ddf2b2b failed during unpack phase')" 2024-01-26 12:30:02,441 - distributed.worker - WARNING - Compute Failed Key: ('rechunk-p2p-92b67845722c7202cf413f1732afae90', 0, 10, 9) Function: rechunk_unpack args: ('d6f8b2d9ed84fca0a899360d7ddf2b2b', (0, 10, 9), 1) kwargs: {} Exception: "RuntimeError('P2P shuffling d6f8b2d9ed84fca0a899360d7ddf2b2b failed during unpack phase')" 2024-01-26 12:30:02,448 - distributed.worker - WARNING - Compute Failed Key: ('rechunk-p2p-92b67845722c7202cf413f1732afae90', 0, 18, 13) Function: rechunk_unpack args: ('d6f8b2d9ed84fca0a899360d7ddf2b2b', (0, 18, 13), 1) kwargs: {} Exception: "RuntimeError('P2P shuffling d6f8b2d9ed84fca0a899360d7ddf2b2b failed during unpack phase')" ```
Worker 3 ``` 2024-01-26 12:29:58,996 - distributed.worker - WARNING - Compute Failed Key: ('rechunk-p2p-92b67845722c7202cf413f1732afae90', 0, 6, 22) Function: rechunk_unpack args: ('d6f8b2d9ed84fca0a899360d7ddf2b2b', (0, 6, 22), 1) kwargs: {} Exception: "RuntimeError('P2P shuffling d6f8b2d9ed84fca0a899360d7ddf2b2b failed during unpack phase')" 2024-01-26 12:30:00,123 - distributed.worker - WARNING - Compute Failed Key: ('rechunk-p2p-92b67845722c7202cf413f1732afae90', 0, 25, 13) Function: rechunk_unpack args: ('d6f8b2d9ed84fca0a899360d7ddf2b2b', (0, 25, 13), 1) kwargs: {} Exception: "RuntimeError('P2P shuffling d6f8b2d9ed84fca0a899360d7ddf2b2b failed during unpack phase')" ```
fjetter commented 10 months ago

Signal 11 is a segmentation fault. This isn't great. Can you please share what kind of data schema you are using? My best guess is that this happens somewhere in pyarrow, so we'd also need the pyarrow version you're using.
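Something like this would give us the relevant details (a sketch, assuming the data object from your example above is still around):

```python
import pyarrow

print("pyarrow:", pyarrow.__version__)

# dtype and chunk layout of the array that is being rechunked
print(data.B12.dtype, data.B12.data.chunksize, data.B12.data.numblocks)
```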

@hendrikmakait we should figure out a way to reraise exceptions like this with a clear error indicating that a worker died.