Open chrisroat opened 2 years ago
After deleting the pods with the 5 wedged workers, the graph completed.
I assume these are the five workers which have so much data spilled on the dashboard? (See bytes stored by worker)
"unresponsive... long-running ... GIL" errors -- though it look like the running tasks are
stack
.
Anything interesting happening in this function? E.g. is there native code running which may not release the GIL?
After deleting the pods with the 5 wedged workers, the graph completed.
That strongly indicates another state machine problem.
I think the stack
task is just a da.stack
call, though it's possible that optimization fused some other things. The CPU has dropped to zero.
I'm not sure how to better help diagnose the state machine issues.
I find most problems with autoscaling clusters. I could set the cluster size to very high numbers, but the workloads vary a lot and this would be wasteful. I already waste a lot because my autoscale lower bound needs to be high enough to prevent KilledWorker
issues while waiting for new workers to come online... so the tail end of the graph where a few tasks are left has very low utilization.
I attempted to use dump_cluster_state, but hit a timeout error.
These timeouts are not unexpected in such an auto scaling situation. The latest release (2022.01.0) includes a fix to the cluster dump function to not fail in case there are timeouts to individual workers.
I just encountered a new mode, where the dashboard says 3 tasks are processing but the graph in the lower left shows 0 (and the Info page shows no workers with anything in "processing"). I typically have to delete a pod to unstick things, but this time I don't know which pod!
I searched through the logs for "overlap-getitem", since that is the most upstream task prefix that is wedged. In the scheduler, I see 12 lines in succession (always the same key, but several different workers) in the scheduler:
Unexpected worker completed task. Expected: <WorkerState 'tls://10.18.96.2:43711', name: dask-worker-5ed6bca7ff4e4dfdb88a823c2b5e540b-jcxzh, status: running, memory: 809, processing: 205>, Got: <WorkerState 'tls://10.16.244.2:36389', name: dask-worker-5ed6bca7ff4e4dfdb88a823c2b5e540b-trg9s, status: running, memory: 139, processing: 28>, Key: ('overlap-getitem-666998704fa0ead8d1aecce3291ebb76', 1, 2, 18)
I tried deleting the workers mentioned, some of which are still around, but that did not bring back processing of the missing keys.
About 25 seconds before those log entries, I see:
distributed.core - INFO - Event loop was unresponsive in Scheduler for 31.66s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability
Some 50 minutes later:
File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
return self.callback()
File "/opt/conda/lib/python3.8/site-packages/distributed/stealing.py", line 477, in balance
maybe_move_task(
File "/opt/conda/lib/python3.8/site-packages/distributed/stealing.py", line 378, in maybe_move_task
self.move_task_request(ts, sat, idl)
File "/opt/conda/lib/python3.8/site-packages/distributed/stealing.py", line 258, in move_task_request
self.scheduler.stream_comms[victim.address].send(
KeyError: 'tls://10.18.96.2:43711'
That key is seen in the logs as a worker being retired just a few seconds prior to the "Unexpected worker completed task" messages. The same key is re-registered in the middle of those "Unexpected worker completed task" messages.
The cluster dump would not upload to github, so I put it here:
https://drive.google.com/file/d/1okQ6M8M8smP0oBhvlrdbQLJJbWp7_32U/view?usp=sharing
We have recently started seeing this behavior as well, currently on dask+distributed 2022.1.0 but I think it's been happening for a couple of releases. Same pattern as @chrisroat:
2022-01-21 04:02:49.930 PST
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://10.107.43.20:34655', name: hardisty-f5562fc3-daskworkers-6c8f6dbdf4-68zs4, status: running, memory: 9, processing: 256>
client.processing()
shows nothing in progress but the Status page still shows the correct progress value@crusaderky @fjetter any chance #5665 could be related given that this worker was left for dead but then kept going...?
@bnaul #5665 mitigates, but does not fix, the issue. Everything in the opening post, particularly the memory usage graph of the stuck workers, seems to indicate that this may be a duplicate of https://github.com/dask/distributed/issues/3761:
I can see from your dashboard that 4 workers are paused with extremely high amounts of spilled memory. They never get out of paused state and they hold a bunch of queued tasks, which are not returned to the scheduler. The reason why they don't get out of paused state may be (educated guess) that you ran out of disk space and they can't swap out anymore. As of the latest release, the error handling for this use case is fragile - see https://github.com/dask/distributed/issues/5364 for improvement.
I currently have two changes in my pipeline:
While you wait for those, I would suggest you try to figure out why you have 180 GiB worth of managed data per worker on 4 workers while the rest of the cluster is relatively free.
Do those workers contain replicas of data that already exists elsewhere? In that case, the new Active Memory Manager could help you: https://coiled.io/blog/introducing-the-dask-active-memory-manager/
You can also try running rebalance() periodically. Word of warning - the method is not safe to run on a busy scheduler. Making it safe and automatic is also in my pipeline.
I see this consolidation of tasks a few workers when I use auto-scaling and spot instances. I think the addition/deletion of workers is throwing a curveball to the scheduler. I tried to repro in #4471 (no deadlock -- just bad scheduling), where you see a small number of workers get all the tasks as the cluster scales up.
The graph above is a map_overlap on a 3d-chunked array with 21720=680 chunks, reading from and writing to the zarr format. This leads to 82k tasks(!). I notice in an issue you link that one can adjust fuse parameters. Is that worth trying -- could it reduce the number of tasks and perhaps lower the rate of wonky scheduling & deadlocks?
FWIW, I did attempt the AMM at one point. If I set the env var DASK_DISTRIBUTEDSCHEDULERACTIVE_MEMORY-MANAGER__START to "true", it turns on -- do I need additional settings?
I see this consolidation of tasks a few workers when I use auto-scaling and spot instances. I think the addition/deletion of workers is throwing a curveball to the scheduler.
Do you mean that the 4 workers that I see saturated in the dashboard are the initial ones and the rest of the cluster only goes online when they are already very full?
The graph above is a map_overlap on a 3d-chunked array with 2_17_20=680 chunks, reading from and writing to the zarr format. This leads to 82k tasks(!). I notice in an issue you link that one can adjust fuse parameters. Is that worth trying -- could it reduce the number of tasks and perhaps lower the rate of wonky scheduling & deadlocks?
fuse parameters can reduce the number of tasks, which in turn translates in a reduction in CPU load on the scheduler. If your scheduler CPU is frequently near 100%, this should in turn translate in a reduction in random timeouts and disconnects. I don't think this is your case - unless you're frequently seeing them in the logs.
FWIW, I did attempt the AMM at one point. If I set the env var DASK_DISTRIBUTEDSCHEDULERACTIVE_MEMORY-MANAGER__START to "true", it turns on -- do I need additional settings?
Should be enough. Please run client.amm.running()
to double check.
Could you run client.rebalance()
every once in a while? I wouldn't recommend it as a production fix (it's not robust in the main branch while computations are running), but if it does help you it will give me a case for raising the priority of the rework I'm doing on it.
Do you mean that the 4 workers that I see saturated in the dashboard are the initial ones and the rest of the cluster only goes online when they are already very full?
I don't think so. In the issue I references (#4471), I scale from 1 to 4 to 20 workers. It looks like 6 workers get the bulk of the load, and a few others get some load.
fuse parameters can reduce the number of tasks, which in turn translates in a reduction in CPU load on the scheduler. If your scheduler CPU is frequently near 100%, this should in turn translate in a reduction in random timeouts and disconnects. I don't think this is your case - unless you're frequently seeing them in the logs.
I do see a lot of comm errors in the logs.
FWIW, I did attempt the AMM at one point. If I set the env var DASK_DISTRIBUTEDSCHEDULERACTIVE_MEMORY-MANAGER__START to "true", it turns on -- do I need additional settings?
Should be enough. Please run
client.amm.running()
to double check.Could you run
client.rebalance()
every once in a while? I wouldn't recommend it as a production fix (it's not robust in the main branch while computations are running), but if it does help you it will give me a case for raising the priority of the rework I'm doing on it.
I could try that on some future run. I do see tons of GIL errors, which I guess must be the read/write to disk that is happening. It does seem like distributing the work better would mitigate things.
FYI I recently merged a PR which might help in this situation. At least the failure scenario I fixed there is likely to happen in scaling situations, particularly if workers are removed https://github.com/dask/distributed/pull/5786 Note that we got already another deadlock reported (off github) after this fix so we're not over the hill, yet.
The "few workers taking all the tasks" thing is explainable and fixable by https://github.com/dask/distributed/pull/6115 .
This doesn't explain things fully halting though, unless possibly you've run out of disk space (is this possible?)
What happened:
I have a dask-gateway cluster on k8s, and a 63k task graph. The processing has essentially stalled. Five workers have 375 tasks each, while the remaining 75 have zero. The scheduler logs have a bunch of timeout errors. The workers with all the tasks have the "unresponsive... long-running ... GIL" errors -- though it look like the running tasks are
stack
.I attempted to use
dump_cluster_state
, but hit a timeout error.Anything else we need to know?:
After deleting the pods with the 5 wedged workers, the graph completed.
Environment:
Cluster Dump State:
``` --------------------------------------------------------------------------- TimeoutError Traceback (most recent call last) /opt/conda/lib/python3.8/site-packages/distributed/comm/core.py in connect() 318 # write, handshake = await asyncio.gather(comm.write(local_info), comm.read()) --> 319 handshake = await asyncio.wait_for(comm.read(), time_left()) 320 await asyncio.wait_for(comm.write(local_info), time_left()) /opt/conda/lib/python3.8/asyncio/tasks.py in wait_for() 500 await _cancel_and_wait(fut, loop=loop) --> 501 raise exceptions.TimeoutError() 502 finally: TimeoutError: The above exception was the direct cause of the following exception: OSError Traceback (most recent call last) /tmp/ipykernel_894/3630066435.py in