dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Task stuck in "processing" on closed worker #6263

Closed bnaul closed 2 years ago

bnaul commented 2 years ago

Similar at a high level to #6198, but a slightly different manifestation: the dashboard shows 9 remaining tasks (one is a parent task that spawned the other 8 by calling dd.read_parquet), but the Info page shows only the one parent task processing. [dashboard screenshot]

In the case of #6198 the worker showed up in the scheduler Info page (but would 404 when you tried to click through to its info); here the scheduler knows the workers are gone, but there are still tasks assigned to them anyway:

# Just the one parent task
In [33]: {k: v for k, v in c.processing().items() if v}
Out[33]: {'tcp://10.124.10.46:43097': ('partition_inputs-ea98882db0754e4497b1dcdd7d22e236',)}

# Here all the stragglers show up; each one is on a worker with status="closed"
In [34]: c.run_on_scheduler(lambda dask_scheduler: {ts.key: str(ts.processing_on) for ts in dask_scheduler.tasks.values() if ts.state in ("processing", "waiting")})
Out[34]:
{'partition_inputs-ea98882db0754e4497b1dcdd7d22e236': "<WorkerState 'tcp://10.124.10.46:43097', name: hardisty-2bab80b9-daskworkers-98d64ff7-28n8w, status: running, memory: 179, processing: 1>",
 "('_split-ef14f4c130766e401c7f9804e1c5a7bd', 11771)": 'None',
 "('read-parquet-14e1bf051c27f27510e4873d5022c6a8', 11771)": "<WorkerState 'tcp://10.126.160.29:33011', name: 77, status: closed, memory: 0, processing: 18>",
 "('_split-ef14f4c130766e401c7f9804e1c5a7bd', 7365)": 'None',
 "('read-parquet-14e1bf051c27f27510e4873d5022c6a8', 7365)": "<WorkerState 'tcp://10.126.233.26:37341', name: 125, status: closed, memory: 0, processing: 23>",
 "('_split-ef14f4c130766e401c7f9804e1c5a7bd', 3225)": 'None',
 "('read-parquet-14e1bf051c27f27510e4873d5022c6a8', 3225)": "<WorkerState 'tcp://10.126.167.29:36945', name: 232, status: closed, memory: 0, processing: 10>",
 "('_split-ef14f4c130766e401c7f9804e1c5a7bd', 7711)": 'None',
 "('read-parquet-14e1bf051c27f27510e4873d5022c6a8', 7711)": "<WorkerState 'tcp://10.126.167.29:36945', name: 232, status: closed, memory: 0, processing: 10>",
 "('_split-ef14f4c130766e401c7f9804e1c5a7bd', 4873)": 'None',
 "('read-parquet-14e1bf051c27f27510e4873d5022c6a8', 4873)": "<WorkerState 'tcp://10.127.22.29:43185', name: 412, status: closed, memory: 0, processing: 10>",
 "('_split-ef14f4c130766e401c7f9804e1c5a7bd', 3331)": 'None',
 "('read-parquet-14e1bf051c27f27510e4873d5022c6a8', 3331)": "<WorkerState 'tcp://10.126.160.29:33011', name: 77, status: closed, memory: 0, processing: 18>",
 "('_split-ef14f4c130766e401c7f9804e1c5a7bd', 11393)": 'None',
 "('read-parquet-14e1bf051c27f27510e4873d5022c6a8', 11393)": "<WorkerState 'tcp://10.126.71.26:32909', name: 250, status: closed, memory: 0, processing: 10>",
 "('_split-ef14f4c130766e401c7f9804e1c5a7bd', 10716)": 'None',
 "('read-parquet-14e1bf051c27f27510e4873d5022c6a8', 10716)": "<WorkerState 'tcp://10.126.71.26:32909', name: 250, status: closed, memory: 0, processing: 10>"}

Zooming in on the first closed worker, 10.126.160.29:33011, here are the relevant scheduler logs:

(.venv) ➜  model git:(master) ✗ kl hardisty-2bab80b9-daskscheduler-5c84ddcfd4-hfplp  | grep 10.126.160.29:33011
2022-05-03 02:05:00,218 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://10.126.160.29:33011', name: 77, status: undefined, memory: 0, processing: 0>
2022-05-03 02:05:00,219 - distributed.scheduler - INFO - Starting worker compute stream, tcp://10.126.160.29:33011
2022-05-03 02:07:38,344 - distributed.scheduler - INFO - Retiring worker tcp://10.126.160.29:33011
2022-05-03 02:07:38,684 - distributed.active_memory_manager - INFO - Retiring worker tcp://10.126.160.29:33011; 21 keys are being moved away.
2022-05-03 02:07:39,572 - distributed.active_memory_manager - INFO - Retiring worker tcp://10.126.160.29:33011; 21 keys are being moved away.
2022-05-03 02:07:42,554 - distributed.active_memory_manager - INFO - Retiring worker tcp://10.126.160.29:33011; 10 keys are being moved away.
2022-05-03 02:07:43,888 - distributed.scheduler - INFO - Closing worker tcp://10.126.160.29:33011
2022-05-03 02:07:43,888 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://10.126.160.29:33011', name: 77, status: closing_gracefully, memory: 13, processing: 16>
2022-05-03 02:07:43,888 - distributed.core - INFO - Removing comms to tcp://10.126.160.29:33011
2022-05-03 02:07:43,895 - distributed.scheduler - INFO - Retired worker tcp://10.126.160.29:33011
2022-05-03 02:07:51,117 - distributed.scheduler - INFO - Unexpected worker completed task. Expected: <WorkerState 'tcp://10.127.207.8:34749', name: 2314, status: running, memory: 11, processing: 5>, Got: <WorkerState 'tcp://10.126.160.29:33011', name: 77, status: closing_gracefully, memory: 12, processing: 0>, Key: ('read-parquet-14e1bf051c27f27510e4873d5022c6a8', 7489)
2022-05-03 02:07:51,118 - distributed.scheduler - INFO - Unexpected worker completed task. Expected: <WorkerState 'tcp://10.127.199.4:32777', name: 619, status: running, memory: 13, processing: 7>, Got: <WorkerState 'tcp://10.126.160.29:33011', name: 77, status: closing_gracefully, memory: 14, processing: 0>, Key: ('_split-ef14f4c130766e401c7f9804e1c5a7bd', 7535)
2022-05-03 02:07:51,119 - distributed.scheduler - INFO - Unexpected worker completed task. Expected: <WorkerState 'tcp://10.127.7.24:33955', name: 1049, status: running, memory: 6, processing: 5>, Got: <WorkerState 'tcp://10.126.160.29:33011', name: 77, status: closing_gracefully, memory: 14, processing: 0>, Key: ('read-parquet-14e1bf051c27f27510e4873d5022c6a8', 12280)
2022-05-03 02:07:51,121 - distributed.scheduler - INFO - Unexpected worker completed task. Expected: <WorkerState 'tcp://10.127.113.32:41691', name: 572, status: running, memory: 8, processing: 5>, Got: <WorkerState 'tcp://10.126.160.29:33011', name: 77, status: closing_gracefully, memory: 15, processing: 0>, Key: ('_split-ef14f4c130766e401c7f9804e1c5a7bd', 2679)
2022-05-03 02:07:51,121 - distributed.scheduler - INFO - Unexpected worker completed task. Expected: <WorkerState 'tcp://10.127.199.4:32777', name: 619, status: running, memory: 12, processing: 6>, Got: <WorkerState 'tcp://10.126.160.29:33011', name: 77, status: closing_gracefully, memory: 15, processing: 0>, Key: ('_split-ef14f4c130766e401c7f9804e1c5a7bd', 6989)
2022-05-03 02:07:51,121 - distributed.scheduler - INFO - Unexpected worker completed task. Expected: <WorkerState 'tcp://10.127.133.31:42383', name: 2377, status: running, memory: 8, processing: 5>, Got: <WorkerState 'tcp://10.126.160.29:33011', name: 77, status: closing_gracefully, memory: 15, processing: 0>, Key: ('_split-ef14f4c130766e401c7f9804e1c5a7bd', 844)
2022-05-03 02:07:51,121 - distributed.scheduler - INFO - Unexpected worker completed task. Expected: <WorkerState 'tcp://10.126.226.29:45983', name: 2094, status: running, memory: 7, processing: 6>, Got: <WorkerState 'tcp://10.126.160.29:33011', name: 77, status: closing_gracefully, memory: 15, processing: 0>, Key: ('read-parquet-14e1bf051c27f27510e4873d5022c6a8', 10595)
2022-05-03 02:07:51,123 - distributed.scheduler - INFO - Unexpected worker completed task. Expected: <WorkerState 'tcp://10.127.195.7:42145', name: 2338, status: running, memory: 5, processing: 5>, Got: <WorkerState 'tcp://10.126.160.29:33011', name: 77, status: closing_gracefully, memory: 16, processing: 0>, Key: ('_split-ef14f4c130766e401c7f9804e1c5a7bd', 11009)
2022-05-03 02:07:51,123 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://10.126.160.29:33011', name: 77, status: closing_gracefully, memory: 16, processing: 0>
2022-05-03 02:07:51,128 - distributed.scheduler - INFO - Starting worker compute stream, tcp://10.126.160.29:33011
2022-05-03 02:07:55,963 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://10.126.160.29:33011', name: 77, status: closing_gracefully, memory: 5, processing: 0>
2022-05-03 02:07:55,963 - distributed.core - INFO - Removing comms to tcp://10.126.160.29:33011

And worker logs:

2022-05-03 02:04:58,377 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.126.160.29:45303'
2022-05-03 02:04:59,723 - distributed.worker - INFO -       Start worker at:  tcp://10.126.160.29:33011
2022-05-03 02:04:59,723 - distributed.worker - INFO -          Listening to:  tcp://10.126.160.29:33011
2022-05-03 02:04:59,723 - distributed.worker - INFO -          dashboard at:        10.126.160.29:39677
2022-05-03 02:04:59,723 - distributed.worker - INFO - Waiting to connect to: tcp://hardisty-2bab80b9-daskscheduler:8786
2022-05-03 02:04:59,723 - distributed.worker - INFO - -------------------------------------------------
2022-05-03 02:04:59,723 - distributed.worker - INFO -               Threads:                          4
2022-05-03 02:04:59,723 - distributed.worker - INFO -                Memory:                   7.82 GiB
2022-05-03 02:04:59,723 - distributed.worker - INFO -       Local Directory: /src/dask-worker-space/worker-7181cv4y
2022-05-03 02:04:59,723 - distributed.worker - INFO - -------------------------------------------------
2022-05-03 02:05:00,219 - distributed.worker - INFO -         Registered to: tcp://hardisty-2bab80b9-daskscheduler:8786
2022-05-03 02:05:00,220 - distributed.worker - INFO - -------------------------------------------------
2022-05-03 02:05:00,221 - distributed.core - INFO - Starting established connection
2022-05-03 02:07:02,854 - distributed.utils_perf - INFO - full garbage collection released 473.14 MiB from 28 reference cycles (threshold: 9.54 MiB)
2022-05-03 02:07:44,775 - distributed.worker - INFO - -------------------------------------------------
2022-05-03 02:07:46,212 - distributed.worker - INFO - Stopping worker at tcp://10.126.160.29:33011
2022-05-03 02:07:46,212 - distributed.worker - INFO - Not reporting worker closure to scheduler
2022-05-03 02:07:46,219 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting.  Status: Status.closing
2022-05-03 02:07:51,128 - distributed.worker - INFO -         Registered to: tcp://hardisty-2bab80b9-daskscheduler:8786
2022-05-03 02:07:51,129 - distributed.worker - INFO - -------------------------------------------------
2022-05-03 02:07:51,129 - distributed.core - INFO - Starting established connection
2022-05-03 02:07:52,639 - distributed.worker - ERROR - Exception during execution of task ('read-parquet-14e1bf051c27f27510e4873d5022c6a8', 3331).
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/distributed/worker.py", line 3677, in execute
    result = await self.loop.run_in_executor(
  File "/usr/local/lib/python3.9/site-packages/tornado/platform/asyncio.py", line 257, in run_in_executor
    return self.asyncio_loop.run_in_executor(executor, func, *args)
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 814, in run_in_executor
    executor.submit(func, *args), loop=self)
  File "/usr/local/lib/python3.9/site-packages/distributed/_concurrent_futures_thread.py", line 127, in submit
    raise RuntimeError("cannot schedule new futures after shutdown")
RuntimeError: cannot schedule new futures after shutdown
2022-05-03 02:07:53,622 - distributed.nanny - INFO - Worker closed
2022-05-03 02:07:53,623 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-05-03 02:07:54,097 - distributed.nanny - INFO - Closing Nanny at 'tcp://10.126.160.29:45303'. Report closure to scheduler: None
2022-05-03 02:07:54,098 - distributed.dask_worker - INFO - End worker

Also finally got a successful cluster dump 🎉

cc @gjoseph92 @fjetter @crusaderky @mrocklin

bnaul commented 2 years ago

One other thing: there are a bunch of these KeyErrors in the scheduler logs

2022-05-03 02:08:02,526 - distributed.stealing - ERROR - 'tcp://10.126.167.29:36945'
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/distributed/stealing.py", line 242, in move_task_request
    self.scheduler.stream_comms[victim.address].send(
KeyError: 'tcp://10.126.167.29:36945'
2022-05-03 02:08:02,801 - distributed.stealing - ERROR - 'tcp://10.126.167.29:36945'
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/distributed/stealing.py", line 466, in balance
    maybe_move_task(
  File "/usr/local/lib/python3.9/site-packages/distributed/stealing.py", line 369, in maybe_move_task
    self.move_task_request(ts, victim, thief)
  File "/usr/local/lib/python3.9/site-packages/distributed/stealing.py", line 242, in move_task_request
    self.scheduler.stream_comms[victim.address].send(
KeyError: 'tcp://10.126.167.29:36945'

not for the specific worker I chose above, but for some of the other 🧟s

mrocklin commented 2 years ago

@gjoseph92 do you have time this week to take a look at this?

Somehow the scheduler has assigned a task to run on a closed WorkerState

gjoseph92 commented 2 years ago

Thanks @bnaul, I'll take a look this week.

gjoseph92 commented 2 years ago

Hey @bnaul—sorry to say, but I won't be able to get to this this week. https://github.com/dask/distributed/pull/6329 is proving to be harder to merge than expected, and Florian is out sick, so we're running behind. This will be next up for next week though.

gjoseph92 commented 2 years ago

@bnaul to confirm, what version of distributed is this?

mrocklin commented 2 years ago

I think that Brett has been on main for a little while (or main whenever this was)

gjoseph92 commented 2 years ago

That's what I thought too, and why I wanted to know. There have been a lot of changes recently to the areas that this touches, so getting a specific commit would be helpful.

gjoseph92 commented 2 years ago

Dumping a ton of notes here from investigation. There are two major bugs here. I'll eventually open tickets for each.

1. After the scheduler tells a worker to close, it may reconnect to the scheduler.

This has a bit of overlap with #5480, and sort of #6341. I think that #6329 might alleviate one side, but there's a more general race condition that's probably been around for a long time.

  1. The RetireWorker policy finishes. Scheduler calls await self.close_worker.
  2. close_worker calls remove_worker. (It also redundantly enqueues a close message to the worker; remove_worker is about to do the same, though I don't think this makes a difference here.)
  3. remove_worker does a bunch of stuff, including deleting the BatchedSend to that worker, which has the {op: close} message queued on it. It does not wait until the BatchedSend queue is flushed.
  4. remove_worker removes the WorkerState entry for that worker
  5. At this point, the scheduler has removed all knowledge of the worker. However, the close message hasn't even been sent to it yet—it's still queued inside a BatchedSend, which may not send the message for up to 5ms more. And even after the message has been sent, Worker.close still has to get scheduled on the worker event loop and run (which could take a while; see https://github.com/dask/distributed/issues/6325). There are multiple sources of truth for when a worker is gone. In some places, it's whether addr in Scheduler.workers, or addr in Scheduler.stream_comms. In others, it's whether a comm to that worker is closed. The entry being removed from the dict and the comm being closed are disjoint events.

    Thus, between when Scheduler.remove_worker ends and all comms to the worker actually close, we are in a degenerate state and exposed to race conditions: the scheduler forgets about the worker before closing communications to it, or even confirming that the worker has received the message to close. Explicitly flushing and closing the BatchedSend would alleviate this, though not resolve it: while waiting for our send side to close, we could still receive messages from the now-removed worker.

    I think simply moving the flush-and-close to the top of Scheduler.remove_worker—before anything else happens—would fix the problem. Then, we wouldn't be removing state related to the worker until we were guaranteed the worker connection was closed.

  6. After remove_worker has run, but before the close message actually gets sent over the BatchedSend / Worker.close starts running, the worker sends another heartbeat to the scheduler.
  7. From the scheduler's perspective, this worker doesn't exist, so it replies "missing" to the worker.
  8. Though the worker is in state closing_gracefully, it still tries to re-register.
    1. Fun fact (not relevant to this, but yet another broken thing in BatchedSend land): this is going to call Worker.batched_stream.start with a new comm object, even though Worker.batched_stream is already running with the previous comm object. This is the double-start bug I pointed out in https://github.com/dask/distributed/pull/6272/files#r866082750. This will swap out the comm that the BatchedSend is using, and launch a second background_send coroutine. Surprisingly, I think having multiple background_sends racing to process one BatchedSend is still safe, just silly.
    2. The BatchedSend has now lost its reference to the original comm. However, one Worker.handle_scheduler is still running with that original comm, and now one is running with the new comm too. Since there's still a reference to the old comm, it isn't closed.
    3. Because the worker's old comm wasn't closed, the old Scheduler.handle_worker is still running. Even though, from the scheduler's perspective, the worker it's supposed to be handling doesn't exist anymore. Yay https://github.com/dask/distributed/issues/6201 — if we had a handle on this coroutine, we could have cancelled it in remove_worker.
  9. The scheduler handles this "new" worker connecting. (This passes through the buggy code of #6341, though I don't think that actually makes a difference in this case.) This is where all of the "Unexpected worker completed task" messages come from, followed by another "Register worker" and "Starting worker compute stream".
  10. Eventually, Worker.close actually closes (one of) its batched comms to the scheduler.
  11. This triggers a second remove_worker on the scheduler, removing the entry for the "new" worker.
    1. There's probably still a Scheduler.handle_worker running for the old comm too. I presume it eventually stops when the worker actually shuts down and severs the connection? When it does, it probably won't run remove_worker another time, because the address is already gone from Scheduler.stream_comms.

Discussion

Finding yet another bug in worker reconnection might feel like another vote for https://github.com/dask/distributed/issues/6350.

The real bug here is how Scheduler.remove_worker puts things in a broken state every time it runs (comms still exist to a worker, WorkerState doesn't). And whether we do #6350 or #6329, we need to fix this either way.

It shouldn't be that hard to fix. (It may be harder to ensure it doesn't get broken again in the future, though—the design is still brittle.) Scheduler.remove_worker just needs to await self.stream_comms[address].close() before doing any other state manipulation. By closing the underlying comm, this will also terminate the handle_worker, so incoming messages won't be processed.
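
A rough sketch of that reordering (illustrative only; the real remove_worker parameters and the rest of its cleanup are abbreviated here):

# Sketch of the proposed ordering inside Scheduler.remove_worker; not a drop-in patch.
# Close the connection to the worker *before* mutating any scheduler state, so no
# message from (or to) that worker can race with the cleanup below.
async def remove_worker(self, address, *, close=True, **kwargs):  # other parameters elided
    if address not in self.workers:
        return "already-removed"

    # 1. Flush and close the batched stream first. Closing the underlying comm also
    #    terminates the handle_worker coroutine that reads from this worker.
    bcomm = self.stream_comms.get(address)
    if bcomm is not None:
        if close:
            bcomm.send({"op": "close", "report": False})
        await bcomm.close()  # flushes queued messages, then closes the comm

    # 2. Only now forget the WorkerState, reschedule its processing tasks, etc.
    ...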

gjoseph92 commented 2 years ago

2. If a worker reconnects, a task can be stolen to the WorkerState object for the old, disconnected worker.

  1. Stealing decides to move a task to worker X.
    1. It queues a steal-request to worker Y (where the task is currently queued), asking it to cancel the task.
    2. Stores a reference to the victim and thief WorkerStates (not addresses) in WorkStealing.in_flight
  2. Worker X gets removed by the scheduler.
  3. Its WorkerState instance—the one currently referenced in WorkStealing.in_flight—is removed from Scheduler.workers.
  4. Worker X heartbeats to the scheduler, reconnecting (bug described above).
  5. A new WorkerState instance for it is added to Scheduler.workers, at the same address. The scheduler thinks nothing is processing on it.
  6. Worker Y finally replies, "hey yeah, it's all cool if you steal that task".
  7. move_task_confirm handles this, and pops info about the stealing operation from WorkStealing.in_flight.
  8. This info contains a reference to the thief WorkerState object. This is the old WorkerState instance, which is no longer in Scheduler.workers.
  9. The thief's address is in Scheduler.workers, even though the thief object isn't.
  10. The task gets assigned to a worker that, to the scheduler, no longer exists.
  11. When worker X actually shuts itself down, Scheduler.remove_worker goes to reschedule any tasks it's processing. But it's looking at the new WorkerState instance, and the task was assigned to the old one, so the task is never rescheduled.

From @bnaul's cluster dump:

# scheduler event log

# worker joins
- - 2022-05-02 20:06:13.639737
  - action: add-worker
    worker: tcp://10.126.71.26:32909

# 1.5min later, `retire_workers` removes it
- - 2022-05-02 20:07:43.847567
  - action: remove-worker
    processing-tasks:
        ... # there are some, doesn't matter what they are
    worker: tcp://10.126.71.26:32909

# 10s later, the heartbeat-reconnect race condition makes it reconnect
- - 2022-05-02 20:07:52.530403
  - action: add-worker
    worker: tcp://10.126.71.26:32909

# 2s later
# at some point in the past (before the disconnect-reconnect),
# we had planned to steal a task from some other worker and run it on this one.
# the other worker just replied, "ok you can steal my task".
# stealing now sets the task to be processing on our worker in question.
# however, `move_task_confirm` has a reference to the old, pre-disconnect `WorkerState` instance.
# it passes checks, because the _address_ is still in Scheduler.workers.
stealing:
- - 2022-05-02 20:07:54.831012
  - - confirm
    - ('read-parquet-14e1bf051c27f27510e4873d5022c6a8', 10716)  # <-- stuck task
    - ready
    - tcp://10.125.19.19:37649
    - tcp://10.126.71.26:32909  # <-- our worker in question is the thief
    - steal-11761

# 0.4s later
# worker shuts itself down, closing comms and triggering another `remove_worker`.
# but it appears to not be running anything, even though we know it just stole a task!
- - 2022-05-02 20:07:55.195962
  - action: remove-worker
    processing-tasks: {}  # <-- empty, because this is showing the new WorkerState object. task above got assigned to old WorkerState object
    worker: tcp://10.126.71.26:32909
# Scheduler.tasks
('read-parquet-14e1bf051c27f27510e4873d5022c6a8', 10716):  # <-- stuck task
  state: processing
  processing_on: '<WorkerState ''tcp://10.126.71.26:32909'', name: 250, status: closed,
    memory: 0, processing: 10>'
# It's processing on a WorkerState that's closed. That WorkerState has managed to collect 10 processing tasks.
# Same address as our logs above---but it's a different WorkerState instance! (And doesn't exist in `Scheduler.workers`.)

Discussion

This should be a relatively simple fix in work-stealing. Either:

  1. Store worker addresses in in_flight, not WorkerState instances.
  2. Validate that self.scheduler.workers.get(thief.address) is thief (a sketch of this check follows below).
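
A minimal sketch of option 2 (names and placement are illustrative, not the actual patch):

# Sketch: reject a steal confirmation whose thief WorkerState is stale.
# A reconnect creates a *new* WorkerState at the same address, so an identity
# check catches what a plain address check misses.
def thief_is_current(scheduler, thief) -> bool:
    return scheduler.workers.get(thief.address) is thief

# Inside WorkStealing.move_task_confirm, before assigning the task to the thief
# (pseudocode of the guard only):
#     if not thief_is_current(self.scheduler, thief):
#         return  # stale thief; leave the task to be (re)scheduled normally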

Even with that, I'm wary of having a worker disconnect and reconnect, and still trusting our decision to move a task to that worker.

But more broadly, it's interesting because it's essentially a classic use-after-free bug. You can't dereference null pointers in Python, but you can very easily work with objects that have logically been freed.

It points to how reference leaks like https://github.com/dask/distributed/issues/6250 aren't just a memory issue—they can also indicate correctness issues. If something is keeping a reference alive to an object that should be freed, there's a chance it might actually use that reference! Python, unfortunately, makes it very easy to be fast and loose with object lifetimes like this.
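
(To make the analogy concrete, here is a tiny self-contained illustration of the pattern; plain Python, not distributed code:)

# A stale reference to an object that has already been replaced in the
# authoritative dict: the address-based check passes, the identity check fails.
class FakeWorkerState:
    def __init__(self, address):
        self.address = address

workers = {}
addr = "tcp://10.0.0.1:1234"
workers[addr] = old = FakeWorkerState(addr)

# ...worker is removed, then "reconnects" as a brand-new object...
del workers[addr]
workers[addr] = FakeWorkerState(addr)

assert old.address in workers           # looks fine: the address is still known
assert workers[old.address] is not old  # but the object we still hold is logically "freed"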

bnaul commented 2 years ago

Thanks @gjoseph92, super interesting! And yes, I believe I had just bumped to 2022.5.0; if not, it was main from around the day of release.

gjoseph92 commented 2 years ago

Hey @bnaul, could you try things out with the latest release (2022.5.1)? The buggy codepath of https://github.com/dask/distributed/issues/6356 / https://github.com/dask/distributed/issues/6392 is still around, but with worker reconnect removed (https://github.com/dask/distributed/pull/6361), I think it's much, much less likely to be triggered.

I'd be surprised if you still see this issue. Though I'd like to keep it open until https://github.com/dask/distributed/issues/6356 is actually fixed.

rbavery commented 2 years ago

I'm still seeing an issue like this for the May, April, and March releases of Dask and Distributed. About ~300 tasks are processed across my 32-core machine, and I see on htop that the cores are firing at 100%. Then core use drops to near 0 and tasks are stuck in "processing".

[screenshot]

I'm testing by reinstalling both dask and distributed from pip, and I wasn't encountering this error back in April. Is there something that would cause this error to occur across dask versions if it was previously run with the version that has the bug?

EDIT: bringing num_workers down to less than my number of cores, rather than letting the default be used, solves my issue; there are no more pauses.
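
(For anyone hitting the same thing, a sketch of that workaround, assuming a local distributed cluster; the worker/thread counts below are only examples, not a recommendation:)

# Sketch: pin the worker count explicitly instead of relying on the default,
# e.g. on a 32-core machine:
from dask.distributed import Client

client = Client(n_workers=16, threads_per_worker=1)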

gjoseph92 commented 2 years ago

@rbavery generically, what you're describing sounds like a deadlock, but there's not enough information here to be sure that you're seeing the deadlock referenced in this particular issue. (There are a number of different bugs which can cause deadlocks; the overall symptom of "tasks stuck in processing" is the same.)

Could you open a new issue for this, and include any logs from the scheduler and workers you can get?

Also, could you make sure you're using the latest version (2022.5.1 or 2022.5.2)? We fixed one of the more common sources of deadlocks in the release 2 days ago. (Though if you're running a local cluster, I would not expect this one to affect you.)

mrocklin commented 2 years ago

Anything interesting in logs? If you're running with a local cluster you might want the flag silence_logs=False

client = Client(silence_logs=False)

rbavery commented 2 years ago

Thanks for the tips @gjoseph92 and @mrocklin . I tested 2022.5.2 but still ran into this. I'll open a new issue.

fjetter commented 2 years ago

@gjoseph92 IIUC we currently assume this should be closed by https://github.com/dask/distributed/issues/6356?

gjoseph92 commented 2 years ago

By whatever fixes #6356, yes