Open dyang37 opened 2 years ago
Some further debugging shows that the worker died because the scheduler failed to hear from it for a certain period of time (300 sec). According to this source (https://github.com/dask/distributed/issues/6324), this occurs when distributed tasks hold the GIL for long periods of time. Pure Python code is unlikely to hold the GIL for that long, but our code involves Cython, which can hold the GIL for the entire duration of a computation, and that is what created the problem.
I started seeing jobs repeatedly getting canceled after we upgraded the dask version. This also agrees with the observations in the link above, which states that this behavior is observed in dask.distributed==2022.5.0.
You can see the PR to the dask source code that caused our issue here: https://github.com/dask/distributed/pull/6200
A quick fix for now is simply to downgrade the dask version.
A more elegant fix could be either (1) figure out a way for the Cython code to release the GIL, or (2) increase the default timeout interval, which is currently 300 sec.
Unfortunately there does not seem to be an easy fix; both solutions require some workaround.
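For option (1), the general pattern in Cython is to wrap the long-running, C-level part of the computation in a `with nogil:` block so the worker's heartbeat thread can keep running. A minimal sketch, where `heavy_kernel.h` and `c_heavy_kernel` are hypothetical placeholders and not actual mbircone code:

```cython
# Hypothetical sketch (not actual mbircone code): release the GIL around the
# long-running C-level work so the worker's heartbeat thread keeps running.
cdef extern from "heavy_kernel.h":
    void c_heavy_kernel(double *data, int n) nogil

def run_heavy_kernel(double[::1] data):
    cdef int n = <int>data.shape[0]
    cdef double *ptr = &data[0]
    with nogil:
        # The GIL is released here, so the scheduler heartbeat is not starved
        # while the C kernel does the heavy lifting.
        c_heavy_kernel(ptr, n)
```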
Here are the error logs:
2022-07-27 00:26:50,524 - distributed.scheduler - WARNING - Worker failed to heartbeat within 300 seconds. Closing: <WorkerState 'tcp://172.18.33.46:32965', name: SLURMCluster-4, status: running, memory: 12, processing: 3>
2022-07-27 00:26:50,525 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.18.33.46:32965', name: SLURMCluster-4, status: running, memory: 12, processing: 3>
2022-07-27 00:26:50,525 - distributed.core - INFO - Removing comms to tcp://172.18.33.46:32965
Regarding fix method 2 (increasing the default timeout interval), here's how to do it: in the jobqueue config file (usually `~/.config/dask/jobqueue.yaml`), add `worker-ttl: null` under the `scheduler` section.
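For reference, here is a minimal sketch of applying the same setting programmatically with `dask.config.set` instead of editing the YAML file. In the standard Dask config namespace the option is `distributed.scheduler.worker-ttl`; doing this in the launch script (before the cluster is created) is an assumption of this sketch, not part of the original fix:

```python
import dask

# Minimal sketch (alternative to editing ~/.config/dask/jobqueue.yaml):
# disable the scheduler-side heartbeat timeout before any cluster/scheduler
# is created. None means "never expire workers that miss heartbeats".
dask.config.set({"distributed.scheduler.worker-ttl": None})

# Sanity check: confirm the value the scheduler will see.
print(dask.config.get("distributed.scheduler.worker-ttl"))
```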
Now the scheduler no longer kills processes!!
This is basically reverting Dask's update here: https://github.com/dask/distributed/pull/6200/commits/d94ab9a88ff9076915ff1e744cc735ee380d31ce
Diyu, I don't totally understand this, but can you make some change, perhaps to the timeout interval, that fixes the problem and then do a PR?
Sure. Basically what happened is that the scheduler did not hear back from a compute node for a long time (300 sec), so the scheduler assumed the node was dead and canceled it, while in fact the node was still actively performing computation. The fix is to set this timeout period to infinity instead of Dask's default of 300 sec.
This fix requires the user to manually change the timeout value inside the Dask config file (usually stored in the home directory as `~/.config/dask/jobqueue.yaml`).
Perhaps we can add the modified config file to the mbircone code base (for example, create a new directory `mbircone/configs/dask/` and put the config file in this directory)?
Meanwhile I'll see whether there's a way to directly pass the timeout as an argument to our function `multinode.get_cluster_ticket()`. If so, then we can simply set this argument for the user, and the user will not need to manually change the config file.
Will do a PR on this.
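A hypothetical sketch of what that could look like; the `worker_ttl` keyword, the wrapper body, and the SLURM settings are all assumptions, not the current `multinode.get_cluster_ticket()` implementation:

```python
import dask
from dask_jobqueue import SLURMCluster

def get_cluster_ticket(num_nodes, worker_ttl=None, **slurm_kwargs):
    """Hypothetical wrapper: expose the scheduler timeout as an argument.

    The "worker_ttl" parameter and this body are assumptions, not the
    actual mbircone multinode.get_cluster_ticket() implementation.
    """
    # Apply the timeout before the cluster (and its local scheduler) exists;
    # None disables the "worker failed to heartbeat" check entirely.
    dask.config.set({"distributed.scheduler.worker-ttl": worker_ttl})
    cluster = SLURMCluster(**slurm_kwargs)
    cluster.scale(jobs=num_nodes)
    return cluster
```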
Recently I've been running into the issue of nodes getting canceled when performing multi-node computation with dask. Here's the error log from a canceled node:
I can confirm that this is not due to a small `death_timeout` duration, as I set `death_timeout` to 1200 sec, while the node cancelation happens rather early (~5 min after I got the nodes). Furthermore, I observed that a large chunk of the multi-node jobs gets canceled: