You might try looking at what the workers are working on using the Client.call_stack method or looking in the worker pages in the "Info" tab of the dashboard.
As always, if you're able to provide a minimal reproducible example, that's often a helpful approach. See this blog
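For the call-stack route, something like this (the scheduler address is a placeholder) prints what each worker is currently executing; an empty result means no task is actively running anywhere:

from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address for your scheduler

# Snapshot of the call stacks of tasks currently executing on each worker.
stacks = client.call_stack()
for worker, tasks in stacks.items():
    print(worker, list(tasks))  # task keys actively running on that worker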
I looked at the Info tab, but it just shows that none of them are processing anything. I am pretty new to dask and don't really know how to debug this; I'm not sure what to look for in the call stack. Could it be because I persist dataframes in memory? The memory is never 100% used, though. It only happens at the very end: workers progressively do less work as I advance my computation.
There isn't a whole lot of code I can reproduce. It also only happens when I have a large dataset; it works fine on smaller sets.
[Screenshot: Screen Shot 2019-07-12 at 12 30 27 PM] https://user-images.githubusercontent.com/1936054/61143672-e37f3c80-a4a0-11e9-98d3-378c1e750404.png
In that case I recommend trying to produce a minimal reproducible example so that maintainers are able to help you
I think I'm getting something similar. I ended up with workers that looked like the screenshot below.
One worker has collected a whole bunch of tasks that are all in processing but not doing anything, and then nothing else is processing.
Manually using client.retire_workers helped to unstick things for me.
There were definitely tasks on the queue for the worker that I retired that had been put there after another worker died, as one I clicked on had been marked "suspicious: 1", which I believe happens when a task was previously running on a worker that was killed.
[Screenshot: Screenshot from 2019-07-25 22-46-56] https://user-images.githubusercontent.com/1796208/61925123-72279b00-af30-11e9-9566-b1233e81ec65.png
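Roughly what I did to unstick things (the addresses here are just examples; the worker address came from the dashboard, not from my real cluster):

from dask.distributed import Client

client = Client("tcp://scheduler:8786")    # placeholder scheduler address
stuck_worker = "tcp://10.0.0.5:40000"      # example address copied from the worker "Info" page
# Retiring the worker moves its data elsewhere and lets its stuck tasks be rescheduled.
client.retire_workers(workers=[stuck_worker])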
That's odd. Do you happen to have logs from that worker?
I imagine that this is hard, but if anyone has a reproducible example, that would be helpful.
Also, getting the output of Client.call_stack (also available through the worker info pages) would be helpful in determining what that odd worker is up to.
I’ve also had this, and in my case it seemed to be some sort of race condition in the scheduler. For me, it was fixed by moving the scheduler from toolz to cytoolz, which also seemed to imply it was some sort of timing issue.
moving the scheduler from toolz to cytoolz
@rbubley can you share info on how you did that?
@birdsarah, it is literally as simple as a pip/conda install of cytoolz: distributed attempts to import cytoolz in preference to toolz.
(Various bits of the distributed library code use this pattern, preferring cytoolz when it is available:

try:
    from cytoolz import reduce
except ImportError:
    from toolz import reduce

)
In a recent run with the same problem, I had about 450 jobs out of 100k+ remaining and listed as ready, but they weren't on any worker. A call to client.call_stack returned an empty dict.
I discovered that, presumably due to some packaging conflicts, I had both toolz and cytoolz installed. I have removed toolz and am going to see what happens.
Oh... okay, so the options are toolz only, or toolz and cytoolz (dask won't run without plain toolz).
Holy moly, that seems to have done the trick. I have only got a couple of results so far and so am loath to get too excited, but I think you might be on for hero of the week, @rbubley.
@mrocklin I also observed the loads redistributing as you described once I removed cytoolz.
Just so I'm clear: was there a mismatch in environments, where some workers/clients/schedulers had cytoolz and some didn't, or were the environments homogeneous before, and did another homogeneous change to those environments help restore things?
@mrocklin, in my case, they were homogeneous (no cytoolz on client/scheduler/worker) and I saw the problem; the problem was fixed when it became inhomogeneous (cytoolz only on scheduler). But my understanding is that cytoolz is supposed to have the same results as toolz (up to the documented return values; non-guaranteed orderings can change), and conceptually this shouldn't be a library where homogeneity is required. (Although something subtle around orderings is always possible.)
What we should expect is that changes between toolz and cytoolz affect timings, and could therefore surface previously unobserved race conditions. Given the observed effects, i.e. some workers being idle when they shouldn't be, I would guess that the issue is with the stealing code - perhaps some series of actions is assumed to be atomic when it isn't.
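If someone wants to test that guess, one (untested) way would be to switch work stealing off before starting the scheduler and see whether the stalls still occur:

import dask

# Disable the scheduler's work-stealing machinery; this needs to be set in the
# process/environment where the scheduler is created in order to take effect.
dask.config.set({"distributed.scheduler.work-stealing": False})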
Things were homogeneous. What I meant was that on initial environment creation, dependency management had meant that both toolz and cytoolz were installed. That environment was then packaged with conda pack and shipped to all workers. When I removed cytoolz, repackaged with conda pack, and restarted the Dask-Yarn cluster, things improved.
So @rbubley 's situation improved when he added cytoolz and @birdsarah 's situation improved when she removed it?
To be clear, I couldn't test having only cytoolz in my environment; dask / dask-yarn wouldn't import.
Edit: I couldn't readily test it.
I have been reliably removing cytoolz from my environment for a while.
Just to note that today I experienced this issue for the first time in a while, with cytoolz removed. It may be unrelated, but I thought I'd document it. It was on write_parquet tasks which seemed to have completed but were not registered as such. When I retire the affected worker, the job can finish with no apparent loss of data.
I'm not 100% sure if I am seeing the same thing discussed here, but often I see one or a handful of jobs stalled in "processing" on some number of workers. If I go to the task in the dashboard I see, e.g.
Status: processing
Processing on: tls://10.128.2.162:40554
Call stack: <link>
Priority: (0, 1, 360)
Retries: 3
If I click the link to the call stack I see "Task not actively running. It may be finished or not yet started", which seems odd. If I retire the worker (e.g. in this case, client.retire_workers(["tls://10.128.2.162:40554"])), then the task gets unstuck and the job completes.
I am having the same problem.
Some task (of several similar ones launched with client.map, so nothing particular about it) gets stuck as described above by others, while neither the worker nor the scheduler is using much CPU, and trying to get the call stack reports the task as not running. Shutting down the worker seems to help.
It happens frequently enough that dask is hard to use without manual intervention. Is there any debugging information I can provide?
The environment I have on all my workers is the latest dask from conda-forge, which installs cytoolz as a dependency.
After some investigation, I think what is happening is that some memory limit (80%?) is hit, and that somehow causes tasks to be marked as started but never actually processed. This can happen because of e.g. #3530.
Here is a way to reproduce the problem reliably on my machine:
from dask.distributed import Client

N = 1_000_000_000  # worker memory limit in bytes

def leak(mem):
    # Allocate and touch `mem` bytes directly through libc so the memory is not
    # tracked by any Python object and the worker cannot spill it to disk.
    import ctypes
    libc = ctypes.CDLL("libc.so.6")
    libc.malloc.restype = ctypes.c_void_p
    x = libc.malloc(mem)
    x = ctypes.cast(x, ctypes.c_void_p)
    libc.memset(x, 1, mem)

def f():
    return "Hello"

if __name__ == '__main__':
    c = Client(threads_per_worker=1, n_workers=1, memory_limit=N)
    # Leak just over 80% of the memory limit on the single worker.
    leak_future = c.submit(leak, int(N * 0.81))
    # This is needed here for some reason
    import time
    time.sleep(1)
    future = c.submit(f)
    leak_future.result()
    print("Leaked memory")
    future.result()
    print("Never reached this. Cluster deadlocked")
Running the script above, I see a continuous stream of warnings:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 864.24 MB -- Worker memory limit: 1000.00 MB
but I do not see them when the cluster is similarly deadlocked with the production code.
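For reference, the worker memory thresholds that I believe are involved are fractions of the memory limit and can be changed through dask's config; the values below are, as far as I understand, the defaults, and the 0.80 "pause" fraction is the one I suspect here:

import dask

# Set these before the workers are created for them to take effect.
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling data to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # pause executing new tasks on the worker
    "distributed.worker.memory.terminate": 0.95,  # kill and restart the worker
})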
Hi, I have a similar issue to those reported above.
I am processing 800-1000 3D microscopy images using dask/dask-jobqueue on an htcondor-managed cluster. Each 3D image is processed in parallel (embarrassingly parallel application), in a separate job managed by a worker. The min and max number of jobs/workers is regulated by adaptive scaling (min, max). The expected processing time of a task is quite long (~1 hr); roughly, the setup looks like the sketch below.
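(The function name, image list, and resource numbers are placeholders rather than my actual values, and client.map stands in for however the tasks are really submitted.)

from dask_jobqueue import HTCondorCluster
from dask.distributed import Client

cluster = HTCondorCluster(cores=1, memory="16 GB", disk="10 GB")  # one worker per HTCondor job
cluster.adapt(minimum=1, maximum=100)  # adaptive scaling between min and max jobs/workers
client = Client(cluster)

futures = client.map(process_image, image_paths)  # one long (~1 hr) task per 3D image
results = client.gather(futures)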
The processing time of each task increases over time, and usually the last tasks take hours (~5-10 hrs) to complete, or the processing stalls. The tasks that run for a long time or stall are random (different images can be processed last). This is similar to what is reported here (#2835), but having cytoolz didn't solve the issue.
The task log for these long-running tasks shows that the task status cycles between waiting --> running and processing --> released many, many times. Processing resources aren't an issue. In some cases an error related to the connection to the scheduler is printed (see below), so it seems that the worker (still up in the dashboard) temporarily loses the connection to the scheduler:
distributed.utils - ERROR - Timed out during handshake while connecting to tcp://192.168.0.2:35688 after 10 s
Traceback (most recent call last):
File "/home/simone/mini/envs/test_d/lib/python3.8/site-packages/distributed/comm/core.py", line 319, in connect
handshake = await asyncio.wait_for(comm.read(), time_left())
File "/home/simone/mini/envs/test_d/lib/python3.8/asyncio/tasks.py", line 490, in wait_for
raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/simone/mini/envs/test_d/lib/python3.8/site-packages/distributed/utils.py", line 655, in log_errors
yield
File "/home/simone/mini/envs/test_d/lib/python3.8/site-packages/distributed/scheduler.py", line 3515, in retire_workers
await self.replicate(
File "/home/simone/mini/envs/test_d/lib/python3.8/site-packages/distributed/scheduler.py", line 3281, in replicate
results = await asyncio.gather(
File "/home/simone/mini/envs/test_d/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/home/simone/mini/envs/test_d/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/home/simone/mini/envs/test_d/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
File "/home/simone/mini/envs/test_d/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect
comm = await connect(
File "/home/simone/mini/envs/test_d/lib/python3.8/site-packages/distributed/comm/core.py", line 324, in connect
raise IOError(
OSError: Timed out during handshake while connecting to tcp://192.168.0.2:35688 after 10 s
It is difficult for me to provide a working minimal example because the issue only happens after the processing has been going on for quite a while (hours), and not always after the same amount of time.
Any hint about what to look for in order to do proper debugging would be helpful.
Thanks.
This issue still requires more info but there has been no activity here for many years, so I'm going to close it out.
Hi, I am using dask distributed with 300 workers. I have a cluster of 32 machines with over 300 CPUs, and I want the computation to be distributed across all workers. Currently, when I run my job it works fairly well and distributes all the tasks; however, towards the end the workers stop working and only one or two have high CPU usage.
Generally I have 8000 time series that I want to process and I want the computation to be distributed, so I use dask delayed and then call compute on the list, roughly as sketched below.
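(process_series and timeseries_list are placeholders for my real function and data.)

import dask
from dask import delayed

tasks = [delayed(process_series)(ts) for ts in timeseries_list]  # ~8000 lazy tasks
results = dask.compute(*tasks)  # run them all on the distributed cluster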
I have the following images from the dashboard. I just don't understand why it stops distributing at the very end.
Any help would be appreciated!