dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org

Support `executor="processes"` and the like to Worker #5319

Open · gjoseph92 opened this issue 3 years ago

gjoseph92 commented 3 years ago

Being able to specify the worker's executor(s) is a neat feature that's probably under-used in part because it's currently a somewhat cumbersome API: you need to create and pass in the executor instance(s) you want to use.
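For concreteness, this is roughly what the current approach looks like when you start the worker yourself (the pool size and scheduler address below are placeholders):

```python
from concurrent.futures import ProcessPoolExecutor
from dask.distributed import Worker

async def start_worker():
    # Today you must build the executor object yourself and hand the live
    # instance to the Worker; there's no shorthand for "give me a process pool".
    executor = ProcessPoolExecutor(max_workers=4)  # placeholder pool size
    return await Worker("tcp://scheduler:8786", executor=executor)
```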

On cloud deployments or when using a spec cluster, this is awkward: you currently have to write a worker plugin to do it (see https://youtu.be/vF2VItVU5zg?t=515). Instead, it would be nice if you could just do something like the sketch below:
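(A sketch of the proposed shorthand, assuming a string alias like `"processes"` would map to a stdlib executor sized for the worker; none of this exists today, and the exact spelling is open.)

```python
from dask.distributed import Worker

# Proposed shorthand (sketch): a string picks the executor type, so deployment
# tooling can pass plain keyword arguments instead of live Executor objects.
worker = Worker("tcp://scheduler:8786", executor="processes")

# ...which would also fit naturally into a SpecCluster-style worker spec:
worker_spec = {"cls": Worker, "options": {"nthreads": 4, "executor": "processes"}}
```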

If this is done, some other cleanup should follow:

gjoseph92 commented 2 years ago

Another thing we should consider cleaning up when we do this is https://github.com/dask/distributed/pull/693. The current `memory_limit = MEMORY_LIMIT * min(1, nthreads / total_cores)` behavior is built around the assumption that you're running `dask-worker` multiple times on the same machine to get multiple processes. If `dask-worker --nprocs` actually just created a `ProcessPoolExecutor` on one worker (it's debatable whether that should be the behavior), then we wouldn't need to leave room for other worker processes.
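For reference, that heuristic amounts to roughly the following (`nthreads` here is a placeholder):

```python
import psutil
from multiprocessing import cpu_count

# Each dask-worker process assumes it may share the machine with other
# dask-worker processes, so it only claims memory in proportion to the
# share of cores its threads represent.
MEMORY_LIMIT = psutil.virtual_memory().total  # total system memory, in bytes
total_cores = cpu_count()
nthreads = 4  # placeholder: threads assigned to this worker

memory_limit = MEMORY_LIMIT * min(1, nthreads / total_cores)

# With a single worker owning a ProcessPoolExecutor there would be no other
# worker processes to leave room for, and the min(...) scaling could go away.
```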

gjoseph92 commented 2 years ago

We should consider the inter-process communication costs of using `ProcessPoolExecutor`s, though. This would make https://github.com/dask/distributed/issues/4497 a lot more relevant. Since `dask-worker --nprocs` currently creates multiple complete worker processes, each with its own data, the scheduler essentially handles locality for us and prefers to run a task in the same process as its dependencies (since each process is a separate worker).

Naively swapping in a `ProcessPoolExecutor` for the `ThreadPoolExecutor` could lead to terrible performance, since data would get serialized back and forth between `worker.data` and every task. The worker architecture would need to change quite a bit: either `data` becomes shared memory (maybe more straightforward, but tasks would probably still copy their results), or each process gets its own process-local data store, plus some simple locality-based task scheduling on the Worker itself and logic for moving data between processes (and workers) as necessary.
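As a toy illustration (plain stdlib, not Dask) of the round trip a process pool forces on every task whose input lives in the parent process:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import numpy as np

def total(x):
    return x.sum()

if __name__ == "__main__":
    data = np.random.random(25_000_000)  # ~200 MB, standing in for worker.data

    # Thread pool: the task sees the same memory; nothing is copied.
    with ThreadPoolExecutor() as pool:
        pool.submit(total, data).result()

    # Process pool: the array is pickled, shipped to a child process, and the
    # result pickled back. A downstream task running in a different process
    # would pay the same round trip again.
    with ProcessPoolExecutor() as pool:
        pool.submit(total, data).result()
```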

mrocklin commented 2 years ago

I suspect that most folks who want a process pool are dealing with applications that are compute heavy rather than data heavy. That's just a guess though, and certainly not universal.
