dask / dask

Parallel computing with task scheduling
https://dask.org
BSD 3-Clause "New" or "Revised" License
12.6k stars 1.71k forks source link

Local scheduler parameter `chunksize` #7510

Open jakirkham opened 3 years ago

jakirkham commented 3 years ago

Recently we added the ability to batch submissions with the local scheduler using chunksize as users have voiced interest ( https://github.com/dask/dask/issues/6220#issuecomment-648878068 ). Mostly this is unused (so a no-op) with the exception of the multiprocessing case where we do some batching to amortize the cost of process spin up. It's also not that visible in the documentation or elsewhere. That said, we might want to share this more broadly.

One question that comes up with this is whether we like the parameter name, chunksize (taken from concurrent.futures and used elsewhere ( https://github.com/dask/distributed/pull/3650 )) or if that's confusing and we want to use something else like batchsize. Another question is this sits at the top-level config (similar to other things like pool and num_workers), is that where we want this? Lastly is this parameter useful outside the local scheduler (like would Distributed want to use this)

jakirkham commented 3 years ago

cc @tomwhite @jcrist (who may have thoughts on these questions)

GenevieveBuckley commented 3 years ago

One question that comes up with this is whether we like the parameter name, chunksize (taken from concurrent.futures and used elsewhere ( dask/distributed#3650 )) or if that's confusing and we want to use something else like batchsize.

I would find this a little confusing, and might lean towards batchsize.

Another question is this sits at the top-level config (similar to other things like pool and num_workers), is that where we want this? Lastly is this parameter useful outside the local scheduler (like would Distributed want to use this)

I don't think I can comment sensibly on this, but perhaps @tomwhite or @jcrist have thoughts.

ian-r-rose commented 2 years ago

@jakirkham is this something you are still interested in pursuing? I wasn't even aware this was an option, though I agree with @GenevieveBuckley that chunksize is a bit overloaded and confusing.

jakirkham commented 2 years ago

This issue was mainly to get feedback on these config values and whether we could handle them better. Given how quiet this issue has been, am guessing most people are not thinking about this.

It is. However the chunksize name comes from the concurrent.futures API, which is what Dask is using under-the-hood. That said, no strong feelings on either name.

Given how long it has been there, we would need to go through a deprecation cycle to change it.