Open jakirkham opened 3 years ago
cc @tomwhite @jcrist (who may have thoughts on these questions)
One question that comes up with this is whether we like the parameter name,
chunksize
(taken fromconcurrent.futures
and used elsewhere ( dask/distributed#3650 )) or if that's confusing and we want to use something else likebatchsize
.
I would find this a little confusing, and might lean towards batchsize
.
Another question is this sits at the top-level config (similar to other things like
pool
andnum_workers
), is that where we want this? Lastly is this parameter useful outside the local scheduler (like would Distributed want to use this)
I don't think I can comment sensibly on this, but perhaps @tomwhite or @jcrist have thoughts.
@jakirkham is this something you are still interested in pursuing? I wasn't even aware this was an option, though I agree with @GenevieveBuckley that chunksize
is a bit overloaded and confusing.
This issue was mainly to get feedback on these config values and whether we could handle them better. Given how quiet this issue has been, am guessing most people are not thinking about this.
It is. However the chunksize
name comes from the concurrent.future
s API, which is what Dask is using under-the-hood. That said, no strong feelings on either name.
Given how long it has been there, we would need to go through a deprecation cycle to change it.
Recently we added the ability to batch submissions with the local scheduler using
chunksize
as users have voiced interest ( https://github.com/dask/dask/issues/6220#issuecomment-648878068 ). Mostly this is unused (so a no-op) with the exception of themultiprocessing
case where we do some batching to amortize the cost of process spin up. It's also not that visible in the documentation or elsewhere. That said, we might want to share this more broadly.One question that comes up with this is whether we like the parameter name,
chunksize
(taken fromconcurrent.futures
and used elsewhere ( https://github.com/dask/distributed/pull/3650 )) or if that's confusing and we want to use something else likebatchsize
. Another question is this sits at the top-level config (similar to other things likepool
andnum_workers
), is that where we want this? Lastly is this parameter useful outside the local scheduler (like would Distributed want to use this)