We might also consider adopting a convention like this within Dask. The term `get` is a pretty bad name. It actually comes all the way from the first commit.
Ok, I've cleaned this up and added support for `n_jobs`, which has been asked for in the past.
A few potential issues:
This makes it a bit tricky to provide good error messages on bad parameters, since we just pass all unrecognized values to `Client`. I've tried to produce a nice error message for this, but there may be a better approach.
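For concreteness, a minimal sketch of the kind of wrapping involved; the `_make_client` helper name is hypothetical:

```python
from distributed import Client

def _make_client(address, **kwargs):
    # All unrecognized keywords are forwarded to Client; if it rejects
    # one, re-raise with a hint that it also wasn't a recognized
    # scheduler parameter.
    try:
        return Client(address, **kwargs)
    except TypeError as e:
        raise TypeError("Unexpected keyword arguments %r: not recognized "
                        "by this library, and rejected by "
                        "dask.distributed.Client (%s)" % (sorted(kwargs), e))
```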
I'm also not 100% pleased with the scheduler names. Valid names are:

- `threading`
- `multiprocessing`
- `sync`
The first two are good, and match joblib but not the dask docs (we say `threaded` instead of `threading`). The third is an abbreviation, which may be undesired. Perhaps `synchronous` (a bit long though)? The equivalent in joblib is `sequential`. I don't think that's a good name here, as we don't enforce an ordering of execution (which sequential would seem to indicate).
We also have a different default (threading, `n_jobs=cpu_count()`) than scikit-learn (processes, `n_jobs=1`). I think this makes sense though.
I think that following joblib over dask names (like threading over threaded) makes sense for this interface. You could also support aliases so that all of sync, sequential, synchronous would map to the same value.
Actually, I like this idea because then we can establish these names in more parts of Dask where we might prefer threaded over threading.
```python
scheduler_types = {
    'sync': 'synchronous',
    'sequential': 'synchronous',  # from joblib
    'threading': 'threaded',
    # ...
}
scheduler = scheduler_types.get(scheduler, scheduler)
```
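For reference, here's a sketch of how the normalized name could then be resolved to a `get` function. The `get_scheduler` helper is hypothetical, but the `get` functions are dask's actual entry points:

```python
import dask.threaded
import dask.multiprocessing
import dask.local

# Canonical name -> get function.
_gets = {'threaded': dask.threaded.get,
         'multiprocessing': dask.multiprocessing.get,
         'synchronous': dask.local.get_sync}

def get_scheduler(name):
    # Normalize aliases using the scheduler_types mapping above, then
    # look up the corresponding get function.
    name = scheduler_types.get(name, name)
    try:
        return _gets[name]
    except KeyError:
        raise ValueError("Unknown scheduler: %r" % name)
```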
I would like to give people the option to provide names rather than `get=` functions.
I liked the alias idea and added it. I think this is good to go. Merging.
Allow specifying the scheduler by name instead of passing in the `get` function directly. The scheduler can be one of `threading`, `multiprocessing`, or `sync`, or an instance of `dask.distributed.Client`. Not fully set on this yet.
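For illustration, hypothetical usage under this proposal (`GridSearchCV`, `estimator`, and `param_grid` are placeholders, not necessarily the names in this codebase):

```python
import dask.threaded
from distributed import Client

# Current interface: pass a get function directly.
search = GridSearchCV(estimator, param_grid, get=dask.threaded.get)

# Proposed interface: name the scheduler, or pass a Client to use
# the distributed scheduler.
search = GridSearchCV(estimator, param_grid, scheduler='threading')
search = GridSearchCV(estimator, param_grid, scheduler=Client('127.0.0.1:8786'))
```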
Pros:
Cons:
If we go this route, then I'd also add `n_jobs` as a parameter (matching scikit-learn), which would specify `num_workers` for the threading and multiprocessing schedulers, and be ignored by the others. Might also make `n_jobs=1` for all but distributed result in the synchronous scheduler (see the sketch below). The downside of supporting `n_jobs` here is we'd probably want the default to match what dask does (`n_jobs = cpu_count()`) instead of what scikit-learn does (`n_jobs=1`). I'm fine with this, but it is a difference.

If we don't go this route, then I might add a `scheduler_kwargs` parameter instead, which would be forwarded to the `get` call. Not sure if any of the other keyword arguments would prove useful for this library though.
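A sketch of the `n_jobs` dispatch described above; the `scheduler_get` helper is hypothetical, but `num_workers` is a real keyword of dask's threaded and multiprocessing `get` functions:

```python
from functools import partial
from multiprocessing import cpu_count

import dask.threaded
import dask.multiprocessing
import dask.local

def scheduler_get(scheduler, n_jobs=None):
    # Default matches dask (all cores) rather than scikit-learn (n_jobs=1).
    if n_jobs is None:
        n_jobs = cpu_count()
    if scheduler not in ('threaded', 'multiprocessing'):
        raise ValueError("Unknown scheduler: %r" % scheduler)
    if n_jobs == 1:
        # n_jobs=1 anywhere but distributed -> synchronous scheduler.
        return dask.local.get_sync
    get = dask.threaded.get if scheduler == 'threaded' else dask.multiprocessing.get
    # n_jobs maps to num_workers for these schedulers.
    return partial(get, num_workers=n_jobs)
```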