Add `scheduler` parameter

jcrist commented 7 years ago

Allow specifying the scheduler by name instead of passing in the get function directly. Scheduler can be one of:

None, to use the global scheduler, falls back to threading if not set (default)
A string specifying the scheduler name ("threading", "multiprocessing", or "sync")
A scheduler get function
Anything that can be passed to dask.distributed.Client.

Not fully set on this yet.

Pros:

Easier for less dask-savy users to specify different schedulers, feels a bit more like joblib.

Cons:

Doesn't match how schedulers are normally set in dask. Disconnect may feel foreign.

If we go this route, then I'd also add n_jobs as a parameter (matching scikit-learn), which would specify num_workers for the threading and multiprocessing schedulers, and be ignored by the others. Might also make n_jobs=1 for all but distributed result in the synchronous scheduler. Downside of supporting n_jobs here is we'd probably want the default to match what dask does (n_jobs = cpu_count()) instead of what scikit-learn does (n_jobs=1). I'm fine with this, but it is a difference.

If we don't go this route, then I might add a scheduler_kwargs parameter instead, which would be forwarded to the get call. Not sure if any of the other keyword arguments would prove useful for this library though.

mrocklin commented 7 years ago

We might also consider adopting a convention like this within Dask. The term get is a pretty bad name. It actually comes all the way from the first commit

jcrist commented 7 years ago

Ok, I've cleaned this up and added support for n_jobs, which has been asked for in the past.

A few potential issues:

This makes it a bit tricky to provide good error messages on bad parameters, since we just pass all unrecognized values to Client. I've tried to produce a nice error message from this, but there may be a better.
I'm also not 100% pleased with the scheduler names. Valid names are:
- threading
- multiprocessing
- sync
The first two are good, and match joblib but not the dask docs (we say threaded instead of threading). The third is an abbreviation, which may be undesired. Perhaps synchronous (a bit long though)? The equivalent in joblib is sequential. I don't think that's a good name here, as we don't enforce an ordering of execution (which sequential would seem to indicate).
We also have a different default (threading, n_jobs=cpu_count()) than scikit-learn (processes, n_jobs=1). I think this makes sense though.

mrocklin commented 7 years ago

I think that following joblib over dask names (like threading over threaded) makes sense for this interface. You could also support aliases so that all of sync, sequential, synchronous would map to the same value.

Actually, I like this idea because then we can establish these names in more parts of Dask where we might prefer threaded over threading.

scheduler_types = {
  'sync': 'synchronous',
  'sequential': 'synchronous',  # from joblib
  'threading': 'threaded',
  ...
}

scheduler = scheduler_types.get(scheduler, scheduler)

mrocklin commented 7 years ago

I would like to give people the option to provide names rather than get= functions.

jcrist commented 7 years ago

I liked the alias idea and added it. I think this is good to go. Merging.

dask / dask-searchcv

Add `scheduler` parameter #44