Difference between `dklearn.Pipeline` and `sklearn.Pipeline`

dask-searchcv is now part of dask-ml: https://github.com/dask/dask-ml

Difference between `dklearn.Pipeline` and `sklearn.Pipeline` #13

Closed bkj closed 7 years ago

bkj commented 7 years ago

What's the difference between dklearn.Pipeline and sklearn.Pipeline? Does dklearn cache the intermediate results?

Reason I ask -- I'm trying to fit a large number of sklearn Pipelines in parallel, comparing a computational graph built with dask against a hand-rolled (way less sophisticated) joblib implementation. dask is ~50% slower, and I wonder whether that's because joblib memmaps large objects to reduce communication overhead. Any thoughts? Happy to share code if it's useful.

mrocklin commented 7 years ago

Pipelines are inherently sequential, so on their own they are not very valuable. They become valuable when combined with other pipelines. When computing the same pipeline with many parameters you may see some benefits from de-duplication. This blogpost from @jcrist may be informative here.
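To make the de-duplication point concrete, here is a toy sketch in plain Python. It is not how dask or dklearn implement it; `fit_step`, `fit_pipeline`, and the step names are hypothetical stand-ins. Each step is keyed by its own parameter plus everything upstream of it, so a scaler shared by several candidates is fitted only once.

```python
from itertools import product

fit_calls = []  # records every step that is actually fitted


def fit_step(name, param):
    """Hypothetical stand-in for fitting one pipeline step."""
    fit_calls.append((name, param))
    return f"{name}({param})"


def fit_pipeline(steps, cache):
    """Fit the steps in order, reusing any prefix that is already cached."""
    key = ()
    for name, param in steps:
        # The key identifies this step together with all upstream steps.
        key = key + ((name, param),)
        if key not in cache:
            cache[key] = fit_step(name, param)
    return cache[key]


cache = {}
# 2 scaler settings x 3 model settings = 6 candidate pipelines
for scaler_param, model_param in product([1, 2], [0.1, 1.0, 10.0]):
    fit_pipeline([("scaler", scaler_param), ("model", model_param)], cache)

n_scaler_fits = sum(1 for name, _ in fit_calls if name == "scaler")
n_model_fits = sum(1 for name, _ in fit_calls if name == "model")
print(n_scaler_fits, n_model_fits)  # prints "2 6"
```

Without the cache the scaler would be fitted six times, once per candidate. With it, only the two distinct scaler settings are fitted, while all six models still are.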

I don't know why the dklearn pipeline would be slower in your case. If possible, I recommend running both systems with a single thread and then profiling with cProfile, snakeviz, or the `%prun` IPython magic.

dask.set_options(get=dask.async.get_sync)  # use a single threaded scheduler
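The one-liner above uses the dask API of the time. As a rough sketch for current versions, the single-threaded scheduler can be selected with `dask.config.set(scheduler="synchronous")`, and the workload can then be profiled with the standard-library `cProfile`. The `profile_call` helper and the `fit_everything` workload below are hypothetical placeholders, and the dask import is guarded so the profiling part runs even where dask is not installed.

```python
import cProfile
import io
import pstats

try:
    import dask

    # Current spelling of "use a single-threaded scheduler"
    dask.config.set(scheduler="synchronous")
except ImportError:
    dask = None  # the profiling below still works without dask


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Show the 10 entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buf.getvalue()


def fit_everything():
    # Hypothetical stand-in for "fit all pipelines"; swap in the real workload.
    return sum(i * i for i in range(100_000))


result, report = profile_call(fit_everything)
print(report)
```

Running the dask version and the joblib version through the same helper gives two reports that can be compared side by side.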

jcrist commented 7 years ago

Sorry for the lack of response here. This library has seen a lot of work over the last couple weeks.

- The library has been renamed to dask-searchcv and has been released on PyPI and (soon) conda-forge.
- The Pipeline class has been completely removed. The only public-facing API is GridSearchCV and RandomizedSearchCV.
- A lot of work has gone into improving graph building times. Building graphs for 500,000 candidates now completes in seconds (not that fitting this many candidates is a good idea).
- In my simple benchmarks this library should at worst be ~equivalent to scikit-learn's implementation. For certain grids/pipelines, we can be a lot faster. For more information, see this blogpost.

Please let us know if you try it and find it significantly slower than scikit-learn. There is no reason why we shouldn't be able to at least be equivalent in all cases.

ghost commented 7 years ago

@bkj I'm creating my own parallel pipeline and am curious what approach you took. To me there are two types of pipelines: 1) A large permutation of steps, i.e. scale, transform, then predict, or transform, scale, then predict. 2) A parameter search within a pipeline, i.e. try transform param 1, 2, 3, etc. Case 2 applies to GridSearchCV, while case 1 is what I'm looking to parallelize.
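The two cases can be sketched with the standard library alone. This is only an illustration of how the candidates differ, not anyone's actual implementation: the step names, the `transform__param` key, and the `evaluate` placeholder are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import permutations, product

preprocessing = ["scale", "transform"]  # hypothetical step names

# Case 1: every ordering of the preprocessing steps, each followed by "predict"
structures = [order + ("predict",) for order in permutations(preprocessing)]

# Case 2: one fixed structure, several parameter settings
param_grid = {"transform__param": [1, 2, 3]}  # hypothetical parameter name
names = sorted(param_grid)
settings = [
    dict(zip(names, values))
    for values in product(*(param_grid[n] for n in names))
]


def evaluate(structure):
    """Hypothetical placeholder: build, fit, and score one pipeline structure."""
    return " -> ".join(structure)


# Case 1 candidates are independent of one another, so they can be
# mapped over a pool of workers; pool.map keeps the input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(evaluate, structures))

print(results)
```

In a real setup `evaluate` would build and fit an actual pipeline for each ordering, and each ordering could in turn run its own case 2 search over `settings`.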