dask / dask-searchcv

dask-searchcv is now part of dask-ml: https://github.com/dask/dask-ml
BSD 3-Clause "New" or "Revised" License

Difference between `dklearn.Pipeline` and `sklearn.Pipeline` #13

Closed bkj closed 7 years ago

bkj commented 7 years ago

What's the difference between dklearn.Pipeline and sklearn.Pipeline? Does dklearn cache the intermediate results?

Reason I ask: I'm trying to fit a large number of sklearn Pipelines in parallel, comparing a computational graph built with dask against a hand-rolled (far less sophisticated) joblib implementation. dask is ~50% slower, and I wonder whether that's because joblib memmaps large objects to reduce communication overhead. Any thoughts? Happy to share code if it's useful.

mrocklin commented 7 years ago

Pipelines are inherently sequential, so a single pipeline on its own gains little from parallelism. They become valuable when combined with other pipelines. When computing the same pipeline with many parameters, you may see some benefit from de-duplication of the steps those pipelines share. This blogpost from @jcrist may be informative here.
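
To make the de-duplication point concrete, here is a minimal sketch in plain Python. It is not how dask or dask-searchcv implements this; `fit_step`, `fit_pipelines`, and the step names in `grid` are hypothetical stand-ins. The idea it illustrates is that two pipelines sharing the same leading steps and parameters can reuse the fitted result for that shared prefix.

```python
from itertools import product

fit_calls = []  # records every step that actually gets fitted


def fit_step(name, param, upstream):
    """Hypothetical stand-in for fitting one pipeline step on upstream output."""
    fit_calls.append((name, param))
    return (name, param, upstream)


def fit_pipelines(grid):
    """Fit every parameter combination, reusing shared pipeline prefixes.

    `grid` is an ordered list of (step_name, [parameter values]) pairs.
    """
    names = [name for name, _ in grid]
    cache = {}    # maps a prefix of parameter values to its fitted result
    results = []
    for combo in product(*(values for _, values in grid)):
        upstream = None
        for depth, (name, param) in enumerate(zip(names, combo)):
            key = combo[: depth + 1]  # parameters of all steps up to this one
            if key not in cache:
                cache[key] = fit_step(name, param, upstream)
            upstream = cache[key]
        results.append(upstream)
    return results


# Assumed example grid: 2 * 2 * 3 = 12 pipelines of 3 steps each.
grid = [("scale", [True, False]), ("pca", [2, 5]), ("clf", [0.1, 1.0, 10.0])]
fitted = fit_pipelines(grid)
print(len(fitted), "pipelines,", len(fit_calls), "step fits")
```

Fitting each of the 12 pipelines independently would take 36 step fits; with shared prefixes cached, this sketch performs 18 (2 for the first step, 4 for the second, 12 for the last).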

I don't know why the dklearn pipeline would be slower in your case. If possible I recommend trying out both systems with a single thread and then profiling with cProfile, snakeviz, or the prun ipython magic.

dask.config.set(scheduler="synchronous")  # use the single-threaded scheduler; older releases: dask.set_options(get=dask.async.get_sync)
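
A rough sketch of the profiling workflow suggested above, assuming a current dask release where the single-threaded scheduler is selected with `dask.config.set(scheduler="synchronous")`. The helper `profile_call`, the toy task `slow_square`, and `profile_dask_example` are hypothetical names introduced for illustration; the real workload would be the pipeline fits being compared.

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, top=10, **kwargs):
    """Run fn under cProfile; return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


def slow_square(i):
    # Hypothetical stand-in for an expensive task such as fitting a pipeline.
    return sum(i * i for _ in range(1000))


def profile_dask_example():
    """Profile a small dask graph on the single-threaded scheduler."""
    import dask  # third-party; imported here so profile_call works without it

    tasks = [dask.delayed(slow_square)(i) for i in range(50)]
    total = dask.delayed(sum)(tasks)
    # With one thread, the profile shows the tasks themselves rather than
    # time spent waiting on worker threads.
    with dask.config.set(scheduler="synchronous"):
        return profile_call(total.compute)
```

Calling `profile_dask_example()` returns the computed value together with a report sorted by cumulative time; running the same helper on the joblib version gives a comparable report for the other side.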

jcrist commented 7 years ago

Sorry for the lack of response here. This library has seen a lot of work over the last couple of weeks.

Please let us know if you try it and find it significantly slower than scikit-learn. There is no reason why we shouldn't be able to at least match scikit-learn's performance in all cases.

ghost commented 7 years ago

@bkj I'm creating my own parallel pipeline and am curious what approach you took. To me there are two types of pipelines: 1) A large permutation of steps, i.e. scale, transform, then predict, or transform, scale, then predict. 2) A parameter search within a pipeline, i.e. try transform param 1, 2, 3, etc. Case 2 applies to GridSearchCV, while case 1 is the one I'm looking to parallelize.
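
The two cases can be sketched with the standard library alone. The step names and the parameter name `transform__param` are made up for illustration and do not correspond to any particular estimator.

```python
from itertools import permutations, product

# Case 1: try every ordering of the preprocessing steps, then predict.
preprocessing = ["scale", "transform"]
step_orders = [list(order) + ["predict"] for order in permutations(preprocessing)]

# Case 2: keep the step order fixed and vary parameters within it.
param_grid = {"transform__param": [1, 2, 3]}  # hypothetical parameter name
param_settings = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

for order in step_orders:
    print("order:", " -> ".join(order))
for setting in param_settings:
    print("params:", setting)
```

Each entry in `step_orders` describes a structurally different pipeline, whereas each entry in `param_settings` is one candidate for a grid search over a single pipeline.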