Pipelines are inherently sequential, so on their own they are not very valuable. They become valuable when combined with other pipelines. When computing the same pipeline with many parameters, you may see some benefit from de-duplication of shared steps. This blog post from @jcrist may be informative here.
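Roughly how that de-duplication works, sketched with `dask.delayed` rather than dklearn's actual machinery:

```python
# Sketch of graph-level de-duplication with dask.delayed (illustrative,
# not dklearn's internals). With pure=True, identical calls hash to the
# same task key, so the shared `scale` step appears once in the graph.
import numpy as np
import dask
from dask import delayed

def scale(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)

def fit_model(X_scaled, alpha):
    return ("fitted", alpha)  # stand-in for a real estimator fit

X = np.random.rand(1000, 10)

# Each candidate builds its own scale(X) call, but the calls collapse
# into a single task because they hash identically.
fits = [delayed(fit_model, pure=True)(delayed(scale, pure=True)(X), a)
        for a in (0.1, 1.0, 10.0)]

results = dask.compute(*fits)  # `scale` executes only once
```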
I don't know why the dklearn pipeline would be slower in your case. If possible, I recommend trying out both systems with a single thread and then profiling with cProfile, snakeviz, or the `%prun` IPython magic:
```python
dask.set_options(get=dask.async.get_sync)  # use a single-threaded scheduler
```
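For what it's worth, a minimal profiling session along those lines might look like this (`fit_pipeline` is a hypothetical stand-in for whichever fitting code you are benchmarking):

```python
# Minimal single-threaded profiling session (older dask API, matching the
# line above).
import cProfile
import pstats

import dask
import dask.async
from dask import delayed

dask.set_options(get=dask.async.get_sync)  # single-threaded scheduler

def fit_pipeline():
    # replace with the dklearn / scikit-learn fit being compared
    tasks = [delayed(sum)(range(n)) for n in range(1000)]
    return dask.compute(*tasks)

cProfile.run("fit_pipeline()", "fit_profile.out")
pstats.Stats("fit_profile.out").sort_stats("cumulative").print_stats(20)
# In IPython/Jupyter, `%prun fit_pipeline()` gives the same view inline.
```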
Sorry for the lack of response here. This library has seen a lot of work over the last couple weeks.
It is now `dask-searchcv`, and has been released on PyPI and (soon) conda-forge. The `Pipeline` class is completely removed; the only public-facing API is `GridSearchCV` and `RandomizedSearchCV`. Please let us know if you try it and find it significantly slower than scikit-learn. There is no reason why we shouldn't be able to at least be equivalent in all cases.
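For anyone landing here later, a minimal usage sketch (the toy data and estimators are illustrative):

```python
# dask-searchcv's GridSearchCV is meant as a drop-in replacement for
# scikit-learn's.
from dask_searchcv import GridSearchCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression())])

# Shared pipeline prefixes (e.g. the scaling step per CV fold) are
# computed once in the dask graph rather than once per candidate.
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]})
search.fit(X, y)
print(search.best_params_)
```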
@bkj I'm creating my own parallel pipeline and am curious what approach you took. To me there are two types of pipelines: 1) a large permutation of steps, i.e. scale, transform, then predict vs. transform, scale, then predict; 2) a parameter search within a pipeline, i.e. try transform param 1, 2, 3, etc. Case 2 applies to `GridSearchCV`, while case 1 is what I'm looking to parallelize.
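One way case 1 might be parallelized, sketched with `joblib` and illustrative steps (`scale`/`pca` are assumptions, not from this thread):

```python
# Sketch of "case 1": fitting structural permutations of a pipeline in
# parallel.
from itertools import permutations

from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

steps = [("scale", StandardScaler()), ("pca", PCA(n_components=5))]

def fit_order(X, y, order):
    # clone() gives each parallel fit fresh, independent estimators
    pipe = Pipeline([(name, clone(est)) for name, est in order]
                    + [("clf", LogisticRegression())])
    return [name for name, _ in order], pipe.fit(X, y).score(X, y)

# Each ordering (scale->pca vs pca->scale) is an independent task.
results = Parallel(n_jobs=-1)(
    delayed(fit_order)(X, y, order) for order in permutations(steps)
)
print(results)
```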
What's the difference between `dklearn.Pipeline` and `sklearn.Pipeline`? Does `dklearn` cache the intermediate results?

Reason I ask -- I'm trying to fit a large number of `sklearn` pipelines in parallel, comparing building the computational graph w/ `dask` against a handrolled (way less sophisticated) `joblib` implementation. `dask` is ~50% slower, and I wonder whether that's a result of the fact that `joblib` memmaps large objects to reduce communication overhead. Any thoughts? Happy to share code if it's useful.
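For context, the handrolled `joblib` approach might look roughly like this (the pipeline and parameter values are illustrative); the memmapping is the communication saving mentioned:

```python
# Rough sketch of a hand-rolled joblib parameter sweep. Array arguments
# larger than max_nbytes are dumped to disk once and memmapped into each
# worker process instead of being serialized per task.
import numpy as np
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100000, 50)  # large enough for memmapping to matter
y = (np.random.rand(100000) > 0.5).astype(int)

base = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression())])

def fit_one(X, y, C):
    pipe = clone(base).set_params(clf__C=C)
    return C, pipe.fit(X, y)

fitted = Parallel(n_jobs=-1, max_nbytes="16M", mmap_mode="r")(
    delayed(fit_one)(X, y, C) for C in [0.01, 0.1, 1.0, 10.0]
)
```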