Closed data-steve closed 7 years ago
I don't think that dask.dataframe has anything to do with the dask-learn project. Dask-learn strictly handles model parallelism on objects that satisfy the sklearn API. Does this answer your question @data-steve ?
Sorry for the lack of response here. You can pass any dask object (e.g. array/dataframe/delayed object) to the `*SearchCV.fit` method, but the enclosed estimator methods will only receive the "computed" version of that object. So if you pass in a `dask.dataframe.DataFrame`, your `fit`/`transform` methods will get a pandas dataframe. In general this library is for fitting many, many models on small-to-medium data, so this isn't seen as a problem, as the benefit of using dask for data-parallelism in these cases is small.
What constitutes small-to-medium data for dask? In terms of rows, columns, or both?
(Sent via email in reply to Jim Crist's comment of Apr 5, 2017.)
I guess what I mean here is "anything you'd use scikit-learn in memory with". We don't do anything to parallelize across data or do anything out-of-core, we just parallelize across fitting multiple estimators. What that means is computation dependent.
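To illustrate "parallelize across fitting multiple estimators," here is a hedged sketch using scikit-learn's own `GridSearchCV`, whose API dask-searchcv mirrors: each parameter combination (times each CV fold) is an independent fit on the same small in-memory dataset, and it is those independent fits that get scheduled in parallel. The dataset, estimator, and grid below are illustrative.

```python
# One small in-memory dataset, many independent fits: 3 candidates x 3
# CV folds = 9 model fits. dask-searchcv parallelizes across these fits;
# it does not parallelize across the data itself.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
param_grid = {"C": [0.1, 1.0, 10.0]}

search = GridSearchCV(LogisticRegression(), param_grid, cv=3)
search.fit(X, y)  # data stays in memory; each candidate is fit separately
print(search.best_params_)
```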
I've created some custom sklearn transformers that I'm putting into a pipeline. These custom transformers take a pandas object and apply a function over it, for example extracting text from HTML strings to pass into a CountVectorizer.
I realize that dask's apply can be swapped in generically for pandas apply, especially if I convert the data to dask dataframes. But since these are in an sklearn pipeline, I wasn't sure how much parallelism I'd get.
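For concreteness, here is a hedged sketch of the kind of setup described above: a custom transformer that applies a function over a pandas Series of HTML strings and feeds the extracted text into a `CountVectorizer`. The `strip_tags` helper, class name, and sample documents are all illustrative assumptions, not code from this issue.

```python
# Sketch of a custom transformer in an sklearn pipeline that uses
# pandas .apply to extract text from HTML strings before vectorizing.
import re
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

def strip_tags(html):
    """Crude tag removal, for illustration only."""
    return re.sub(r"<[^>]+>", " ", html)

class HtmlTextExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # X arrives as a pandas Series; .apply runs strip_tags per element.
        return X.apply(strip_tags)

pipe = Pipeline([
    ("extract", HtmlTextExtractor()),
    ("vectorize", CountVectorizer()),
])

docs = pd.Series(["<p>hello world</p>", "<div>dask and pandas</div>"])
counts = pipe.fit_transform(docs)  # sparse document-term matrix
print(counts.shape)
```

Per the answers above, swapping a dask `.apply` into `transform` would buy little here: by the time `transform` runs inside a `*SearchCV` fit, the data has already been computed to pandas.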