re Custom Sklearn Transforms with Dask Apply inside

dask / dask-searchcv

dask-searchcv is now part of dask-ml: https://github.com/dask/dask-ml

BSD 3-Clause "New" or "Revised" License


re Custom Sklearn Transforms with Dask Apply inside #30

Status: Closed (closed by data-steve 7 years ago)

data-steve commented 7 years ago

I've created some custom Sklearn transforms that I'm putting into a pipeline. These custom transforms take a pandas object and apply a function over it, as in this example, which extracts text from HTML strings to pass into a CountVectorizer.

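import lxml.html
from sklearn.base import BaseEstimator, TransformerMixin


class GetText(BaseEstimator, TransformerMixin):
    """Extract the text content from each HTML string in a pandas Series."""

    def get_text(self, html_string):
        return lxml.html.document_fromstring(html_string).text_content()

    def fit(self, X, y=None):
        # Stateless transformer: there is nothing to learn.
        return self

    def transform(self, X):
        return X.apply(self.get_text)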

I realize that dask's apply can generally be swapped in for pandas apply, especially if I convert the data to dask dataframes. But since these transforms sit inside an sklearn pipeline, I wasn't sure how much parallelism I'd get.

mrocklin commented 7 years ago

I don't think that dask.dataframe has anything to do with the dask.learn project. Dask-learn strictly handles model parallelism on objects that satisfy the sklearn API. Does this answer your question @data-steve ?

jcrist commented 7 years ago

Sorry for the lack of response here. You can pass any dask object (e.g. array/dataframe/delayed object) to the *SearchCV.fit method, but the enclosed estimator methods will only receive the "computed" version of that object. So if you pass in a dask.dataframe.DataFrame, your fit/transform methods will get a pandas dataframe. In general this library is for fitting many many models on small-medium data, so this isn't seen as a problem, as the benefit of using dask for data-parallelism in these cases is small.

data-steve commented 7 years ago

What constitutes small to medium data for dask? In terms of rows, columns, or both?

~ Steve

jcrist commented 7 years ago

I guess what I mean here is "anything you'd use scikit-learn in memory with". We don't do anything to parallelize across data or do anything out-of-core, we just parallelize across fitting multiple estimators. What that means is computation dependent.
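A rough sketch of "parallelize across fitting multiple estimators", using only the standard library rather than dask. This is not how dask-searchcv is implemented; ThresholdClassifier, fit_and_score, and the tiny dataset are all made up for illustration. The point is that every task receives the same full in-memory dataset and only the parameters differ.

```python
from concurrent.futures import ThreadPoolExecutor


class ThresholdClassifier:
    """Toy estimator (hypothetical): predicts 1 when x > threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def fit(self, X, y):
        return self  # nothing to learn in this toy example

    def score(self, X, y):
        preds = [int(x > self.threshold) for x in X]
        return sum(p == t for p, t in zip(preds, y)) / len(y)


def fit_and_score(params, X, y):
    # Each task sees the whole dataset; the data itself is never split.
    model = ThresholdClassifier(**params).fit(X, y)
    return params, model.score(X, y)


X = [0.1, 0.4, 0.6, 0.9]
y = [0, 0, 1, 1]
grid = [{"threshold": t} for t in (0.0, 0.5, 1.0)]

# One task per parameter candidate: parallelism across models, not data.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda p: fit_and_score(p, X, y), grid))

best_params, best_score = max(results, key=lambda r: r[1])
```

Because the dataset is passed whole to every task, it has to fit comfortably in memory, which is one way to read "small-medium data" here.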