dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Samplers / pipelines for imbalanced datasets #317

Open TomAugspurger opened 6 years ago

TomAugspurger commented 6 years ago

Imbalanced datasets, where the classes occur at very different rates, show up frequently in large-scale problems.

There are many strategies for dealing with imbalanced data. http://contrib.scikit-learn.org/imbalanced-learn/stable/api.html implements a set of them, some of which could be scaled to large datasets with Dask.

sephib commented 4 years ago

Hi, I think most of the change would be introducing support for fit_resample and fit_sample in the fit_transform method.
I'll be happy to assist on this issue.

TomAugspurger commented 4 years ago

@sephib do you have any examples of fit_resample and fit_sample? I'm not familiar with them.

sephib commented 4 years ago

The core fit_resample function is from within imblearn/base.py.
It is incorporated throughout the imblearn library - for example here is the implementation within imblearn pipeline

TomAugspurger commented 4 years ago

Thanks. The standard sklearn.pipeline.Pipeline works well with dask containers. Does the one in imblearn work with Dask objects? If not, what breaks?


sephib commented 4 years ago

Currently, when I run dask-ml with an imblearn pipeline I get an error:

AttributeError: 'FunctionSampler' object has no attribute 'transform'

This comes from the fit_transform function in dask_ml/model_selection/methods.py, which looks for a fit_transform attribute, or fit and transform attributes (which in imblearn are "converted" to fit_resample).
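The attribute-based dispatch described above can be sketched like this (a hypothetical helper, not dask-ml's actual code): an estimator that only exposes fit_resample falls through the transform checks and raises.

```python
def fit_transform_step(est, X, y):
    """Sketch of attribute-based dispatch for one pipeline step."""
    if hasattr(est, "fit_transform"):
        return est.fit_transform(X, y)
    if hasattr(est, "fit") and hasattr(est, "transform"):
        return est.fit(X, y).transform(X)
    if hasattr(est, "fit_resample"):  # the branch imblearn samplers need
        return est.fit_resample(X, y)
    # Without the branch above, imblearn samplers end up here.
    raise AttributeError(
        f"{type(est).__name__!r} object has no attribute 'transform'"
    )

class DummySampler:
    """Mimics an imblearn sampler: exposes only fit_resample."""
    def fit_resample(self, X, y):
        return X, y
```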

TomAugspurger commented 4 years ago

It would help to have a minimal example: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

sephib commented 4 years ago

Hi. Here is a sample code that passes dask_ml/model_selection/methods.py. Unfortunately it still does not pass imblearn/base.py, but I think that may be something with the example.

when amending the file with

from imblearn.pipeline import Pipeline

instead of

from sklearn.pipeline import Pipeline

and adding these lines to the fit_transform function after line 260:

elif hasattr(est, "fit_resample"):
    Xt = est.fit_resample(X, y, **fit_params)

from sklearn.model_selection import train_test_split as tts
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier as KNN
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import (EditedNearestNeighbours,
                                     RepeatedEditedNearestNeighbours)
import dask_ml.model_selection as dcv
from sklearn.model_selection import GridSearchCV

# Generate the dataset
X, y = make_classification(n_classes=2, class_sep=1.25, weights=[0.3, 0.7],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=5, n_clusters_per_class=1,
                           n_samples=5000, random_state=10)

# Instantiate a PCA object for the sake of easy visualisation
pca = PCA(n_components=2)

# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()

# Create the classifier
knn = KNN(n_neighbors=1)

# Make the splits
X_train, X_test, y_train, y_test = tts(X, y, random_state=42)

# Add one transformers and two samplers in the pipeline object
pipeline = make_pipeline(pca, enn, renn, knn)
param_grid = {"pca__n_components": [1, 2, 3]}

# grid = GridSearchCV(pipeline, param_grid=param_grid)
grid = dcv.GridSearchCV(pipeline, param_grid=param_grid)

grid.fit(X_train, y_train)

Any inputs would be appreciated

TomAugspurger commented 4 years ago

Thanks. So the issue is with dask_ml.model_selection.GridSearchCV? I'm confused about how this would work with scikit-learn, since (AFAIK) fit_resample isn't part of their API.

sephib commented 4 years ago

That's the magic of imblearn.pipeline (if you comment out the dcv.GridSearchCV line and un-comment the sklearn GridSearchCV one, the code runs without any errors).

TomAugspurger commented 4 years ago

I don't really see how that would work. But feel free to propose changes in a PR and we can discuss that there.


glemaitre commented 3 years ago

@TomAugspurger

I started a POC to adapt our RandomUnderSampler to support natively dask array and dataframe (in/out): https://github.com/scikit-learn-contrib/imbalanced-learn/pull/777

I think we can do something similar for both RandomOverSampler and ClusterCentroids. Neither relies on kNN, which makes it possible to work in a distributed setting. The other methods rely on kNN, and I am not sure it would be easy to do anything there.

Regarding the integration with the imbalanced-learn Pipeline, our implementation is exactly scikit-learn's, except that we check whether a sampler is within the pipeline. This check looks for the attribute fit_resample, which is applied only during fit of the pipeline. Thus, I would say you can safely use imblearn.Pipeline in place of sklearn.Pipeline.

I was wondering if you would have a bit of time just to check if, on the dask part, we don't implement something stupid in the above PR (I am not super familiar yet with distributed computation).

sephib commented 3 years ago

> Regarding the integration with the imbalanced-learn Pipeline, our implementation is exactly scikit-learn's, except that we check whether a sampler is within the pipeline. This check looks for the attribute fit_resample, which is applied only during fit of the pipeline. Thus, I would say you can safely use imblearn.Pipeline in place of sklearn.Pipeline.

@TomAugspurger is a PR still relevant? If so, I'll be happy to get some guidance.

TomAugspurger commented 3 years ago

I'm not sure what's required, but perhaps imbalanced-learn's Pipeline will just be able to accept Dask collections after that pull request? I don't know what estimators like GridSearchCV need to do (if anything) to work with imbalanced-learn pipelines.


sephib commented 3 years ago

I guess we can see how @glemaitre's PR goes and then see if there is anything else to do on the dask-ml side.

vishalvvs commented 2 years ago

Does imblearn support Dask natively? I have been using joblib with parallel_backend="dask" for it, but it seems it is not able to parallelize my tasks.
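For reference, a sketch of the setup described above. Note that the dask joblib backend only distributes work for estimators that already use joblib internally (e.g. anything with an n_jobs parameter); it does not make imblearn's samplers dask-aware, which may explain the lack of parallelism:

```python
import joblib
# Importing dask.distributed registers the "dask" joblib backend.
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

client = Client(processes=False)  # lightweight in-process cluster for the demo

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=20, n_jobs=-1, random_state=0)

# Tree fitting (joblib-parallel inside scikit-learn) is routed to dask.
with joblib.parallel_backend("dask"):
    clf.fit(X, y)

client.close()
```

A sampler's fit_resample call inside a pipeline runs outside joblib, so this backend switch has no effect on it.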

Jose-Bastos commented 5 months ago

Any updates on this? For example, could I use RandomOverSampler if I applied @glemaitre's PR with minor changes? Thank you in advance!