analyticalmindsltd / smote_variants

A collection of 85 minority oversampling techniques (SMOTE) for imbalanced learning with multi-class oversampling and model selection features
http://smote-variants.readthedocs.io
MIT License
625 stars 138 forks

How do I use it along sklearn's Pipeline? #8

Closed roperi closed 5 years ago

roperi commented 5 years ago

This is my use case:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33)

oversampler = sv.MulticlassOversampling(sv.distance_SMOTE())
X_train_resamp, y_train_resamp = oversampler.sample(X_train, y_train)

...and I get this error: TypeError: only integer scalar arrays can be converted to a scalar index

That's because my X is a list of texts, and I would like to leave it that way as opposed to using a tf-idf frequency matrix.

Is it possible to use sklearn's Pipeline to avoid the error? If so, how do I integrate it into the following Pipeline?

model = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

gykovacs commented 5 years ago

Hi @h-2-0,

For your first piece of code, the answer is yes: oversampling methods work with numerical arrays only, so you can't apply them to lists of strings.

Regarding the sklearn Pipeline, it is a very interesting problem. The point is that oversampling changes the training set: by the oversampling principle, the number of vectors coming out of oversampling is higher than the number of vectors fed in. The problem is that, as far as I know, the sklearn Pipeline is not prepared for a changing number of training vectors. Every TransformerMixin object that can appear in a Pipeline must return the same number of vectors as it receives as input; hence the name Transformer: they only transform, they do not oversample. Consequently, in its present form, the principle of oversampling is not compatible with the sklearn Pipeline mechanism. There is actually an ongoing debate about this; as soon as it reaches some conclusion, the smote_variants package will be updated accordingly. You can find the details of the discussion here: https://github.com/scikit-learn/scikit-learn/issues/3855
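To see concretely why oversampling breaks the Transformer contract, here is a minimal sketch (pure NumPy, a naive random duplicator, not the smote_variants implementation): it returns more rows than it receives, which no TransformerMixin is allowed to do.

```python
import numpy as np

def random_oversample(X, y):
    """Naively balance classes by duplicating minority samples.
    Illustration only -- SMOTE interpolates new samples instead of duplicating."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    rng = np.random.default_rng(42)
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        if count < target:
            # duplicate randomly chosen minority samples until the class is balanced
            idx = rng.choice(np.where(y == cls)[0], size=target - count)
            X_parts.append(X[idx])
            y_parts.append(y[idx])
    return np.vstack(X_parts), np.concatenate(y_parts)

X = np.random.rand(10, 3)
y = np.array([0] * 8 + [1] * 2)      # imbalanced: 8 vs 2
X_res, y_res = random_oversample(X, y)
print(X_res.shape)                   # (16, 3) -- more rows out than in
```

Since a Pipeline transformer must emit exactly as many rows as it consumes, a step like this cannot be slotted in as a plain transformer.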

On the other hand, there is a workaround. If you create a classifier object which incorporates oversampling, it can be passed to Pipeline. Let me sketch it below:

from sklearn.base import BaseEstimator, ClassifierMixin

class OversamplingClassifier(BaseEstimator, ClassifierMixin):
    """Wraps an oversampler and a classifier into a single sklearn estimator."""

    def __init__(self, oversampler, classifier):
        self.oversampler = oversampler
        self.classifier = classifier

    def fit(self, X, y=None):
        # oversample the training set, then fit the classifier on the result
        X_samp, y_samp = self.oversampler.sample(X, y)
        self.classifier.fit(X_samp, y_samp)
        return self

    def predict(self, X):
        return self.classifier.predict(X)

    def predict_proba(self, X):
        return self.classifier.predict_proba(X)

    def get_params(self, deep=True):
        return {'oversampler': self.oversampler, 'classifier': self.classifier}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

with a class like this properly implemented (the above is only a sketch),

model = Pipeline([('tfidf', TfidfVectorizer()),
                  ('clf', OversamplingClassifier(sv.MulticlassOversampling(sv.distance_SMOTE()),
                                                 LinearSVC()))])

should work.
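As a quick sanity check of the wrapper pattern, the sketch below uses duck-typed stand-ins (hypothetical `DoublingOversampler` and `CountingClassifier`, instead of smote_variants and sklearn, so it runs standalone) to show that the inner classifier is fitted on the enlarged, resampled set while predict passes through untouched:

```python
class DoublingOversampler:
    """Hypothetical stand-in for sv.MulticlassOversampling: just duplicates rows."""
    def sample(self, X, y):
        return X + X, y + y

class CountingClassifier:
    """Stand-in classifier that records how many samples it was fitted on."""
    def fit(self, X, y):
        self.n_seen_ = len(X)
        return self
    def predict(self, X):
        return [0] * len(X)

class OversamplingClassifier:
    """Same wrapper as above, repeated here so the sketch runs standalone."""
    def __init__(self, oversampler, classifier):
        self.oversampler = oversampler
        self.classifier = classifier
    def fit(self, X, y=None):
        X_samp, y_samp = self.oversampler.sample(X, y)
        self.classifier.fit(X_samp, y_samp)
        return self
    def predict(self, X):
        return self.classifier.predict(X)

model = OversamplingClassifier(DoublingOversampler(), CountingClassifier())
model.fit([[0], [1], [2]], [0, 0, 1])
print(model.classifier.n_seen_)   # 6 -- the classifier saw the resampled set
```

The oversampling happens inside fit(), so from the Pipeline's point of view the step is an ordinary classifier and the Transformer contract is never violated.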

roperi commented 5 years ago

Hi, @gykovacs

Thanks for your response and, in general, for your hard work creating smote variants!

I haven't tried your proposal yet as I needed a quick and dirty way to model some data. I ended up using imbalanced-learn's pipeline to be able to get prediction values as well.

I will attempt the following in my next iteration, using sklearn's own Pipeline, based on your suggestions:


model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', CalibratedClassifierCV(base_estimator=OversamplingClassifier(sv.MulticlassOversampling(sv.distance_SMOTE()),
                                                                         LinearSVC()), cv=5))
])
gykovacs commented 5 years ago

Hi @h-2-0,

cool, let me know if it works, and also please let me know if some additional feature would facilitate the use of smote_variants. In the next couple of days I'll try to find some time to add something like OversamplingClassifier to smote_variants to aid integration with sklearn's Pipeline.

roperi commented 5 years ago

Hello again, @gykovacs !

I can't think of anything to improve in smote_variants at the moment, since I'm a newbie in sklearn and in NLP in general. But one thing I noticed during my research on how to solve my imbalanced-classes problem (using Python, sklearn, and SMOTE) is that most of the articles, StackOverflow, Quora, and Reddit posts, and other blog posts showing up in Google search results point to imbalanced-learn. It was only by accident, through one of your YouTube videos, that I learned about smote_variants. Perhaps you or any of your associates could start posting answers and contributions to people's questions on all these sites (Medium, Quora, Reddit, StackOverflow, etc.) to build awareness and drive traffic to your GitHub project. The more people (and better qualified than me) use smote_variants, the more feedback you'll get. I don't see why newcomers wouldn't use smote_variants, since it is comprehensive and awesome! But that's just my opinion anyway.

Thanks again!

gykovacs commented 5 years ago

Hi @h-2-0,

thanks, I'm happy to hear that you find it useful. Answering questions on Quora and the like is a great idea! Also, please spread the word! :)

In the meantime, I have added the OversamplingClassifier feature to the package; you can find some corresponding sample code here: https://github.com/gykovacs/smote_variants/blob/master/examples/008_sklearn.py

I'm working on pushing the changes to the PyPi repo.