Hi @h-2-0,
regarding your first piece of code: yes, oversampling methods work with numerical arrays only; you cannot apply them to lists of strings.
The sklearn Pipeline question is a very interesting problem. By the very principle of oversampling, the training set changes: the number of vectors coming out of oversampling is higher than the number of vectors fed in. The problem is that, as far as I know, the sklearn Pipeline is not prepared for a changing number of training vectors. All TransformerMixin objects appearing in a Pipeline must return the same number of vectors as they receive as input. Hence the name Transformer: they do transformation only, not oversampling. Consequently, in its present form, the principle of oversampling is not compatible with the sklearn Pipeline mechanism. There is an ongoing debate about this; as soon as it reaches a conclusion, the smote_variants package will be updated accordingly. You can find the details of the discussion here: https://github.com/scikit-learn/scikit-learn/issues/3855
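To make the point concrete, here is a toy illustration (random data, illustrative output) of why an oversampler cannot act as a Transformer:
```python
import numpy as np
import smote_variants as sv

# toy imbalanced dataset: 90 majority samples vs. 10 minority samples
X = np.random.rand(100, 2)
y = np.array([0] * 90 + [1] * 10)

oversampler = sv.distance_SMOTE()
X_samp, y_samp = oversampler.sample(X, y)

# the output is larger than the input, which is exactly what a
# TransformerMixin.transform is not allowed to produce
print(len(X), '->', len(X_samp))  # e.g. 100 -> 180
```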
On the other hand, there is a workaround. If you create a classifier object which incorporates oversampling, it can be passed to Pipeline. Let me sketch it below:
```python
from sklearn.base import BaseEstimator, ClassifierMixin

class OversamplingClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, oversampler, classifier):
        self.oversampler = oversampler
        self.classifier = classifier

    def fit(self, X, y):
        # oversample the training set, then fit the wrapped classifier
        X_samp, y_samp = self.oversampler.sample(X, y)
        self.classifier.fit(X_samp, y_samp)
        return self

    def predict(self, X):
        return self.classifier.predict(X)

    def predict_proba(self, X):
        return self.classifier.predict_proba(X)

    def get_params(self, deep=True):
        return {'oversampler': self.oversampler, 'classifier': self.classifier}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self
```
With a class like this properly implemented (the above is only a sketch),
```python
model = Pipeline([('tfidf', TfidfVectorizer()),
                  ('clf', OversamplingClassifier(
                              sv.MulticlassOversampling(sv.distance_SMOTE()),
                              LinearSVC()))])
```
should work.
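For example, a fit/predict round trip on toy data could look like the following (hypothetical documents and labels; note that the oversamplers typically expect dense arrays, so a densifying step between the vectorizer and the classifier may be necessary):
```python
from sklearn.preprocessing import FunctionTransformer

# toy corpus, for illustration only
texts = ['good product', 'bad product', 'great value',
         'terrible quality', 'excellent service', 'awful experience']
labels = [1, 0, 1, 0, 1, 0]

# optional step densifying the sparse tf-idf matrix for the oversampler
densify = FunctionTransformer(lambda X: X.toarray(), accept_sparse=True)

model = Pipeline([('tfidf', TfidfVectorizer()),
                  ('densify', densify),
                  ('clf', OversamplingClassifier(
                              sv.MulticlassOversampling(sv.distance_SMOTE()),
                              LinearSVC()))])

model.fit(texts, labels)
print(model.predict(['really great product']))
```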
Hi, @gykovacs
Thanks for your response and, in general, for your hard work creating smote_variants!
I haven't tried your proposal yet as I needed a quick and dirty way to model some data. I ended up using imbalanced-learn's pipeline to be able to get prediction values as well.
I will attempt the following in my next iteration using Sklearn's own Pipeline based on your suggestions:
```python
model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', CalibratedClassifierCV(
                base_estimator=OversamplingClassifier(
                    sv.MulticlassOversampling(sv.distance_SMOTE()),
                    LinearSVC()),
                cv=5))
])
```
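The reason for CalibratedClassifierCV is that LinearSVC has no predict_proba of its own; the calibration wrapper adds it. A hypothetical round trip, with placeholder training data:
```python
# X_train_texts (raw documents) and y_train are placeholders here
model.fit(X_train_texts, y_train)

# the calibrated wrapper exposes predict_proba even though LinearSVC
# itself does not, so prediction probabilities become available
probs = model.predict_proba(['some new document'])
```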
Hi @h-2-0,
cool, let me know if it works, and also please let me know if some additional feature would facilitate the use of smote_variants. In the next couple of days I'll try to find some time to add something like OversamplingClassifier to smote_variants to aid integration with the sklearn Pipeline.
Hello again, @gykovacs!
I can't think of anything to improve in smote_variants at the moment, since I'm a newbie in sklearn and in NLP in general. One thing I did notice during my research on how to solve my imbalanced-classes problem (using Python, sklearn and the SMOTE technique) is that most of the articles, StackOverflow, Quora or Reddit posts, and other blog posts showing up in the Google search results point to the use of imbalanced-learn. It was only by accident, through one of your Youtube videos, that I learned about smote_variants. Perhaps you or any of your associates could start posting answers, contributions or posts to people's questions on all these sites (Medium, Quora, Reddit, StackOverflow, etc.) to build awareness and drive traffic to your GitHub project. The more (and better qualified) people use smote_variants, the more feedback you'll get. I don't see why newcomers wouldn't use smote_variants, since it is comprehensively awesome! But that's just my opinion anyway.
Thanks again!
Hi @h-2-0,
thanks, I'm happy to hear that you find it useful. Answering questions in Quora and alike is a great idea! Also, please spread the word! :)
In the meantime, I added the OversamplingClassifier feature to the package; you can find some corresponding sample code here:
https://github.com/gykovacs/smote_variants/blob/master/examples/008_sklearn.py
I'm working on pushing the changes to the PyPI repo.
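A minimal sketch of wiring it into a Pipeline, assuming the class is exposed at the package level as in the linked example (the exact import path may differ between versions):
```python
import smote_variants as sv
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# the import path of OversamplingClassifier is assumed; check the linked
# example script for the version you have installed
model = Pipeline([('tfidf', TfidfVectorizer()),
                  ('clf', sv.OversamplingClassifier(
                              sv.MulticlassOversampling(sv.distance_SMOTE()),
                              LinearSVC()))])
```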
This is my use case:
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33)

oversampler = sv.MulticlassOversampling(sv.distance_SMOTE())
X_train_resamp, y_train_resamp = oversampler.sample(X_train, y_train)
```
...and I get this error:
```
TypeError: only integer scalar arrays can be converted to a scalar index
```
That's because my X is a list of texts, and I would like to leave it that way, as opposed to converting it to a tf-idf frequency matrix.
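In other words, what I would like to avoid is vectorizing up front, along these lines (a rough sketch reusing the names from my snippet above):
```python
# vectorize first so the oversampler receives numerical input; toarray()
# densifies the matrix, memory permitting
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train).toarray()
X_train_resamp, y_train_resamp = oversampler.sample(X_train_vec, y_train)
```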
Is it possible to use sklearn's Pipeline to avoid the error? If so, how do I integrate it into the following Pipeline?
```python
model = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
```