analyticalmindsltd / smote_variants

A collection of 85 minority oversampling techniques (SMOTE) for imbalanced learning with multi-class oversampling and model selection features
http://smote-variants.readthedocs.io
MIT License

Why do I get this error when I use smote_variants? #43

Open ppleumyy opened 2 years ago

ppleumyy commented 2 years ago

This is my code:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import smote_variants as sv

# Bag-of-words counts, then term-frequency scaling (no IDF)
vectorCount = CountVectorizer(tokenizer=tokenize)
X_trainCount = vectorCount.fit_transform(X_train)

tf_transformer = TfidfTransformer(use_idf=False)
tf_transformer.fit(X_trainCount)
X_trainTF = tf_transformer.transform(X_trainCount)

oversampler = sv.MulticlassOversampling(sv.distance_SMOTE())
X_res, y_res = oversampler.sample(X_trainTF, y_train)

and I get this error:

ValueError: provided out is the wrong size for the reduction

gykovacs commented 2 years ago

Could you share the dimensions of X_trainTF and y_train?

ppleumyy commented 2 years ago

(4621, 2134) , (4621,)

@gykovacs

gykovacs commented 2 years ago

Interesting. Which versions of Python and numpy are you using? There may have been changes in the latest versions that have not been checked yet. (The tests were run up to Python 3.9; I should cover the most recent versions soon.)

ppleumyy commented 2 years ago

Python version is 3.7.13, numpy version is 1.21.6.

@gykovacs

gykovacs commented 2 years ago

Cool, then the version is not the cause; it should work with this setup. If it is not much of a burden, could you please prepare a minimal working example? For instance, replace X_trainTF and y_train with random arrays of the same size, feed them into MulticlassOversampling, and see if it fails. I could use that as a minimal working example for debugging.

Also, could you please share the label distribution in y_train? Are the labels of integer type?
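
A minimal sketch of such a reproduction, for anyone following along. It assumes the shapes reported above, (4621, 2134) and (4621,), and six classes; the helper names (make_random_problem, describe, run_oversampler) and the class weights are made up for illustration. The oversampler call is copied from the original report and may differ between smote_variants versions.

```python
import numpy as np


def make_random_problem(n_samples=4621, n_features=2134, n_classes=6, seed=0):
    """Random dense features and imbalanced integer labels (synthetic stand-ins)."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_samples, n_features))
    # Hypothetical class weights, chosen only to make the labels imbalanced.
    weights = np.linspace(n_classes, 1, n_classes)
    weights = weights / weights.sum()
    y = rng.choice(n_classes, size=n_samples, p=weights)
    return X, y


def describe(X, y):
    """Collect what was asked for in the thread: shapes, types, label counts."""
    labels, counts = np.unique(y, return_counts=True)
    return {
        "X_type": type(X).__name__,
        "X_shape": X.shape,
        "y_shape": y.shape,
        "y_dtype": str(y.dtype),
        "label_counts": dict(zip(labels.tolist(), counts.tolist())),
    }


def run_oversampler(X, y):
    """Same call as in the original report; needs smote_variants installed."""
    import smote_variants as sv
    oversampler = sv.MulticlassOversampling(sv.distance_SMOTE())
    return oversampler.sample(X, y)


X, y = make_random_problem()
print(describe(X, y))
```

Calling run_oversampler(X, y) on these arrays, and then on the real data, would show whether the failure depends on the array sizes or on the data types.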

ppleumyy commented 2 years ago

This is my Google Colab workspace: https://colab.research.google.com/drive/1ETmdFjWEJdayBq_Ji3Eu6qKprrc0lC_G?usp=sharing

and the dataset file: Suicidal_K1_Train.csv

@gykovacs

gykovacs commented 2 years ago

Perfect, I'll look into it!

ppleumyy commented 2 years ago

Thank you very much!

@gykovacs

gykovacs commented 2 years ago

Hi @ppleumyy, all the smote_variants tools operate on numerical arrays. Your y_train is a pandas Series containing strings, and your X_trainTF is a sparse matrix, whereas it needs to be dense. With the following changes, everything seems to work as expected:

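# Replace the string labels with integer codes
y_train[y_train == 'Level 1'] = 1
y_train[y_train == 'Level 2'] = 2
y_train[y_train == 'Level 3'] = 3
y_train[y_train == 'Level 4'] = 4
y_train[y_train == 'Level 5'] = 5
y_train[y_train == 'Other'] = 0

# Convert the pandas Series to a numpy array
y_train = y_train.values

# Convert the sparse matrix to a dense one
X_trainTF = X_trainTF.todense()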
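
The same conversion can be sketched as a small helper. This is an illustration and not part of smote_variants: the function name to_numeric_inputs and the toy data are assumptions, while the label names and integer codes follow the fix above. It uses toarray() rather than todense(), since toarray() returns a plain numpy ndarray and todense() returns a numpy.matrix.

```python
import numpy as np
from scipy import sparse

# Label names taken from the thread; the integer codes follow the fix above.
LABEL_MAP = {"Other": 0, "Level 1": 1, "Level 2": 2,
             "Level 3": 3, "Level 4": 4, "Level 5": 5}


def to_numeric_inputs(X, y, label_map=LABEL_MAP):
    """Return a dense ndarray of features and an integer ndarray of labels."""
    # Densify only if X is a scipy sparse matrix; otherwise pass it through.
    X_dense = X.toarray() if sparse.issparse(X) else np.asarray(X)
    # np.asarray also accepts a pandas Series; unknown labels raise KeyError.
    y_int = np.array([label_map[label] for label in np.asarray(y)], dtype=int)
    return X_dense, y_int


# Tiny synthetic stand-ins for X_trainTF and y_train.
X_sparse = sparse.csr_matrix(np.array([[0.0, 0.5], [1.0, 0.0], [0.0, 0.0]]))
y_text = np.array(["Level 1", "Other", "Level 5"])

X_dense, y_int = to_numeric_inputs(X_sparse, y_text)
print(type(X_dense).__name__, X_dense.shape, y_int)
```

Keep in mind that densifying a 4621 x 2134 matrix of 64-bit floats takes roughly 79 MB, which is fine here but grows quickly with larger vocabularies.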