analyticalmindsltd / smote_variants

A collection of 85 minority oversampling techniques (SMOTE) for imbalanced learning with multi-class oversampling and model selection features
http://smote-variants.readthedocs.io
MIT License
620 stars 138 forks

When using SOMO, why are the two classes not balanced and the sample counts unchanged? #39

Open leaphan opened 3 years ago

gykovacs commented 3 years ago

There can be multiple reasons for that. In many cases the authors of a particular SMOTE variant did not cover all possible corner cases, for example: 1) all minority samples are treated as noise according to the technique's definition of noise; 2) the method wants to work with, say, 5 nearest neighbors, but there are only 3 minority samples; 3) mathematical techniques such as self-organizing maps do not converge; and so on.

All of these arise because the data is not compatible with the parameter settings and assumptions of the SMOTE variant.
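To make corner case 2 concrete, here is a minimal sketch of a check one could run before oversampling. The helper name `check_minority_support` and its `n_neighbors` argument are hypothetical and not part of smote_variants; the sketch only assumes that a neighbor-based technique needs more minority samples than the number of neighbors it looks for.

```python
from collections import Counter


def check_minority_support(y, n_neighbors=5):
    """Return (minority_label, minority_count, enough) for the labels in y.

    `enough` is True when the minority class has more than `n_neighbors`
    samples, so that each minority sample can have `n_neighbors` distinct
    minority neighbors. This is an illustrative rule, not the library's own.
    """
    counts = Counter(y)
    # The minority class is the label with the fewest samples.
    label, count = min(counts.items(), key=lambda kv: kv[1])
    return label, count, count > n_neighbors


# Toy labels: 20 majority samples and only 3 minority samples.
labels = [0] * 20 + [1] * 3
print(check_minority_support(labels, n_neighbors=5))
```

If the last element of the returned tuple is `False`, a technique that relies on that many neighbors may not be able to generate samples for this data.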

Where I found reasonable resolutions, I implemented them. Where a resolution is infeasible (for example, determining the 5 closest neighbors when there are only 3 samples in a class), the data is returned unaltered, although I would expect some message in the logs if logging is enabled.
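A rough way to notice the "returned unaltered" case is to compare class counts before and after sampling, with logging switched on. The sketch below assumes the library reports through Python's standard `logging` module; the helper `class_counts_changed` is a hypothetical name introduced only for illustration.

```python
import logging
from collections import Counter

# Assumption: messages are emitted via the standard logging module, so a
# root-level configuration at INFO makes them visible on the console.
logging.basicConfig(level=logging.INFO)


def class_counts_changed(y_before, y_after):
    """Return True if the per-class sample counts differ after oversampling."""
    return Counter(y_before) != Counter(y_after)


# Toy example: identical labels before and after mean nothing was generated.
y = [-1] * 10 + [1] * 2
print(class_counts_changed(y, y))        # unchanged labels
print(class_counts_changed(y, y + [1]))  # one extra minority sample
```

When the function returns `False` after a call to `sample`, the log output is the first place to look for the reason.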

Most likely your data is a corner case of the SOMO implementation with the parameters you used. Adjusting the parameters might lead to a properly operating SOMO.
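Trying several parameter settings can be automated along these lines. This is only a sketch: `first_effective_setting` is a hypothetical helper, and it assumes nothing about the oversampler beyond a constructor that takes keyword arguments and a `sample(X, y)` method that returns the resampled data and labels.

```python
from collections import Counter


def first_effective_setting(make_oversampler, settings, X, y):
    """Try each keyword-argument dict in `settings` in order.

    Returns (kwargs, X_samp, y_samp) for the first setting whose sample()
    call changes the class counts, or None if no setting has any effect.
    """
    before = Counter(y)
    for kwargs in settings:
        X_samp, y_samp = make_oversampler(**kwargs).sample(X, y)
        if Counter(y_samp) != before:
            return kwargs, X_samp, y_samp
    return None
```

For example, `first_effective_setting(sv.SOMO, settings, X, y)` could be called with `settings` as a list of keyword dictionaries, using whatever parameter names the installed version of SOMO actually accepts.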

Also, if you share a minimal working example, I can look into it.

leaphan commented 3 years ago

Thanks for your reply. I wrote code like this:

    pip install -U imbalanced-learn
    pip install smote-variants

    import numpy as np
    import smote_variants as sv

    import imblearn.datasets as imbd

    from imblearn.datasets import fetch_datasets

    datasets = fetch_datasets(filter_data=['oil'])
    X, y = datasets['oil']['data'], datasets['oil']['target']
    [print('Class {} has {} instances'.format(label, count))
     for label, count in zip(*np.unique(y, return_counts=True))]

    oversampler = sv.SOMO()
    X_samp, y_samp = oversampler.sample(X, y)

    [print('Class {} has {} instances after oversampling'.format(label, count))
     for label, count in zip(*np.unique(y_samp, return_counts=True))]
    print(X_samp, y_samp)

and the printed result:

    Class -1 has 896 instances
    Class 1 has 41 instances
    Class -1 has 896 instances after oversampling
    Class 1 has 41 instances after oversampling

After oversampling, the number of samples in each class is unchanged.