analyticalmindsltd / smote_variants

A collection of 85 minority oversampling techniques (SMOTE) for imbalanced learning with multi-class oversampling and model selection features
http://smote-variants.readthedocs.io
MIT License
631 stars 137 forks

Question: Combining these with Undersampling #59

Closed BradKML closed 2 years ago

BradKML commented 2 years ago

SMOTE variants can be combined with undersamplers to speed up classification on imbalanced datasets. However, oversampling normally precedes undersampling. Is it possible to generate a number of minority samples that is still smaller than the majority count? https://github.com/scikit-learn-contrib/imbalanced-learn/issues/925

gykovacs commented 2 years ago

For numerous oversampling techniques it is certainly possible. A number of oversampling algorithms, e.g. SMOTE_PSO, optimize the number of samples being generated themselves; with these techniques it is up to the algorithm how many minority samples are generated in the end. In many other cases, however, one can set the number of samples to be generated through the proportion parameter of the oversampling class.

Namely, let N_min and N_maj denote the number of minority and majority samples, so the difference is N_maj - N_min. The proportion parameter specifies the number of samples to be generated in terms of this difference: proportion * (N_maj - N_min) samples will be generated. For example, if proportion is set to 1, the class label distribution will be equalized, as the number of minority samples will match the number of majority samples after oversampling. If proportion is set to less than 1, fewer samples are generated.

If you want to generate a specific number of samples, say 10 additional minority samples, you can set the proportion parameter to 10 / (N_maj - N_min).
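The arithmetic above can be sketched in plain Python (the helper names are illustrative, not part of the smote-variants package):

```python
def proportion_for_extra(n_extra, n_maj, n_min):
    """Proportion value that makes an oversampler generate n_extra new samples."""
    return n_extra / (n_maj - n_min)

def n_generated(proportion, n_maj, n_min):
    """Number of minority samples generated for a given proportion."""
    return round(proportion * (n_maj - n_min))

# 10 additional minority samples with 100 majority and 20 minority samples:
p = proportion_for_extra(10, 100, 20)  # 10 / 80 = 0.125
print(p, n_generated(p, 100, 20))      # 0.125 10

# proportion = 1 equalizes the classes: 80 new samples, 100 minority in total
print(n_generated(1.0, 100, 20))       # 80
```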

The proportion parameter is supported by more than 60 oversampling techniques in the smote-variants package.

BradKML commented 2 years ago

Sorry for asking, but the API doc does not note which of the 80+ oversampling techniques support the proportion parameter.

> If you want to generate a certain number of samples, for example, 10 additional minority samples are desired, then you can set the proportion parameter to 10 / (N_maj - N_min).

What if I want to 5x or 10x minority samples before using an undersampler?

gykovacs commented 2 years ago

That's a good point. I have just created a release (0.7.1) with an additional query function, get_proportion_oversamplers, which returns all oversampler classes with a proportion parameter:

import smote_variants as sv

# list of all oversampler classes with a proportion parameter
prop_oversamplers = sv.get_proportion_oversamplers()

Also please note that even when a technique has a proportion parameter, the resulting sample counts might be inaccurate, as some oversampling techniques also change the number of majority samples (e.g. by noise filtering). Those which honor proportion accurately (and do not change the majority samples) are exactly the ones suitable for multiclass oversampling. You can query these by:

import smote_variants as sv

# all oversamplers that have a proportion parameter and only extend the
# set of minority samples, leaving the majority samples intact
extensive_oversamplers = sv.get_multiclass_oversamplers()

Regarding the combination of oversamplers and filters, it is completely up to the user how they are combined. Some oversampling techniques inherently contain a noise filter (like SMOTE_TomekLinks). As these noise filters are used in multiple oversampling techniques, they have been put into a separate module for ease of reuse. However, one can apply them before oversampling or to the result of oversampling without restriction; any pipeline of noise filters and oversampling techniques can be constructed.

To generate M * N_min additional minority samples, one needs to set the proportion parameter to M * N_min / (N_maj - N_min). Note that this leaves the minority class with (M + 1) * N_min samples in total; to end up with exactly M times the original minority count, use (M - 1) * N_min / (N_maj - N_min) instead.
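To make the earlier "5x" question concrete: ending up with M times the original minority count requires (M - 1) * N_min new samples. The arithmetic (illustrative plain Python, not package code) works out as:

```python
n_maj, n_min, m = 1000, 100, 5

# proportion that multiplies the minority class to m * n_min in total
proportion = (m - 1) * n_min / (n_maj - n_min)  # 400 / 900

# number of new minority samples an accurate oversampler would generate
generated = round(proportion * (n_maj - n_min))

print(generated, n_min + generated)  # 400 new samples, 500 minority in total
```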