analyticalmindsltd / smote_variants

A collection of 85 minority oversampling techniques (SMOTE) for imbalanced learning with multi-class oversampling and model selection features
http://smote-variants.readthedocs.io
MIT License

Question: Regarding time complexity of Oversamplers and "Noise Filters" #60

Closed BradKML closed 1 year ago

BradKML commented 1 year ago

For scikit-learn, some have created tools for profiling latency (model fitting time) against error.

The Scitime estimator is useful for some of the algorithms in scikit-learn, but not all of them.

It would be useful to benchmark and measure the time complexity of the oversamplers and see which ones are fast (or not) as a function of dataset size and the log-odds of the majority class proportion.
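Such a benchmark could be sketched roughly as follows. This is a minimal, hypothetical example: it times a naive random-duplication oversampler (a stand-in, not any technique from smote_variants) across growing synthetic datasets, which is the kind of harness that could later wrap the real oversamplers:

```python
import random
import time

def naive_oversample(X, y, minority_label=1):
    """Stand-in oversampler: duplicate random minority samples
    until the two classes are balanced. Used here only so the
    benchmark harness is self-contained."""
    minority = [x for x, lbl in zip(X, y) if lbl == minority_label]
    majority = [x for x, lbl in zip(X, y) if lbl != minority_label]
    resampled = list(minority)
    while len(resampled) < len(majority):
        resampled.append(random.choice(minority))
    X_out = majority + resampled
    y_out = [0] * len(majority) + [minority_label] * len(resampled)
    return X_out, y_out

def benchmark(oversample, sizes, imbalance=0.1, n_features=5):
    """Measure wall-clock oversampling time as the dataset grows,
    at a fixed minority proportion (`imbalance`)."""
    timings = []
    for n in sizes:
        n_min = max(1, int(n * imbalance))
        X = [[random.random() for _ in range(n_features)] for _ in range(n)]
        y = [1] * n_min + [0] * (n - n_min)
        start = time.perf_counter()
        oversample(X, y)
        timings.append((n, time.perf_counter() - start))
    return timings

for n, t in benchmark(naive_oversample, [1000, 2000, 4000]):
    print(f"n={n}: {t:.4f}s")
```

Varying `imbalance` in the same harness would cover the majority-proportion axis of the question as well.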

gykovacs commented 1 year ago

I agree, the time complexity of oversampling techniques is somewhat unexplored. Some runtime measurements are incorporated, though: there was an extensive evaluation (shared in the corresponding papers), and based on the average runtimes over 104 datasets, a ranking of the oversampling techniques is available. For example, if one is interested in the 10 quickest techniques overall, they can be queried as

import smote_variants as sv

# get 10 quickest oversamplers
oversamplers = sv.get_all_oversamplers(n_quickest=10)

Although this is not a true time complexity analysis, it can still be used to query computationally efficient techniques for further research or application purposes.

Nevertheless, a proper time complexity analysis, varying the number of majority and minority samples, the number of features, imbalance ratios, class overlap, etc., would be very useful.
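One way such an analysis could be approached empirically is to fit the slope of log(runtime) against log(n) over a range of sample counts, giving an estimate of the exponent k in time ≈ n^k. The sketch below is illustrative only: it uses a hypothetical quadratic-cost workload (a full pairwise-distance pass, similar in spirit to the neighbour searches many SMOTE variants rely on), not any routine from smote_variants:

```python
import math
import time

def pairwise_distance_pass(X):
    """Stand-in O(n^2) workload: accumulate all pairwise squared
    Euclidean distances, mimicking a naive neighbour search."""
    total = 0.0
    for a in X:
        for b in X:
            total += sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return total

def empirical_exponent(workload, sizes, n_features=3):
    """Estimate k in time ~ n^k by a least-squares fit of
    log(time) on log(n) across the given sample counts."""
    points = []
    for n in sizes:
        X = [[(i * j) % 7 / 7.0 for j in range(n_features)] for i in range(n)]
        start = time.perf_counter()
        workload(X)
        points.append((math.log(n), math.log(time.perf_counter() - start)))
    mx = sum(x for x, _ in points) / len(points)
    my = sum(y for _, y in points) / len(points)
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den

print(f"estimated exponent: {empirical_exponent(pairwise_distance_pass, [200, 400, 800]):.2f}")
```

Repeating the fit while separately varying features, imbalance ratio, and class overlap would separate the contribution of each factor.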

Regarding the noise filters, they are not intended to be primary or necessary steps in oversampling pipelines, but a similar analysis of them could still be useful.