analyticalmindsltd / smote_variants

A collection of 85 minority oversampling techniques (SMOTE) for imbalanced learning with multi-class oversampling and model selection features
http://smote-variants.readthedocs.io
MIT License

GridSearchCV classifier parameters: int vs list #65

Open VadimKufenko opened 1 year ago

VadimKufenko commented 1 year ago

Thank you very much for providing the smote_variants package - an excellent tool!

It seems that the parameters cannot be passed as lists. I have a question regarding parameter tuning: following the logic of the manual, one can build the grid using integers:

oversampler = ('smote_variants', 'MulticlassOversampling', {'oversampler': 'MWMOTE', 'oversampler_params': {}})

classifier = ('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators':50, 'max_depth': 3, 'min_samples_split': 2})

model= Pipeline([('scale', StandardScaler()), ('clf', sv.classifiers.OversamplingClassifier(oversampler, classifier))])

model

param_grid = {
    'clf__oversampler': [
        ('smote_variants', 'MWMOTE', {'proportion': 0.5}),
        ('smote_variants', 'MWMOTE', {'proportion': 1.0}),
        ('smote_variants', 'MWMOTE', {'proportion': 1.5}),
    ],
    'clf__classifier': [
        ('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 60}),
        ('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 1000}),
        ('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 40}),
        ('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 10}),
        ('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 10}),
        ('sklearn.ensemble', 'RandomForestClassifier', {'max_depth': 9}),
        ('sklearn.ensemble', 'RandomForestClassifier', {'max_depth': 4}),
        ('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_split': 9}),
        ('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_split': 5}),
    ],
}

Yet in this case, GridSearchCV reports only one parameter in its result. Another formulation of the grid yields all parameters, but only the last value of each, which is most likely not optimal:

param_grid = {
    'clf__classifier': [
        ('sklearn.ensemble', 'RandomForestClassifier',
         {'max_depth': 20, 'max_depth': 7, 'max_depth': 9, 'max_depth': 2},
         {'n_estimators': 300, 'n_estimators': 180, 'n_estimators': 25, 'n_estimators': 2},
         {'min_samples_split': 3, 'min_samples_split': 19, 'min_samples_split': 2},
         {'min_samples_leaf': 3, 'min_samples_leaf': 18, 'min_samples_leaf': 2},
        )
    ]
}
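A note on the "only the last values" behaviour: it is consistent with how Python itself treats repeated keys in a dict literal, independently of smote_variants or GridSearchCV. Each repeated key overwrites the previous one, so the grid never sees the earlier values. A minimal sketch (the variable name `params` is only for illustration):

```python
# A dict literal that repeats a key keeps only the last value for that key.
params = {'max_depth': 20, 'max_depth': 7, 'max_depth': 9, 'max_depth': 2}

print(params)       # {'max_depth': 2}
print(len(params))  # 1
```

So a dictionary written this way carries a single candidate per parameter before any search starts.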

The parameters are essentially required to be a dictionary whose values are floats or integers, not lists. Could you please provide additional instructions on passing parameters to the grid for fine-tuning?

Any kind of hints would be very much appreciated. Thank you in advance!

gykovacs commented 1 year ago

Hi @VadimKufenko, thank you for raising the issue, I will look into it in the next couple of days!

VadimKufenko commented 1 year ago

Dear György @gykovacs , thank you so much for following up! This is very kind of you! I would like to express my fascination with the smote_variants package - a tremendous work!

Meanwhile I tried different arrangements of the parameters since for lists and sets one gets the following error:

ValueError: n_estimators must be an integer, got <class 'set'>
ValueError: n_estimators must be an integer, got <class 'list'>

I saw that set_params(self, parameters) in the OversamplingClassifier uses dictionaries, but in the end it comes down to integers, so I tried to improvise further. Please see my examples with generated data below; perhaps they help to identify the issue. With grid search, depending on the arrangement, either i) the top parameters are optimized and the rest ignored, or ii) the last ones are listed as the optimal ones, although they cannot be optimal.

Please note that I am using imblearn.pipeline, but I had similar issues with the sklearn pipeline.

Example

import numpy as np

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from imblearn.pipeline import Pipeline, make_pipeline  # IMBLEARN pipeline
from sklearn.metrics import precision_score, recall_score, average_precision_score, f1_score, make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV

import smote_variants as sv
from smote_variants.classifiers import OversamplingClassifier

X, y = make_classification(n_samples=2000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, n_clusters_per_class=1, weights=[0.8,0.1,0.1], random_state=42)

oversampler = ('smote_variants', 'MulticlassOversampling', {'oversampler': 'MWMOTE', 'oversampler_params': {'proportion': 1.0}})

classifier = ('sklearn.ensemble', 'RandomForestClassifier', {})

imbalanced_pipeline = make_pipeline(OversamplingClassifier(oversampler, classifier))

Case one

param_grid_one = {
    'oversamplingclassifier__classifier': [
        ('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 50}),
        ('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 1000}),
        ('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 10}),
        ('sklearn.ensemble', 'RandomForestClassifier', {'max_depth': 21}),
        ('sklearn.ensemble', 'RandomForestClassifier', {'max_depth': 2}),
        ('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_split': 2}),
        ('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_split': 30}),
        ('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_leaf': 2}),
        ('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_leaf': 77}),
    ]
}

scorers = {'precision': make_scorer(precision_score, average = 'macro'), 'f1_macro': make_scorer(f1_score, average = 'macro') }

k_folds = 5
rskfold = RepeatedStratifiedKFold(n_splits=k_folds, n_repeats=5, random_state=0)

framework_one = GridSearchCV(imbalanced_pipeline, param_grid=param_grid_one, cv=rskfold, n_jobs=-1, refit='f1_macro', verbose=0, scoring=scorers)

results_one=framework_one.fit(X, y)

results_one.best_estimator_.named_steps['oversamplingclassifier']

Case two

param_grid_two = {
    'oversamplingclassifier__classifier': [
        ('sklearn.ensemble', 'RandomForestClassifier', {
            'max_depth': 10, 'max_depth': 77,
            'n_estimators': 50, 'n_estimators': 1000, 'n_estimators': 11,
            'min_samples_split': 2, 'min_samples_split': 33,
            'min_samples_leaf': 2, 'min_samples_leaf': 77,
        })
    ]
}

framework_two = GridSearchCV(imbalanced_pipeline, param_grid=param_grid_two, cv=rskfold, n_jobs=-1, refit='f1_macro', verbose=0, scoring=scorers)
results_two = framework_two.fit(X, y)

results_two.best_estimator_.named_steps['oversamplingclassifier']

Hope this helps - let me know if you have questions on these examples! Thank you!

gykovacs commented 1 year ago

Hi @VadimKufenko , sorry for the delay, I'm looking into this right now. I am not sure if I understand the problem properly, although there are many tricky things here.

First of all, thanks for bringing up the issue. This area (multiclass oversampling with grid parameter selection) is rarely used, so there may be inconveniences that I can fix in the short term if we manage to come up with the outline of a better usage.

Four things to highlight in advance:

1) When it comes to multi-class oversampling, explicit values of the proportion parameter are not used. The reason is that "proportion of what to what" is unanswered. The proportion parameter is needed internally and is set separately for each class, so that as many samples are generated as needed to match the cardinality of the majority class. That is, given a majority class of 100 vectors and two further classes with 70 and 50 records, internally there will be one oversampling with a proportion of 100/70 and another with a proportion of 100/50 to equalize the cardinalities. Whatever is set explicitly as a proportion parameter will be overwritten internally. Therefore, in a multi-class case, a grid search over the proportion parameter executes the same thing again and again behind the scenes.

2) RandomForestClassifier does not work well with SMOTE-like techniques. It does run, but the performance scores are highest when oversampling is disabled (proportion=0). There are several reasons for this, mainly that SMOTE sampling interferes negatively with the internal operations of random forests, namely bootstrapping and the random feature selection in the decision nodes. (I'm currently working on a paper on how to resolve these issues.) Generally, I recommend using other classifiers.

3) If the classification problem is fairly imbalanced and oversampling cannot fix the issues, I think it can end in F1 scores being zero (when precision or recall is 0, that is, there are no true positives). In these cases a grid search might end up selecting the first parameterization, as the score of all parameterizations is the same, 0. I think it is reasonable to work with roc_auc_score and derive some average roc_auc_score for multi-class problems.

4) I realized that the interface of MulticlassOversampling does not follow the interface of all other objects, which require a tuple of (smote_package, smote_name, smote_parameters). This should be changed in the future.
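The arithmetic in point 1 can be made concrete with a small sketch. It only mirrors the description above (each non-majority class is brought up to the majority cardinality) and is not taken from the smote_variants source; the labels and the `plan` dictionary are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical labels matching the example: 100, 70 and 50 records per class.
y = [0] * 100 + [1] * 70 + [2] * 50

counts = Counter(y)
majority_label, majority_count = counts.most_common(1)[0]

# For every non-majority class, compute the ratio described in the text
# (majority size / class size) and the number of samples that must be
# generated to reach the majority cardinality.
plan = {}
for label, n in counts.items():
    if label == majority_label:
        continue
    plan[label] = {
        'ratio': majority_count / n,
        'n_to_generate': majority_count - n,
    }

for label, info in sorted(plan.items()):
    print(label, round(info['ratio'], 3), info['n_to_generate'])
# 1 1.429 30
# 2 2.0 50
```

Because these per-class values depend only on the class counts, a user-supplied proportion has nothing left to control, which is why iterating over it in a grid repeats the same oversampling.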

With all that said, I have come up with something that I think is what you wanted to achieve, and it seems to work properly, i.e., it is not simply the first or last parameters that are selected from the grid: each combination seems to be evaluated properly. Note, however, that it contains an iteration over various proportion parameters, which does not make much sense as they are ignored. Let me know if this is what you wanted to achieve, and if not, what is wrong and how it should work.

from sklearn.datasets import make_classification
from imblearn.pipeline import Pipeline
from sklearn.metrics import precision_score, f1_score, make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler

import smote_variants as sv

X, y = make_classification(n_samples=500, n_features=10, n_informative=5, n_redundant=5, n_classes=3, n_clusters_per_class=1, weights=[0.8,0.1,0.1], random_state=42)

k_folds=5
rskfold = RepeatedStratifiedKFold(n_splits=k_folds, n_repeats=20, random_state=0)

scorers = {'precision': make_scorer(precision_score, average = 'macro'), 'f1_macro': make_scorer(f1_score, average = 'macro') }

dummy_init_oversampler = ('smote_variants', 'MulticlassOversampling', {'oversampler': 'MWMOTE', 'oversampler_params': {}})

dummy_init_classifier = ('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators':50, 'max_depth': 3, 'min_samples_split': 2})

model= Pipeline([('scale', StandardScaler()), 
                 ('clf', sv.classifiers.OversamplingClassifier(dummy_init_oversampler, dummy_init_classifier))])

param_grid= {
'clf__oversampler':[('smote_variants', 'MulticlassOversampling', {'oversampler': 'SMOTE', 'oversampler_params': {'proportion': 0.0}}),
                    ('smote_variants', 'MulticlassOversampling', {'oversampler': 'SMOTE', 'oversampler_params': {'proportion': 0.25}}),
                    ('smote_variants', 'MulticlassOversampling', {'oversampler': 'SMOTE', 'oversampler_params': {'proportion': 0.5}}),
                    ('smote_variants', 'MulticlassOversampling', {'oversampler': 'SMOTE', 'oversampler_params': {'proportion': 0.75}}),
                    ('smote_variants', 'MulticlassOversampling', {'oversampler': 'SMOTE', 'oversampler_params': {'proportion': 1.0}})],
'clf__classifier':[('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': None}),
                    ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 2}),
                    ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 10}),
                    ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 15}),
                    ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 20}),
                    ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 25})
] }

framework = GridSearchCV(model,
                            param_grid=param_grid,
                            cv=rskfold,
                            n_jobs=-1,
                            refit='f1_macro',
                            verbose=0,
                            scoring=scorers)

results=framework.fit(X, y)

results.best_estimator_.named_steps['clf']

This code ends with this:

OversamplingClassifier(classifier=('sklearn.tree', 'DecisionTreeClassifier',
                                   {'min_samples_leaf': 2}),
                       oversampler=('smote_variants', 'MulticlassOversampling',
                                    {'oversampler': 'SMOTE',
                                     'oversampler_params': {'proportion': 0.5}}))
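Since each grid entry in the example above is a complete (package, class name, parameters) tuple with scalar values, one way to cover several values of several parameters is to generate one tuple per combination. The sketch below uses only the standard library; the helper name `expand_candidates` is hypothetical and not part of smote_variants, and the parameter values are arbitrary.

```python
from itertools import product

def expand_candidates(package, name, grid):
    """Turn {'param': [values, ...]} into a list of
    (package, name, {'param': value, ...}) tuples, one per combination."""
    keys = list(grid)
    return [
        (package, name, dict(zip(keys, values)))
        for values in product(*(grid[key] for key in keys))
    ]

classifiers = expand_candidates(
    'sklearn.tree', 'DecisionTreeClassifier',
    {'min_samples_leaf': [2, 10], 'max_depth': [3, 5]},
)

# The generated list can be used as the value of 'clf__classifier'.
param_grid = {'clf__classifier': classifiers}

print(len(classifiers))  # 4
print(classifiers[0])
# ('sklearn.tree', 'DecisionTreeClassifier', {'min_samples_leaf': 2, 'max_depth': 3})
```

Every dictionary then holds exactly one scalar per parameter, which matches the format that the classifier tuples in the grid above already use.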