EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Imbalanced Learn? #547

Open ksyme99 opened 7 years ago

ksyme99 commented 7 years ago

I was wondering how feasible it would be to incorporate the sampling preprocessors from imbalanced-learn?

https://github.com/scikit-learn-contrib/imbalanced-learn

I have had a look around the tpot code but unfortunately can't quite figure out how it hangs together well enough to know how painful this would be (even just hacking it in for myself!).

If this is of interest/possible I would have a proper go at incorporating it.

weixuanfu commented 7 years ago

I tried to use config_dict to incorporate imbalanced-learn with the code below:

#  with imbalanced-learn-0.3.0.dev0
from sklearn.datasets import load_iris
from imblearn.datasets import make_imbalance
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier
ratio = {0: 10, 1: 20, 2: 30}
iris = load_iris()
X, y = make_imbalance(iris.data, iris.target, ratio=ratio)

tpot_config = {

    'sklearn.naive_bayes.BernoulliNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    },

    'imblearn.under_sampling.RandomUnderSampler': {
        'ratio': ['minority', 'majority', 'all'],
        'replacement': [True, False]
    }

}

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=3,
                      config_dict=tpot_config, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

However, I got a lot of error messages like:

All intermediate steps should be transformers and implement fit and transform. 
'RandomUnderSampler(random_state=None, ratio='minority', replacement=True, return_indices=False)'
(type <class 'imblearn.under_sampling.prototype_selection.random_under_sampler.RandomUnderSampler'>) doesn't

Maybe we need to wrap the imbalanced-learn object as a subclass of sklearn.base.TransformerMixin and add an implementation of transform.
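
For reference, a minimal sketch of what that wrapping idea might look like (the class name ResamplerAsTransformer is hypothetical). It also exposes the fundamental limitation discussed further down the thread: transform() only receives X, so y can never be resampled through the standard sklearn Pipeline API.

from sklearn.base import BaseEstimator, TransformerMixin

class ResamplerAsTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: satisfies the Pipeline's transformer check,
    but cannot actually rebalance the training data."""
    def __init__(self, sampler):
        self.sampler = sampler  # e.g. an imblearn RandomUnderSampler

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Identity: sklearn's transform() cannot change the number of rows,
        # because y is not passed in and must stay aligned with X.
        return X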

ksyme99 commented 7 years ago

It seems that if you are using pipelines, imbalanced-learn comes with its own implementation, imblearn.pipeline.Pipeline, which has a bunch of extra functions to do with transforming and sampling. It looks to be about supporting a different number of examples through a pipeline, rather than just different features. It probably only makes sense for samplers to sit at the start of the pipeline too, and I'm unsure how to enforce that.
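
To make that concrete, here is a minimal standalone sketch of imblearn's own Pipeline (assuming a recent imbalanced-learn, where the sampler parameter is sampling_strategy rather than the older ratio):

from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.naive_bayes import BernoulliNB

pipe = Pipeline([
    # the sampler sits at the start, before any estimator
    ('sampler', RandomUnderSampler(sampling_strategy='majority')),
    ('clf', BernoulliNB()),
])
# pipe.fit(X_train, y_train)  # fit_resample runs here, on the training data only
# pipe.predict(X_test)        # no resampling happens at predict time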

saddy001 commented 6 years ago

I have, however, used an undersampler from imblearn with success:

    'imblearn.under_sampling.TomekLinks': {
    },
shahlaebrahimi commented 6 years ago

@saddy001 Could you please let me know how to use BalancedBaggingClassifier? Especially if xgboost is to be selected and tuned as the base estimator? Thanks

romanovzky commented 5 years ago

I guess no progress has been made? The difficulty here is that ImbLearn applies fit and sample; notice the latter is not transform, since it does not change the features (a transformation) but only re-samples the rows (hence sampling).

For this reason, ImbLearn provides its own Pipeline module, as it needs to wrap the sampling functionality in a way that makes sense (it only samples on training and not on testing, etc.) and stays compatible with the scikit-learn API flow.

Since most real-life data is highly imbalanced, I think ImbLearn compatibility is highly desirable.
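
A small illustration of the API difference described above (using fit_resample, the method name in imbalanced-learn >= 0.4; older releases call it fit_sample):

import numpy as np
from imblearn.under_sampling import RandomUnderSampler

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)  # imbalanced: eight 0s, two 1s

X_res, y_res = RandomUnderSampler().fit_resample(X, y)
print(X_res.shape, y_res.shape)  # (4, 2) (4,) -- the row count changed, not the features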

bencorwin-tc commented 5 years ago

Agreed, this feature would be super useful.

AyrtonB commented 3 years ago

I'm also trying to integrate imblearn with TPOT and have made a number of code changes to try to make it happen. After making changes in what seemed like the obvious places, I'm now met with an error which I'm not sure how to deal with.

Any advice would be much appreciated!

~\anaconda3\envs\env\lib\site-packages\tpot\base.py in _update_top_pipeline(self)
    837                                                     error_score="raise")
    838                         break
--> 839                 raise RuntimeError('There was an error in the TPOT optimization '
    840                                    'process. This could be because the data was '
    841                                    'not formatted properly, or because data for '

RuntimeError: There was an error in the TPOT optimization process. This could be because the data was not formatted properly, or because data for a regression problem was provided to the TPOTClassifier object. Please make sure you passed the data to TPOT correctly. If you enabled PyTorch estimators, please check the data requirements in the online documentation: https://epistasislab.github.io/tpot/using/


Code Additions/Changes

Need to add a way to check that an object is a resampler

from imblearn.over_sampling import RandomOverSampler

def _is_resampler(estimator):
    return hasattr(estimator, "fit_resample")

assert _is_resampler(RandomOverSampler)

I added _is_resampler to operator_utils.py, then used it in two places within TPOTOperatorClassFactory: first at line 201

        if is_classifier(op_obj):
            class_profile["root"] = True
            optype = "Classifier"
        elif is_regressor(op_obj):
            class_profile["root"] = True
            optype = "Regressor"
        elif _is_transformer(op_obj):
            optype = "Transformer"
        elif _is_selector(op_obj):
            optype = "Selector"
        elif _is_resampler(op_obj):
            optype = "Resampler"
        else:
            raise ValueError(
                "optype must be one of: Classifier, Regressor, Transformer, Selector, Resampler"
            )

and then at line 330

                    if inspect.isclass(doptype):  # an estimator
                        if (
                            issubclass(doptype, BaseEstimator)
                            or is_classifier(doptype)
                            or is_regressor(doptype)
                            or _is_transformer(doptype)
                            or _is_resampler(doptype)
                            or issubclass(doptype, Kernel)
                        ):


As raised by @ksyme99, the pipeline needs to be swapped out for the one from imblearn.

In base.py I believe we just need to change the following line

from sklearn.pipeline import make_pipeline, make_union

to

from sklearn.pipeline import make_union
from imblearn.pipeline import make_pipeline
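
As a quick standalone sanity check (outside TPOT), imblearn's make_pipeline accepts samplers where sklearn's would raise the error quoted earlier in the thread:

from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(RandomOverSampler(), LogisticRegression())
# pipe.fit(X_train, y_train)  # resampling is applied to the training data only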


And in export_utils.py I believe we need to change

def _starting_imports(operators, operators_used):

    ...

    if num_op_root > 1:
        return {
            'sklearn.model_selection':  ['train_test_split'],
            'sklearn.pipeline':         ['make_pipeline', 'make_union'],
            'tpot.builtins':  ['StackingEstimator'],
        }
    elif num_op > 1:
        return {
            'sklearn.model_selection':  ['train_test_split'],
            'sklearn.pipeline':         ['make_pipeline']
        }

to

def _starting_imports(operators, operators_used):

    ...

    if num_op_root > 1:
        return {
            'sklearn.model_selection':  ['train_test_split'],
            'sklearn.pipeline':         ['make_union'],
            'imblearn.pipeline':         ['make_pipeline'],
            'tpot.builtins':  ['StackingEstimator'],
        }
    elif num_op > 1:
        return {
            'sklearn.model_selection':  ['train_test_split'],
            'imblearn.pipeline':         ['make_pipeline']
        }
weixuanfu commented 3 years ago

@AyrtonB Could you please share a link to your branch with those changes and also provide a demo to reproduce the error? I can take a look.

AyrtonB commented 3 years ago

The error appears to be specific to a custom component I'm using which requires the index of the passed data. I have this working in imblearn, but trying to include it in TPOT is what broke it; one step at a time...

The good news is that I don't have this issue with standard imblearn components, and the following will work if you use the fork I've made here; I've just made a PR as well.

import numpy as np  # needed for np.arange in the config below
from tpot import TPOTClassifier
from sklearn.datasets import make_classification

classifier_config_dict = {
    # Classifiers
    'sklearn.ensemble.ExtraTreesClassifier': {
        'n_estimators': [100],
        'criterion': ["gini", "entropy"],
        'max_features': np.arange(0.05, 1.01, 0.05),
        'min_samples_split': range(2, 21),
        'min_samples_leaf': range(1, 21),
        'bootstrap': [True, False]
    },

    # Preprocessors
    'imblearn.over_sampling.RandomOverSampler': {
    },

}

X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1,
                           weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)

pipeline_optimizer = TPOTClassifier(generations=5, population_size=10, cv=3,
                                   random_state=42, verbosity=2, n_jobs=-1, 
                                   config_dict=classifier_config_dict, 
                                   template='RandomOverSampler-Classifier')

pipeline_optimizer.fit(X, y)

pipeline = pipeline_optimizer.fitted_pipeline_

Output

Generation 1 - Current best internal CV score: 0.98940007916784

Generation 2 - Current best internal CV score: 0.98940007916784

Generation 3 - Current best internal CV score: 0.9896000391758383

Generation 4 - Current best internal CV score: 0.9896000391758383

Generation 5 - Current best internal CV score: 0.9897999991838367

Best pipeline: ExtraTreesClassifier(RandomOverSampler(input_matrix), bootstrap=False, criterion=gini, max_features=0.3, min_samples_leaf=1, min_samples_split=10, n_estimators=100)
Wall time: 28.4 s
AyrtonB commented 3 years ago

Regarding the specific issue I'm encountering.

Goal: Want to be able to resample based on the group specified in a multi-index
Current progress: Custom components work fine in a standard imblearn pipeline
Current issue: The custom components break the TPOT regression optimisation

A dummy dataset can be created like this

import pandas as pd
from sklearn.datasets import make_regression

flatten = lambda t: [item for sublist in t for item in sublist]
months = flatten([[x]*100*x for x in range(1, 13)])
idx = pd.MultiIndex.from_arrays([range(len(months)), months], names=['unique', 'month'])

X, y = make_regression(n_samples=len(idx), n_features=20)

df_X, s_y = pd.DataFrame(X, index=idx), pd.Series(y, index=idx)

df_X

The components are defined in a script called operators.py

import pandas as pd                    # used by the index-preserving decorator below
from sklearn.metrics import r2_score  # default score_func for the regressor below
from sklearn.ensemble import RandomForestRegressor
from imblearn.over_sampling import RandomOverSampler

def add_series_index(idx_arg_pos=0):
    def decorator(func):
        def decorator_wrapper(*args, **kwargs):
            input_s = args[idx_arg_pos]
            assert isinstance(input_s, (pd.Series, pd.DataFrame))
            result = pd.Series(func(*args, **kwargs), index=input_s.index)
            return result
        return decorator_wrapper
    return decorator

class PandasRandomForestRegressor(RandomForestRegressor):
    def __init__(self, n_estimators=100, *, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None, score_func=None):
        super().__init__(n_estimators=n_estimators, criterion=criterion, max_depth=max_depth, min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf, min_weight_fraction_leaf=min_weight_fraction_leaf, max_features=max_features, max_leaf_nodes=max_leaf_nodes, min_impurity_decrease=min_impurity_decrease, min_impurity_split=min_impurity_split, bootstrap=bootstrap, oob_score=oob_score, n_jobs=n_jobs, random_state=random_state, verbose=verbose, warm_start=warm_start, ccp_alpha=ccp_alpha, max_samples=max_samples)

        if score_func is None:
            self.score_func = r2_score
        else:
            self.score_func = score_func

    @add_series_index(1)
    def predict(self, X):
        pred = super().predict(X)
        return pred

    def score(self, X, y, *args, **kwargs):        
        y_pred = self.predict(X)
        score = self.score_func(y, y_pred, *args, **kwargs)
        return score

def custom_resampler_helper(X, y, class_col, resample_func):
    # Checking indexes match
    assert X.index.equals(y.index), 'X and y indexes should be the same'

    # Extracting idx names and mapping to y values
    idx_names = X.index.names
    idx_to_y = dict(zip(y.reset_index()[idx_names].apply(tuple, axis=1).values, y.values))

    # Resampling values
    classes = X.reset_index()[class_col]
    X_resampled, _ = resample_func(X.reset_index(), classes)
    y_resampled = X_resampled[idx_names].apply(tuple, axis=1).map(idx_to_y)

    # Formatting indexes
    X_resampled = X_resampled.set_index(idx_names)
    y_resampled.index = X_resampled.index

    return X_resampled, y_resampled

class XRandomOverSampler(RandomOverSampler):
    def __init__(self, class_col, sampling_strategy='auto'):
        super().__init__(sampling_strategy=sampling_strategy)  # pass the argument through rather than hard-coding 'auto'
        self.class_col = class_col

    def fit(self, X):
        classes = X.reset_index()[self.class_col]
        super().fit(X, classes)
        return self  # follow the sklearn convention of returning self

    def fit_resample(self, X, y):
        return custom_resampler_helper(X, y, self.class_col, super().fit_resample)

    def fit_sample(self, X, y):
        return self.fit_resample(X, y)

I then create a test pipeline like so:

import operators
from imblearn.pipeline import Pipeline

pipeline = Pipeline([
    ('xros', operators.XRandomOverSampler('month')),
    ('pandas_RF', operators.PandasRandomForestRegressor(n_estimators=100, n_jobs=-1))
])

Which works with the standard sklearn fit/predict

pipeline.fit(df_X, s_y)

df_pred = pipeline.predict(df_X)

However, it breaks with TPOT:

import numpy as np
from tpot import TPOTRegressor

regressor_config_dict = {
    # Regressors
    'operators.PandasRandomForestRegressor': {
        'n_estimators': [100],
        'max_features': np.arange(0.05, 1.01, 0.05),
        'min_samples_split': range(2, 21),
        'min_samples_leaf': range(1, 21),
        'bootstrap': [True, False]
    },

    # Preprocessors
    'operators.XRandomOverSampler': {
        'class_col': ['month']
    },
}

pipeline_optimizer = TPOTRegressor(generations=5, population_size=10, cv=3,
                                   random_state=42, verbosity=2, n_jobs=-1, 
                                   config_dict=regressor_config_dict, 
                                   template='XRandomOverSampler-PandasRandomForestRegressor')

pipeline_optimizer.fit(df_X, s_y)

For which I get this error

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
c:\path\to\tpot\tpot\base.py in fit(self, features, target, sample_weight, groups)
    742                     per_generation_function=self._check_periodic_pipeline,
--> 743                     log_file=self.log_file_
    744                 )

c:\path\to\tpot\tpot\gp_deap.py in eaMuPlusLambda(population, toolbox, mu, lambda_, cxpb, mutpb, ngen, pbar, stats, halloffame, verbose, per_generation_function, log_file)
    280         if per_generation_function is not None:
--> 281             per_generation_function(gen)
    282 

c:\path\to\tpot\tpot\base.py in _check_periodic_pipeline(self, gen)
   1052         """
-> 1053         self._update_top_pipeline()
   1054         if self.periodic_checkpoint_folder is not None:

c:\path\to\tpot\tpot\base.py in _update_top_pipeline(self)
    838                         break
--> 839                 raise RuntimeError('There was an error in the TPOT optimization '
    840                                    'process. This could be because the data was '

RuntimeError: There was an error in the TPOT optimization process. This could be because the data was not formatted properly, or because data for a regression problem was provided to the TPOTClassifier object. Please make sure you passed the data to TPOT correctly. If you enabled PyTorch estimators, please check the data requirements in the online documentation: https://epistasislab.github.io/tpot/using/