autogluon / autogluon

Fast and Accurate ML in 3 Lines of Code
https://auto.gluon.ai/
Apache License 2.0

Improved handling of imbalanced classes #1254

Open Innixma opened 3 years ago

Innixma commented 3 years ago

One way to handle imbalanced classes via downsampling in AutoGluon with bagging would be for each bagging fold to consist of a randomly under-sampled subset of the majority class (drawn without replacement) plus standard bootstrapped samples of the minority class (or perhaps simply all samples of the minority class). This aims to overcome the main downside of under-sampling, namely that it traditionally discards most of the data: here, the majority-class data are instead mostly distributed across the different bagging folds, with under-sampling within each fold (a rough sketch of the idea follows the list below). The scheme is implemented in:

BalancedRandomForestClassifier

BalancedBaggingClassifier
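
For illustration, a minimal numpy sketch of the per-fold sampling scheme described above; the helper name balanced_bag_indices is hypothetical, not an existing AutoGluon or imbalanced-learn API:

import numpy as np

def balanced_bag_indices(y, n_folds, seed=0):
    # Hypothetical helper: build one index set per bagging fold by pairing
    # a without-replacement slice of the majority class with a bootstrap of
    # the minority class, so majority rows are spread across folds rather
    # than discarded outright.
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    min_idx = np.flatnonzero(y == minority)
    maj_idx = rng.permutation(np.flatnonzero(y == majority))
    folds = []
    for maj_chunk in np.array_split(maj_idx, n_folds):
        boot_min = rng.choice(min_idx, size=min_idx.size, replace=True)
        folds.append(np.concatenate([maj_chunk, boot_min]))
    return folds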

yxxan commented 2 years ago

Hi, does AutoGluon currently handle imbalanced datasets automatically? I have tested it on imbalanced datasets and the performance is already quite good.

Innixma commented 2 years ago

@yxxan It does not by default; however, in v0.3 we added support for class-imbalance handling via the sample_weight parameter, which can be specified during predictor init: https://auto.gluon.ai/stable/api/autogluon.task.html#module-0

predictor = TabularPredictor(..., sample_weight='balance_weight')

This is not the same functionality as is mentioned in this GitHub issue, but it may be useful to you.
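
Expanded into a runnable sketch (the label column name 'class' and the DataFrame train_data are assumptions for illustration):

from autogluon.tabular import TabularPredictor

# 'balance_weight' makes AutoGluon apply sample weights that balance the
# classes, so minority-class errors count proportionally more in training.
predictor = TabularPredictor(label='class', sample_weight='balance_weight')
predictor.fit(train_data)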

rxjx commented 2 years ago

Does sample_weight only work for Tabular? Is there a similar option for TextPredictor?

Innixma commented 2 years ago

@rxjx Only for Tabular at present.

Jalagarto commented 2 years ago

Can we also play with weighted losses like the focal loss?

Innixma commented 2 years ago

@Jalagarto We don't currently support custom losses; however, if you were to implement support for them on a model-by-model basis, we would be happy to accept contributions.

willsmithorg commented 1 year ago

Maybe handling of imbalanced data could be combined with https://github.com/awslabs/autogluon/issues/1672 (subsampling for large datasets). In the spirit of AutoML, the user shouldn't need to know about under- or over-sampling (neither how to do it nor that it even exists), nor about the time or memory constraints that necessitate undersampling.

I'm looking at https://imbalanced-learn.org/ which has a compatible MIT license. It does both over- and under-sampling, but it has a lot of dependencies. It would solve imbalanced data and over-large data in one package; once integrated, it would just be a matter of choosing the right heuristics.

One concern is that the under-sampling techniques mentioned at https://imbalanced-learn.org/ involve things like KNN, which sounds problematic for large datasets. Of course, there's always random under-sampling, but it would be nice to do better than that when the time and space budgets allow.

Any thoughts on the suitability of https://imbalanced-learn.org/?
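
For reference, the basic resampling calls in imbalanced-learn look like this (a sketch, assuming an existing feature matrix X and label vector y):

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Under-sampling: drop majority-class rows until the classes are balanced.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)

# Over-sampling: duplicate minority-class rows instead of dropping data.
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)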

jqueguiner commented 1 year ago

I've tested imbalanced-learn in the past for real-life use cases: it worked great. It was designed by a student of Gaël Varoquaux who later joined the scikit-learn team, so it's pretty reliable considering how serious scikit-learn is.

polina-l-1 commented 1 year ago
predictor = TabularPredictor(..., sample_weight='balance_weight')

Hello,

This is what I am trying to do to solve a binary classification problem in AutoGluon (the imbalance is 90-10), and my confusion matrix looks like this, which makes me think something is wrong; am I perhaps using it the wrong way? [confusion matrix screenshot]

ershang2 commented 4 months ago

Thank you. I have one question: how do I use BalancedRandomForestClassifier and BalancedBaggingClassifier?
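
For context, both classes live in imbalanced-learn rather than AutoGluon and follow the usual scikit-learn estimator interface; a minimal sketch, assuming a feature matrix X and labels y:

from imblearn.ensemble import BalancedBaggingClassifier, BalancedRandomForestClassifier

# Random forest that under-samples the majority class in each bootstrap sample.
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
brf.fit(X, y)

# Bagging ensemble (decision trees by default) that balances each bag
# before fitting the base estimator.
bbc = BalancedBaggingClassifier(n_estimators=10, random_state=0)
bbc.fit(X, y)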

Vikram12301 commented 1 month ago

I am using the code below for focal loss:

predictor = TabularPredictor(
    label=target,
    path=results_path,
    problem_type=problem_type,
    # sample_weight='balance_weight'
)

predictor.fit(
    train_df,
    time_limit=TIME_LIMIT,
    presets=PRESETS,  # 'high_quality' or 'best_quality' can also be used
    auto_stack=AUTO_STACK,  # set to True for more accuracy (takes more time)
    hyperparameters={
        "optimization.loss_function": "focal_loss",
        "optimization.focal_loss.alpha": weights,
        "optimization.focal_loss.gamma": 1.0,
        "optimization.focal_loss.reduction": "sum",
        "optimization.max_epochs": 10,
    },
)

I am getting this error:

PicklingError: Can't pickle <function accuracy_score at 0x7fea00bdb1c0>: it's not the same object as sklearn.metrics._classification.accuracy_score

sklearn version: 1.4.0
autogluon version: 1.1.1
Reference