Innixma opened this issue 3 years ago
Hi, does AutoGluon currently handle imbalanced datasets automatically? I have tested it on imbalanced datasets and the performance is already quite good.
@yxxan It does not by default; however, in v0.3 we added support for class imbalance handling via the sample_weight parameter, which can be specified during predictor init: https://auto.gluon.ai/stable/api/autogluon.task.html#module-0
predictor = TabularPredictor(..., sample_weight='balance_weight')
This is not the same functionality as is mentioned in this GitHub issue, but it may be useful to you.
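To make the one-liner above concrete, here is a minimal end-to-end sketch; the CSV path and the 'class' label column are placeholders, not something from this thread:

```python
from autogluon.tabular import TabularDataset, TabularPredictor

# Hypothetical imbalanced binary-classification dataset (placeholder path and label column).
train_data = TabularDataset('train.csv')

# 'balance_weight' asks AutoGluon to weight rows inversely to their class frequency,
# so minority-class errors carry more weight during training.
predictor = TabularPredictor(label='class', sample_weight='balance_weight')
predictor.fit(train_data, time_limit=600)
```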
Does sample_weight only work for Tabular? Is there a similar option for TextPredictor?
@rxjx Only for Tabular at present.
Can we also play with weighted losses like the focal loss?
@Jalagarto We don't currently support customized losses; however, if you were to implement support for them on a model-by-model basis, we would be happy to accept contributions.
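For anyone exploring such a contribution, here is a minimal, framework-level sketch of binary focal loss in PyTorch. This is not an AutoGluon API; wiring it into a specific model is exactly the per-model work a contribution would need to cover:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy examples so training
    focuses on hard / minority-class samples."""
    # Per-sample cross-entropy, kept unreduced so we can reweight it.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing factor
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```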
Maybe handling of imbalanced data could be combined with https://github.com/awslabs/autogluon/issues/1672 (subsampling for large datasets). In the spirit of AutoML, the user shouldn't need to know about under- or over-sampling (neither how to do it, nor that it even exists), nor about time or memory constraints that necessitate undersampling.
I'm looking at https://imbalanced-learn.org/, which has a compatible MIT license. It does over- and under-sampling, but it has a lot of dependencies. It would solve imbalanced data and over-large data in one package. Once integrated, it would just be a matter of choosing appropriate heuristics.
One concern is that the under-sampling techniques in https://imbalanced-learn.org/ rely on things like KNN, which sounds problematic for large datasets. Of course, there's always random undersampling, but it would be nice to do better than that if time and space budgets allow.
Any thoughts on the suitability of https://imbalanced-learn.org/ ?
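For reference, plain random under-sampling with imbalanced-learn looks like this; it is a sketch of that library's public API on toy data, not of any AutoGluon integration:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Toy 90/10 imbalanced dataset (placeholder data, not from this thread).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Random under-sampling: drop majority-class rows until classes are balanced.
rus = RandomUnderSampler(random_state=0)
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))
```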
I've tested imbalanced-learn in the past for real-life use cases: it worked great. It was designed by a student of Gaël Varoquaux who later joined the scikit-learn team, so it's pretty reliable considering how serious scikit-learn is.
predictor = TabularPredictor(..., sample_weight='balance_weight')
Hello,
This is what I am trying to do to solve a binary classification problem in AG (the imbalance is 90/10), and my confusion matrix looks like this (which makes me think something is wrong; am I using it the wrong way?)
Thank you. I have one question: how do I use BalancedRandomForestClassifier?
I am using the code below for focal loss:
predictor = TabularPredictor(
    label=target,
    path=results_path,
    problem_type=problem_type,
    # sample_weight='balance_weight'
)
predictor = predictor.fit(
    train_df,
    time_limit=TIME_LIMIT,
    presets=PRESETS,  # high_quality / best_quality can also be used
    auto_stack=AUTO_STACK,  # set to True for more accuracy (takes more time)
    hyperparameters={
        "optimization.loss_function": "focal_loss",
        "optimization.focal_loss.alpha": weights,
        "optimization.focal_loss.gamma": 1.0,
        "optimization.focal_loss.reduction": "sum",
        "optimization.max_epochs": 10,
    },
)
I am getting this error:
PicklingError: Can't pickle <function accuracy_score at 0x7fea00bdb1c0>: it's not the same object as sklearn.metrics._classification.accuracy_score
sklearn version: 1.4.0
autogluon version: 1.1.1
Reference
One way to handle imbalanced classes via downsampling in AutoGluon with bagging would be for each bagging fold to consist of: a randomly under-sampled subset of the majority class (drawn without replacement) + standard bootstrapped samples of the minority class (or perhaps just all samples from the minority class). This aims to overcome the traditional downside of under-sampling, namely that it discards most of the data: here, the majority-class data are instead mostly distributed across different bagging folds, with undersampling in each fold. The idea is implemented in the following imbalanced-learn estimators (a standalone sketch follows the list):
BalancedRandomForestClassifier
BalancedBaggingClassifier
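A minimal standalone sketch of both estimators on toy data (scikit-learn-style usage, not an AutoGluon integration); each bag/bootstrap is re-balanced by randomly under-sampling the majority class, so no single base model sees the full skew while the ensemble still covers most of the majority-class rows:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.ensemble import BalancedBaggingClassifier, BalancedRandomForestClassifier

# Toy imbalanced dataset (placeholder data, not from this thread).
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Both estimators under-sample the majority class independently for each bag/tree.
bbc = BalancedBaggingClassifier(n_estimators=50, random_state=0)
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)

for model in (bbc, brf):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, model.score(X_te, y_te))
```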