automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

How to weight a given class? [class balancing] #1596

Open UnixJunkie opened 1 year ago

UnixJunkie commented 1 year ago

Is it possible to give a list of weights that should be tried for a given class? I have some data where very heavy reweighting of the under-represented class is necessary to get any good classifier.

I don't know in advance which weight to use; apparently it depends on the ML method being used. So this is another hyperparameter that needs to be optimized.

aron-bram commented 1 year ago

Hi,

Unfortunately we do not provide a way to give a list of class weights to be tried out during the optimization process.

That said, by default auto-sklearn should handle imbalance in the dataset by also including estimators in the search that use sample/class weights, setting each weight to the inverse of the class's frequency (refer to Balancing for the implementation), similarly to how sklearn's "balanced" value for the class_weight parameter works with some estimators.
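For reference, the inverse-frequency heuristic can be sketched in a few lines of plain Python (the function name is illustrative; sklearn implements the same formula in `sklearn.utils.class_weight.compute_class_weight` with `class_weight="balanced"`):

```python
from collections import Counter

def inverse_frequency_weights(y):
    """Per-class weight = n_samples / (n_classes * class_count),
    the same formula sklearn uses for class_weight='balanced'."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# A 90/10 split: the minority class gets a 5x weight, fixed up front.
print(inverse_frequency_weights([0] * 90 + [1] * 10))  # {0: 0.555..., 1: 5.0}
```

Note that the resulting weights are a deterministic function of the class counts: they are not searched over during optimization.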

May I ask what performance you reached on this dataset using auto-sklearn, and how it compared to other methods?

In general, an alternative would be to oversample the under-represented class or to undersample the over-represented one. I'm not sure if this is a good enough option for you, though.
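A minimal sketch of random oversampling in plain Python (the function is illustrative; libraries such as imbalanced-learn provide more sophisticated strategies like SMOTE):

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate minority-class rows (sampling with replacement) until the
    class counts match. Assumes a binary problem where minority_label is
    actually the rarer class; a sketch, not a production resampler."""
    rng = random.Random(seed)
    minority = [(xi, yi) for xi, yi in zip(X, y) if yi == minority_label]
    majority_n = sum(1 for yi in y if yi != minority_label)
    extra = [rng.choice(minority) for _ in range(majority_n - len(minority))]
    X_out = list(X) + [xi for xi, _ in extra]
    y_out = list(y) + [yi for _, yi in extra]
    return X_out, y_out

X, y = [[i] for i in range(7)], [0] * 5 + [1] * 2
X_bal, y_bal = random_oversample(X, y, minority_label=1)
print(y_bal.count(0), y_bal.count(1))  # 5 5
```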

Or you could define your own custom metric in auto-sklearn that takes the class imbalance into account.
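As an illustration of such a metric, balanced accuracy averages per-class recalls and so is insensitive to imbalance. A self-contained sketch (sklearn ships this as `balanced_accuracy_score`, and auto-sklearn's `autosklearn.metrics.make_scorer` can wrap any function of this shape):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls: a majority-class-only classifier on a
    90/10 dataset scores 0.5 here, not 0.9 as with plain accuracy."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Predicting only the majority class gets recalls of 1.0 and 0.0.
print(balanced_accuracy([0, 0, 0, 1], [0, 0, 0, 0]))  # 0.5
```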

You may also be interested in defining your own balancing component (Extending Auto-Sklearn with Classification Component example)

I hope this helps; please feel free to follow up.

Let me know @eddiebergman if I forgot about something.

UnixJunkie commented 1 year ago

Class weight is just another hyperparameter that needs to be optimized for some datasets, with some ML methods (like SVM). Using the inverse of the class frequency is just an initial guess, sometimes very far from what optimization would give you.

auto-sklearn failed miserably on this dataset, while by hand I could optimize a model using liblinear (with very strong class weighting for the under-represented class). auto-sklearn's AUC was 0.5; mine was 0.58 (yes, it is a hard binary classification dataset).
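The hand-tuning described above can be sketched with sklearn alone: treat the minority-class weight as a hyperparameter and grid-search it for a liblinear-backed linear SVM. The weight grid and the synthetic data are illustrative assumptions, not the dataset from this issue:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic 95/5 imbalanced binary problem, stand-in for the real data.
X, y = make_classification(n_samples=400, weights=[0.95, 0.05], random_state=0)

# Search over increasingly strong weights for the minority class (label 1).
grid = GridSearchCV(
    LinearSVC(dual=False),  # liblinear-based linear SVM
    param_grid={"class_weight": [{1: w} for w in (1, 5, 10, 50, 100)]},
    scoring="roc_auc",  # AUC, which is insensitive to the class ratio
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

This is exactly the loop a user ends up writing by hand when the AutoML tool fixes the weights up front instead of searching over them.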

Resampling the classes doesn't help on this dataset either; I tried bagging for class balancing.

There are already metrics in there that take class imbalance into account (e.g. AUC is fine if you output probabilities).

FYI, caret allows users to pass a set of class weights to try to all methods that support class weights. Although caret doesn't do it quite right: the weight should be optimized like all the other hyperparameters, not scanned by the user.

aron-bram commented 1 year ago

We do realize that handling it as a hyperparameter would improve results on such extremely unbalanced datasets. It just hasn't been a priority for us, given the lack of such requests. But thank you for your suggestion; it indeed has the potential to improve the library.

We will consider adding this as a floating-point hyperparameter, which could be used by the Balancing class. However, I unfortunately cannot yet give you an exact date by which this feature will be included. Is this an urgent issue for you?
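A hypothetical sketch of what such a hyperparameter could look like in ConfigSpace, the library auto-sklearn uses to define its search space; the name "positive_class_weight" and its bounds are invented for illustration, not part of auto-sklearn's actual configuration space:

```python
from ConfigSpace import ConfigurationSpace
from ConfigSpace.hyperparameters import UniformFloatHyperparameter

cs = ConfigurationSpace()
# Log-scale search over the minority-class weight, defaulting to no
# reweighting; the optimizer would then tune it per pipeline.
cs.add_hyperparameter(
    UniformFloatHyperparameter(
        "positive_class_weight", lower=1.0, upper=1000.0,
        log=True, default_value=1.0,
    )
)
```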

If so, then you could implement your own balancing class as indicated at the bottom of my previous answer. This is far from optimal, but it should work. I can try to give you a hint soon on how to achieve this with a dummy implementation.

Thank you for your patience.

UnixJunkie commented 1 year ago

This is not urgent; auto-sklearn fails on this dataset, so I don't use it.