ATOMScience-org / AMPL

The ATOM Modeling PipeLine (AMPL) is an open-source, modular, extensible software pipeline for building and sharing models to advance in silico drug discovery.
MIT License
136 stars 67 forks

xgboost and RF models should support class balancing weights in loss function #318

Open mcloughlin2 opened 4 months ago

mcloughlin2 commented 4 months ago

The xgboost, RF and NN models each have their own way of handling imbalanced classification datasets through class-specific weights in the loss function, but we currently support this only for NN models, by setting the weight_transform_type parameter to 'balancing'. We should add this capability for random forest and xgboost models as well. For RF models, this means setting the class_weight parameter to 'balanced' when we create the RandomForestClassifier. For xgboost models, it means setting the scale_pos_weight parameter to sum(negative instances) / sum(positive instances).
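To make the two weighting schemes concrete, here is a minimal sketch in plain Python of the values involved. The helper names balanced_class_weights and scale_pos_weight_ratio are hypothetical and are not part of AMPL. The first follows the n_samples / (n_classes * class_count) heuristic that scikit-learn documents for class_weight='balanced'; the second is the negative-to-positive ratio described above, assuming binary labels with 1 as the positive class.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Per-class weights using n_samples / (n_classes * class_count),
    the heuristic scikit-learn documents for class_weight='balanced'.
    (Hypothetical helper, for illustration only.)"""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def scale_pos_weight_ratio(labels, positive=1):
    """Ratio of negative to positive instances for binary labels.
    (Hypothetical helper, for illustration only.)"""
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("dataset contains no positive instances")
    return n_neg / n_pos


# Toy imbalanced dataset: 90 negatives, 10 positives.
y = [0] * 90 + [1] * 10

rf_weights = balanced_class_weights(y)   # minority class gets the larger weight
xgb_ratio = scale_pos_weight_ratio(y)    # 90 / 10

print(rf_weights)
print(xgb_ratio)
```

With scikit-learn and xgboost installed, the equivalent settings would be RandomForestClassifier(class_weight='balanced') and XGBClassifier(scale_pos_weight=xgb_ratio). How AMPL exposes these through its own model parameters is not described in this issue.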

mcloughlin2 commented 4 months ago

Implemented; changes pushed to branch 'sparsity'.