arogozhnikov / hep_ml

Machine Learning for High Energy Physics.
https://arogozhnikov.github.io/hep_ml/

Porting loss function to XGBoost #76

Open kawaho opened 2 years ago

kawaho commented 2 years ago

Hi authors of hep_ml, I am wondering if there is an easy way to use the loss functions from this package (in particular the BinFlatnessLossFunction) in XGBoost, since XGBoost supports custom loss functions in the typical grad, hess format (https://xgboost.readthedocs.io/en/stable/tutorials/custom_metric_obj.html). This could help improve training speed, since hep_ml does not support multithreading (please correct me if I am wrong).
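For reference, a minimal custom objective in that grad/hess format (a toy squared-error loss just to illustrate the signature, not the flatness loss) would look something like:

```python
import numpy as np
import xgboost as xgb

def squared_error_obj(preds, dtrain):
    """Toy objective: per-row gradient and Hessian of 0.5 * (preds - y)^2."""
    y = dtrain.get_label()
    grad = preds - y              # first derivative w.r.t. the raw prediction
    hess = np.ones_like(preds)    # second derivative is constant here
    return grad, hess

dtrain = xgb.DMatrix(np.random.rand(100, 3), label=np.random.randint(0, 2, 100))
booster = xgb.train({'max_depth': 2}, dtrain, num_boost_round=5, obj=squared_error_obj)
```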
Thanks, Andy

arogozhnikov commented 2 years ago

It should be possible. Just try and see.

hep_ml has a more general loss format, see here: https://github.com/arogozhnikov/hep_ml/blob/master/hep_ml/losses.py#L88-L138

Within xgboost you need init, fit, and prepare_tree_params; see the sketch below.

The difference from other methods is its ability to remember additional characteristics of the observations (such as control variables). Most loss functions I'm aware of ignore this possibility: they assume that the loss for each observation does not depend on the others. So, depending on the implementation in xgboost (i.e. whether it preserves the order of observations on each call), you can just init & fit outside of xgboost, then wrap prepare_tree_params and pass it to xgboost as the loss.
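Something like this (an untested sketch: the feature names 'f0', 'f1', 'mass' are invented for illustration; I call negative_gradient directly rather than prepare_tree_params, since xgboost wants a gradient rather than a tree target; and because the flatness losses don't provide a Hessian, a unit Hessian is assumed, which turns xgboost's Newton step into plain gradient descent):

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from hep_ml.losses import BinFlatnessLossFunction

# Toy data: two training features plus a control variable 'mass'
# in which the classifier output should stay flat (names invented).
rng = np.random.RandomState(42)
n = 10000
X = pd.DataFrame({'f0': rng.normal(size=n),
                  'f1': rng.normal(size=n),
                  'mass': rng.uniform(0, 1, size=n)})
y = np.array(X['f0'] + rng.normal(scale=0.5, size=n) > 0, dtype=int)

# init & fit the hep_ml loss once, outside of xgboost.
loss = BinFlatnessLossFunction(uniform_features=['mass'], uniform_label=0)
loss.fit(X, y, sample_weight=np.ones(n))

def flatness_obj(preds, dtrain):
    # xgboost passes raw scores for all rows in DMatrix order,
    # which matches the order the loss was fit on (no shuffling).
    grad = -loss.negative_gradient(preds)  # hep_ml returns the negative gradient
    hess = np.ones_like(preds)             # assumption: no Hessian exposed, use 1
    return grad, hess

# Train on f0, f1 only; 'mass' is used solely inside the loss.
dmat = xgb.DMatrix(X[['f0', 'f1']], label=y)
booster = xgb.train({'eta': 0.1, 'max_depth': 4}, dmat,
                    num_boost_round=100, obj=flatness_obj)
```

Building the DMatrix from the same, unshuffled DataFrame that was used for loss.fit is what keeps the observation order consistent between the loss and the objective calls.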

That said, I'd start by checking that you're really bottlenecked by tree building and not by loss computation (flatness computation is rather resource-consuming). If the loss is the bottleneck, you'll see no benefit from moving to xgboost.
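Continuing the sketch above (reusing loss, dmat, flatness_obj and n from it), a rough timing check could look like:

```python
import time

preds = np.zeros(n)
t0 = time.perf_counter()
loss.negative_gradient(preds)   # one flatness-gradient evaluation
t1 = time.perf_counter()
xgb.train({'max_depth': 4}, dmat, num_boost_round=1, obj=flatness_obj)
t2 = time.perf_counter()
print(f"loss gradient: {t1 - t0:.3f}s, full boosting round: {t2 - t1:.3f}s")
```

If the gradient evaluation dominates the boosting round, tree building is not the bottleneck and a faster tree builder won't help.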