dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Information theoretic approach to avoid tuning hyperparameters #6394

Open Blunde1 opened 3 years ago

Blunde1 commented 3 years ago

Dear xgboost developers, I am "testing the waters" to see if some of my research is interesting and could potentially be implemented in xgboost, either now or in the future.

TLDR: Approximate the test loss with an information-theoretic approach to avoid tuning hyperparameters. Theory: https://arxiv.org/abs/2008.05926 Implementation: https://github.com/Blunde1/agtboost This is still work in progress, but I would appreciate feedback and to hear whether it is interesting for xgboost.

The idea: An information criterion for the reduction in loss of the splits in gradient tree boosting. The information criterion allows approximating the expected reduction in test loss. Thus, the complexity of each tree can be found rather automatically, and it may also be used for early stopping.
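To make the shape of that decision rule concrete, here is a minimal sketch under my own assumptions; it is not the exact criterion from the paper, and `estimated_optimism` is a hypothetical placeholder for the correction the paper derives.

```python
# Minimal sketch (not the paper's exact criterion): a split is kept only while
# the training gain, corrected for an estimated optimism term, is expected to
# reduce *test* loss. estimated_optimism is a hypothetical placeholder for the
# CIR-based correction derived in arXiv:2008.05926.

def expected_test_loss_reduction(train_gain: float, estimated_optimism: float) -> float:
    return train_gain - estimated_optimism

def keep_split(train_gain: float, estimated_optimism: float) -> bool:
    # Replaces max_depth / gamma style constraints: a tree stops growing on
    # its own once no split is expected to generalize.
    return expected_test_loss_reduction(train_gain, estimated_optimism) > 0.0
```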

How it works: I am the main author of this paper: https://arxiv.org/abs/2008.05926 Algorithm 1 illustrates the differences from xgboost and the potential changes. Essentially, the profiling over different splits is viewed as an empirical process that is shown to converge to a known stochastic process (CIR). This is utilized in developing the information criterion, and in "getting around" the non-differentiability of splits. This alleviates the need for hyperparameters such as max_depth, gamma and num_round, which leads to large computational benefits due to less tuning.
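For intuition on why such a correction is needed, here is a small self-contained illustration of my own (not taken from the paper): even when a feature carries no signal, the maximum exact-greedy training gain over all candidate splits is strictly positive, so the raw gain systematically over-estimates the reduction in test loss. The maximum of this profiling process is the kind of quantity the paper characterizes via the CIR limit.

```python
# Monte Carlo illustration (assumed, not from the paper) of split optimism:
# on pure noise the best split still shows a clearly positive training gain.
import numpy as np

rng = np.random.default_rng(0)

def max_split_gain(n: int) -> float:
    """Max exact-greedy gain (squared loss, no regularization) on pure noise."""
    x = rng.normal(size=n)          # feature with no predictive value
    g = rng.normal(size=n)          # gradients of a pure-noise response
    h = np.ones(n)                  # hessians for squared loss
    order = np.argsort(x)
    g, h = g[order], h[order]
    G, H = g.sum(), h.sum()
    GL, HL = np.cumsum(g)[:-1], np.cumsum(h)[:-1]   # left-child sums per split
    GR, HR = G - GL, H - HL
    gains = 0.5 * (GL**2 / HL + GR**2 / HR - G**2 / H)
    return gains.max()

print(np.mean([max_split_gain(1000) for _ in range(200)]))  # clearly > 0
```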

Implementation: I have already implemented it in the following repository: https://github.com/Blunde1/agtboost and it works rather well. This repo also works nicely for testing new innovations on the theoretical side. However, it currently does not scale nearly as well as xgboost, due to limitations of the author's implementation (and not of the theory, which should in principle scale). Essentially, two things would change in xgboost: the stopping rule for growing each tree (replacing max_depth and gamma) and the stopping rule for the number of boosting rounds (replacing num_round).
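As a point of reference for the tuning burden this would remove, below is an assumed, typical xgboost workflow (my own illustration, not taken from the issue) in which exactly those hyperparameters are grid-searched; under the proposal, max_depth, gamma and the number of rounds would be determined by the criterion, leaving essentially the learning rate.

```python
# Illustration of the current tuning burden using the standard xgboost
# scikit-learn API; the data and grid are arbitrary placeholders.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((500, 10))
y = X[:, 0] + rng.normal(scale=0.1, size=500)

grid = {
    "max_depth": [2, 4, 6],      # tree complexity: found automatically instead
    "gamma": [0.0, 0.1, 1.0],    # min split loss: replaced by the criterion
    "n_estimators": [100, 300],  # number of trees: replaced by early stopping
}
search = GridSearchCV(
    xgb.XGBRegressor(learning_rate=0.1, tree_method="exact"),
    grid, cv=5, scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```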

Limitations: Only deterministic, un-regularized, exact gradient tree boosting is currently considered for the information criterion (this was hard enough initially...). This means that L1-L2 regularization, the histogram algorithm, and row and column subsampling are not tuned and/or considered by the information criterion. The information criterion is essentially a bound, and can in some cases lead to slightly conservative models. That being said, I am hoping that papers on both "prior" beliefs (L1-L2 regularization) and the histogram algorithm are right around the corner. This would imply that lambda and alpha would be tuned automatically, and that the methodology would be compatible with different values of tree_method.
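To summarize the scope as I read it, here is a rough grouping (my own summary, not an official mapping) of which xgboost parameters the current criterion addresses and which it does not yet cover.

```python
# Rough grouping based on the limitations described above.
made_redundant = ["max_depth", "gamma", "num_boost_round"]  # handled by the criterion

not_yet_covered = {
    "reg_lambda": "L2 regularization",
    "reg_alpha": "L1 regularization",
    "tree_method": "only 'exact' is analyzed; 'hist'/'approx' are not",
    "subsample": "row subsampling",
    "colsample_bytree": "column subsampling",
}
```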

The dream: Fully automated gradient tree boosting, where the hyperparameters of today are either automatically tuned or made redundant. Users should obtain a rather optimal model instantly, without requiring any tuning or CV.

What I am asking: I would greatly appreciate feedback on the approach above, and on whether it would be interesting for xgboost either now or in the future.

trivialfis commented 3 years ago

Thanks for raising the issue! We will look deeper into all materials you have provided. ;-)

deadsoul44 commented 1 month ago

Shameless plug: https://github.com/perpetual-ml/perpetual