Dear xgboost developers, I am "testing the waters" to see whether some of my research is interesting and could potentially be implemented in xgboost, either now or in the future.

TL;DR: Approximate the test loss with an information-criterion approach to avoid tuning hyperparameters. Theory: https://arxiv.org/abs/2008.05926. Implementation: https://github.com/Blunde1/agtboost. Still a work in progress, but I would appreciate feedback, and to hear whether it is interesting for xgb.
The idea
An information criterion for the reduction in loss from the splits in gradient tree boosting. The criterion approximates the expected reduction in test loss, so the complexity of each tree can be determined more or less automatically, and it can also be used for early stopping.
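To make this concrete, here is a minimal sketch (my own illustration, not the paper's actual formulas): a split or an extra tree is kept only while its estimated expected reduction in test loss stays positive, where that estimate is the training-loss reduction minus an optimism term that the information criterion would supply. The function names are hypothetical.

```python
# Minimal sketch of the idea; `optimism` stands in for the quantity the
# information criterion of https://arxiv.org/abs/2008.05926 would supply.

def expected_test_reduction(train_reduction: float, optimism: float) -> float:
    """Estimated expected reduction in test loss for one candidate action
    (accepting a split, or adding one more tree)."""
    return train_reduction - optimism

def keep_splitting(train_gain: float, optimism: float) -> bool:
    # Decides tree complexity; plays the role of max_depth / gamma.
    return expected_test_reduction(train_gain, optimism) > 0.0

def keep_boosting(tree_gain: float, optimism: float) -> bool:
    # Decides the number of trees; plays the role of num_round / early stopping.
    return expected_test_reduction(tree_gain, optimism) > 0.0
```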
How it works
I am the main author of this paper: https://arxiv.org/abs/2008.05926. Algorithm 1 illustrates the difference from xgboost and the potential changes. Essentially, the profiling of the loss reduction over candidate splits is viewed as an empirical process, which is shown to converge to a known stochastic process (CIR). This is used both to develop the information criterion and to "get around" the non-differentiability of splits. It removes the need for hyperparameters such as `max_depth`, `gamma` and `num_round`, which leads to large computational savings because much less tuning is needed.
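As a rough illustration of the "profiling over splits" step under the usual second-order loss approximation: the gain formula below is the standard exact-greedy one, while the CIR-based optimism adjustment itself is only a placeholder argument here, not the paper's actual expression.

```python
import numpy as np

def split_gain(G_L, H_L, G_R, H_R, lam=0.0):
    """Standard second-order loss reduction for a split (exact greedy),
    written without a gamma penalty, since the information criterion is
    meant to replace that penalty."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R))

def gain_profile(x, g, h):
    """Gain profiled over all candidate split points of one feature.
    The paper treats this profile as an empirical process whose maximum
    converges to that of a CIR process; this function only computes the profile."""
    order = np.argsort(x)
    g, h = g[order], h[order]
    G, H = g.sum(), h.sum()
    G_L, H_L = np.cumsum(g)[:-1], np.cumsum(h)[:-1]
    return split_gain(G_L, H_L, G - G_L, H - H_L)

def accept_best_split(profile, optimism):
    """Hypothetical acceptance rule: keep the maximising split only if the
    maximal gain exceeds the CIR-derived optimism estimate (placeholder)."""
    best = int(np.argmax(profile))
    return profile[best] - optimism > 0.0, best
```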
Implementation
I have already implemented it in the following repository: https://github.com/Blunde1/agtboost, and it works rather well. The repo is also a nice place to test new ideas on the theoretical side. However, it currently does not scale nearly as well as xgboost, due to limitations of the author's implementation (not of the theory, which should in principle scale).
Essentially, two things would change in xgboost (see the sketch after this list):
1. Instead of using a fixed number of boosting rounds, evaluate inequality (29) in the paper.
2. Instead of using a regularization criterion to decide splits, evaluate inequality (28) in the paper.
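A sketch of where the two checks could hook into the boosting loop (this is not xgboost's actual internal structure; the two criterion callables are placeholders for inequalities (28) and (29) of the paper):

```python
from typing import Callable, List, Tuple

def boost(fit_tree: Callable[[], Tuple[object, object]],
          round_criterion: Callable[[object], bool],
          max_rounds: int = 100_000) -> List[object]:
    """Sketch of the outer loop.

    `fit_tree` is expected to grow one tree, accepting each split only while
    inequality (28) holds (this replaces gamma / max_depth), and to return the
    tree together with whatever statistics inequality (29) needs.
    `round_criterion` evaluates inequality (29) and replaces a fixed num_round;
    `max_rounds` is only a safety cap. Both callables are placeholders.
    """
    trees: List[object] = []
    for _ in range(max_rounds):
        tree, stats = fit_tree()
        if not round_criterion(stats):
            break
        trees.append(tree)
    return trees
```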
Limitations
Only deterministic, un-regularized, exact gradient tree boosting is currently covered by the information criterion (this was hard enough initially...). This means that L1/L2 regularization, histogram algorithms, and row and column subsampling are not tuned and/or accounted for by the information criterion. The information criterion is essentially a bound, and can in some cases lead to slightly conservative models. That being said, I am hoping that papers covering both "prior" beliefs (L1/L2) and the histogram algorithm are right around the corner. This would imply that `lambda` and `alpha` would be tuned automatically, and that the methodology would be compatible with different values of `tree_method`.
The dream
Fully automated gradient tree boosting, where today's hyperparameters are either tuned automatically or made redundant.
Users should obtain a near-optimal model immediately, without requiring any tuning or cross-validation.
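For illustration, a sketch of how usage could look. The first call is ordinary xgboost usage today; the "dream" call at the end is hypothetical and not an existing xgboost option.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=1000)
dtrain = xgb.DMatrix(X, label=y)

# Today: tree complexity and the number of rounds typically come from tuning/CV.
params = {"max_depth": 6, "gamma": 0.1, "eta": 0.1}        # values found via CV
booster = xgb.train(params, dtrain, num_boost_round=500)   # also found via CV

# The dream (hypothetical, not an existing xgboost option): the information
# criterion sizes each tree and stops boosting by itself, so max_depth, gamma
# and num_boost_round would not have to be specified or tuned, e.g.
# booster = xgb.train({"eta": 0.1}, dtrain)
```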
What I am asking
I would greatly appreciate feedback on the following:
Is this interesting at all?
If so, is it interesting now or in the future?
Is it so interesting that it could be implemented in xgboost?
Or does the theory need to be developed further before implementation in xgboost should be considered?