dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Information theoretic approach to avoid tuning hyperparameters #6394

Open Blunde1 opened 4 years ago

Blunde1 commented 4 years ago

Dear xgboost developers, I am "testing the waters" to see if some of my research is interesting and potentially could be implemented in xgboost either now or in the future.

TL;DR: Approximate test loss with an information-theoretic approach to avoid tuning hyperparameters. Theory: https://arxiv.org/abs/2008.05926 Implementation: https://github.com/Blunde1/agtboost. Still a work in progress, but I would appreciate feedback and to hear whether this is of interest for xgboost.

The idea An information criterion for the reduction in loss from the splits in gradient tree boosting. The criterion approximates the expected reduction in test loss, so the complexity of each tree can be determined more or less automatically, and the same criterion can also be used for early stopping.
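For intuition, here is a minimal sketch of how such a criterion changes the split decision (deliberately simplified; the actual formula is derived in the paper, and `optimism_estimate` is just a placeholder): the observed training-loss reduction is penalized by an estimate of its optimism, and the split is kept only if the adjusted reduction is still positive.

```python
# Hypothetical sketch of split gating with an information criterion.
# `optimism_estimate` stands in for the adjustment derived in the paper
# (https://arxiv.org/abs/2008.05926); the exact formula is not reproduced here.

def expected_test_reduction(train_reduction, optimism_estimate):
    """Training-loss reduction corrected for its expected optimism."""
    return train_reduction - optimism_estimate


def should_split(train_reduction, optimism_estimate):
    # Keep a candidate split only if it is expected to reduce *test* loss;
    # this replaces hand-tuned thresholds such as `gamma` or a hard `max_depth`.
    return expected_test_reduction(train_reduction, optimism_estimate) > 0.0
```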

How it works I am the main author of this paper: https://arxiv.org/abs/2008.05926 Algorithm 1 illustrates the differences from xgboost and the potential changes. Essentially, the profiling over different splits is viewed as an empirical process that is shown to converge to a known stochastic process (a CIR process). This is used to develop the information criterion and to "get around" the non-differentiability of splits. It alleviates the need for hyperparameters such as max_depth, gamma and num_round, which leads to large computational savings because far less tuning is needed.
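A rough sketch of how this could look over a whole run, under simplifying assumptions of my own (the `find_best_split`, `apply_split` and `residuals_of` callables are placeholders for the quantities the paper derives, not agtboost or xgboost internals): tree depth and the number of boosting rounds both fall out of the same expected-test-loss comparison, so neither needs to be supplied by the user.

```python
# Hypothetical, simplified boosting driver. find_best_split(node_data) is assumed
# to return (split, expected_test_gain), where expected_test_gain is the training
# reduction minus an optimism estimate (CIR-based in the paper, abstracted here).
# Note that no max_depth, gamma or num_round appears anywhere.

def grow_tree(node_data, find_best_split, apply_split):
    """Grow one tree; its depth is decided by the criterion, not by max_depth."""
    split, expected_test_gain = find_best_split(node_data)
    if expected_test_gain <= 0.0:        # no split is expected to generalize
        return {"leaf": node_data}
    left, right = apply_split(node_data, split)
    return {
        "split": split,
        "left": grow_tree(left, find_best_split, apply_split),
        "right": grow_tree(right, find_best_split, apply_split),
    }


def boost(data, residuals_of, find_best_split, apply_split, learning_rate=0.1):
    """Add trees until even the root split of a new tree stops paying off."""
    ensemble = []
    while True:
        node_data = residuals_of(data, ensemble)  # gradients w.r.t. current model
        tree = grow_tree(node_data, find_best_split, apply_split)
        if "leaf" in tree:               # not even the root split generalizes:
            return ensemble              # stop boosting (replaces num_round)
        ensemble.append((learning_rate, tree))
```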

Implementation I have already implemented it in the following repository: https://github.com/Blunde1/agtboost and it works rather well. The repo is also convenient for testing new ideas on the theoretical side. However, it currently does not scale nearly as well as xgboost, due to limitations of the author's implementation (not of the theory, which should in principle scale). Essentially, two things would change in xgboost: the stopping rule for splits within a tree (making max_depth and gamma redundant) and the stopping rule for the number of boosting rounds (making num_round redundant); see the sketch of the gain computation below.
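For reference, this is the split gain xgboost computes today for the exact greedy algorithm (notation as in the xgboost paper); the proposal would, roughly speaking, replace the fixed gamma penalty in the final comparison with a data-driven optimism estimate.

```python
# xgboost's exact-greedy split gain (notation from the xgboost paper):
# G/H are sums of first/second-order gradients, lam is the L2 term `lambda`.
def xgb_split_gain(G_left, H_left, G_right, H_right, lam, gamma):
    def score(G, H):
        return G * G / (H + lam)
    gain = 0.5 * (score(G_left, H_left) + score(G_right, H_right)
                  - score(G_left + G_right, H_left + H_right))
    return gain - gamma   # today: split if this is > 0, with gamma user-tuned

# Under the proposal (hypothetical placement, not an existing xgboost code path):
# the fixed `gamma` penalty would be replaced by a data-driven optimism estimate,
# so the same comparison becomes roughly
#     gain - optimism_estimate(split_profile) > 0
# with no threshold hyperparameter left to tune.
```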

Limitations Only deterministic, un-regularized, exact gradient tree boosting is currently covered by the information criterion (this was hard enough initially...). This means that L1-L2 regularization, histogram algorithms, and row and column subsampling are not tuned or accounted for by the information criterion. The information criterion is essentially a bound, and can in some cases lead to slightly conservative models. That being said, I am hoping that papers on incorporating "prior" beliefs (L1-L2 regularization) and on the histogram algorithm are right around the corner. These would imply that lambda and alpha could be tuned automatically, and that the methodology would be compatible with the different values of tree_method.
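To make the current scope concrete in terms of standard xgboost parameters (the covered/not-covered split below simply restates the limitations above):

```python
# Standard xgboost parameters, annotated with what the criterion would cover.
params = {
    # Would become redundant (tree complexity and stopping decided automatically):
    "max_depth": 6,
    "gamma": 0.0,
    # plus num_boost_round, which is passed to xgb.train()/xgb.cv() directly.

    # Not yet covered by the criterion (still user-tuned):
    "lambda": 1.0,            # L2 regularization
    "alpha": 0.0,             # L1 regularization
    "tree_method": "exact",   # the criterion currently assumes exact splits, not "hist"
    "subsample": 1.0,         # row subsampling not considered
    "colsample_bytree": 1.0,  # column subsampling not considered
}

# Typical call today (choosing the values above would still require CV):
# import xgboost as xgb
# booster = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=500)
```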

The dream Fully automated gradient tree boosting, where today's hyperparameters are either tuned automatically or made redundant. Users should obtain a near-optimal model instantly, without requiring any tuning or cross-validation.

What I am asking I would greatly appreciate feedback on the above, and on whether this could be interesting for xgboost either now or in the future.

trivialfis commented 4 years ago

Thanks for raising the issue! We will look deeper into all materials you have provided. ;-)

deadsoul44 commented 4 months ago

Shameless plug: https://github.com/perpetual-ml/perpetual