
<img src="https://github.com/fabsig/GPBoost/blob/master/docs/logo/gpboost_logo.png?raw=true" alt="GPBoost icon" align = "right" width="30%" />

GPBoost: Combining Tree-Boosting with Gaussian Process and Mixed Effects Models

Table of Contents

  1. Introduction
  2. Modeling background
  3. News
  4. Open issues - contribute
  5. References
  6. License

Introduction

GPBoost is a software library for combining tree-boosting with Gaussian process and grouped random effects models (aka mixed effects models or latent Gaussian models). It also allows for independently applying tree-boosting as well as Gaussian process and (generalized) linear mixed effects models (LMMs and GLMMs). The GPBoost library is written predominantly in C++, has a C interface, and is available as both a Python package and an R package.

For more information, you may want to have a look at:

Modeling background

The GPBoost algorithm combines tree-boosting with latent Gaussian models such as Gaussian process (GP) and grouped random effects models. This makes it possible to leverage the advantages and remedy the drawbacks of both tree-boosting and latent Gaussian models; see below for a list of strengths and weaknesses of these two modeling approaches. The GPBoost algorithm can be seen as a generalization of both traditional (generalized) linear mixed effects and Gaussian process models and classical independent tree-boosting (which often has the highest prediction accuracy for tabular data).

Advantages of the GPBoost algorithm

Compared to (generalized) linear mixed effects and Gaussian process models, the GPBoost algorithm allows for

Compared to classical independent boosting, the GPBoost algorithm allows for

Modeling details

For Gaussian likelihoods (GPBoost algorithm), it is assumed that the response variable (aka label) y is the sum of a potentially non-linear mean function F(X) and random effects Zb:

y = F(X) + Zb + xi

where F(X) is a sum (="ensemble") of trees, xi is an independent error term, and X are predictor variables (aka covariates or features). The random effects Zb can currently consist of:

For non-Gaussian likelihoods (LaGaBoost algorithm), it is assumed that the response variable y follows a distribution p(y|m) and that a (potentially multivariate) parameter m of this distribution is related to a non-linear function F(X) and random effects Zb:

y ~ p(y|m)
m = G(F(X) + Zb)

where G() is a so-called link function. See here for a list of currently supported likelihoods p(y|m).
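The two model equations above can be illustrated with a small simulation. This is only a sketch under stated assumptions and does not use the GPBoost library: the function `simulate`, the stand-in mean function, the single grouped random intercept per group, and the choice of the logistic function for G with a Bernoulli likelihood are all hypothetical choices made for illustration.

```python
import math
import random


def simulate(n_groups=20, n_per_group=10, sigma_b=1.0, sigma=0.5, seed=0):
    """Draw data from y = F(X) + Zb + xi and from y ~ Bernoulli(m), m = G(F(X) + Zb)."""
    rng = random.Random(seed)
    # One random intercept b_j per group; Z simply maps each sample to its group.
    b = [rng.gauss(0.0, sigma_b) for _ in range(n_groups)]
    rows = []
    for j in range(n_groups):
        for _ in range(n_per_group):
            x = rng.uniform(-2.0, 2.0)
            # Hypothetical stand-in for F(X): non-linear with a discontinuity,
            # the kind of function a tree ensemble can approximate.
            f = math.sin(2.0 * x) + (1.0 if x > 0.5 else 0.0)
            latent = f + b[j]                            # F(X) + Zb
            y_gauss = latent + rng.gauss(0.0, sigma)     # Gaussian case: add xi
            m = 1.0 / (1.0 + math.exp(-latent))          # assumed G: logistic function
            y_bin = 1 if rng.random() < m else 0         # non-Gaussian case: y ~ p(y|m)
            rows.append({"group": j, "x": x, "latent": latent,
                         "y_gauss": y_gauss, "m": m, "y_bin": y_bin})
    return rows


data = simulate()
print(len(data), data[0]["group"], data[-1]["group"])
```

Samples in the same group share the same intercept `b[j]`, so their responses are correlated even after conditioning on `x`. This is the dependence that classical independent boosting ignores.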

Estimating or training the above-mentioned models means learning both the covariance parameters (aka hyperparameters) of the random effects and the predictor function F(X). Both the GPBoost and the LaGaBoost algorithms iteratively learn the covariance parameters and add a tree to the ensemble of trees F(X) using a functional gradient and/or a Newton boosting step. See Sigrist (2022, JMLR) and Sigrist (2023, TPAMI) for more details.
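For the Gaussian case, one iteration of the alternating scheme described above can be sketched as follows. The notation is an assumption introduced here and should be checked against Sigrist (2022, JMLR): θ denotes the covariance parameters, Ψ the covariance matrix of y implied by the random effects and the error term, Σ the covariance matrix of b, σ² the error variance, ν a learning rate, and F the vector of values F(X). Only the gradient variant of the boosting step is shown.

```latex
\begin{aligned}
L(y, F, \theta) &= \tfrac{1}{2} (y - F)^\top \Psi_\theta^{-1} (y - F)
                  + \tfrac{1}{2} \log\det \Psi_\theta
                  + \tfrac{n}{2} \log(2\pi),
  \qquad \Psi_\theta = Z \Sigma_\theta Z^\top + \sigma^2 I \\
\theta_m &= \arg\min_{\theta} \; L(y, F_{m-1}, \theta) \\
f_m &= \arg\min_{f \in \text{trees}} \;
       \bigl\| \Psi_{\theta_m}^{-1} (y - F_{m-1}) - f(X) \bigr\|^2 \\
F_m &= F_{m-1} + \nu \, f_m
\end{aligned}
```

The target of the tree fit in the third line is the negative functional gradient of L with respect to F, since the gradient equals Ψ⁻¹(F − y). With Ψ equal to the identity matrix this reduces to the ordinary residuals y − F used in classical least-squares boosting.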

Strengths and weaknesses of tree-boosting and linear mixed effects and GP models

Classical independent tree-boosting

| Strengths | Weaknesses |
|---|---|
| State-of-the-art prediction accuracy | Assumes conditional independence of samples |
| Automatic modeling of non-linearities, discontinuities, and complex high-order interactions | Produces discontinuous predictions for, e.g., spatial data |
| Robust to outliers in and multicollinearity among predictor variables | Can have difficulty with high-cardinality categorical variables |
| Scale-invariant to monotone transformations of predictor variables | |
| Automatic handling of missing values in predictor variables | |
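The invariance to monotone transformations listed above follows from the fact that a tree split depends only on the ordering of a predictor's values. A minimal sketch with a single least-squares split; the helper `best_split_partition` and the toy data are hypothetical and not part of GPBoost:

```python
import math


def best_split_partition(x, y):
    """Return the indices sent to the left child by the best least-squares split on x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best_sse, best_left = float("inf"), None
    for k in range(1, len(order)):
        if x[order[k - 1]] == x[order[k]]:
            continue  # no threshold can separate equal values
        left, right = order[:k], order[k:]
        sse = 0.0
        for part in (left, right):
            mean = sum(y[i] for i in part) / len(part)
            sse += sum((y[i] - mean) ** 2 for i in part)
        if sse < best_sse:
            best_sse, best_left = sse, frozenset(left)
    return best_left


x = [0.1, 0.4, 0.9, 1.5, 2.2, 3.0]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.3]
x_transformed = [math.exp(v) for v in x]  # strictly increasing transformation

same = best_split_partition(x, y) == best_split_partition(x_transformed, y)
print(same)  # True
```

The threshold value itself changes under the transformation, but the partition of the samples, and hence the fitted tree's predictions on the training data, does not.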

Linear mixed effects and Gaussian process (GP) models (aka latent Gaussian models)

| Strengths | Weaknesses |
|---|---|
| Probabilistic predictions, which allow for uncertainty quantification | Zero or a linear prior mean (predictor, fixed effects) function |
| Incorporation of reasonable prior knowledge. E.g., for spatial data: "close samples are more similar to each other than distant samples" and a function should vary continuously / smoothly over space | |
| Modeling of dependency which, among other things, can allow for more efficient learning of the fixed effects (predictor) function | |
| Grouped random effects can be used for modeling high-cardinality categorical variables | |

News

Open issues - contribute

Software issues

Methodological issues

Computational issues

References

License

This project is licensed under the terms of the Apache License 2.0. See LICENSE for more information.