fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

proportion outcome between [0,1] #75

Open mikejacktzen opened 2 years ago

mikejacktzen commented 2 years ago

Hi, @fabsig thank you for your work, this sounds like an exciting method.

IIRC, you currently only support 0/1 binary outcomes with a logistic link (based on Ctrl+F searching for 'logit' in the file below):

https://github.com/fabsig/GPBoost/blob/18f32760ac8617e3db65e1b0993fc4f00c1017be/include/GPBoost/likelihoods.h

Is there a small modification that would enable gpboost to run on proportion outcomes in [0,1]?

# in R

library(gpboost)

X = matrix(rnorm(2*100), ncol=2)
b = c(2, -2)
eta = X%*%b - 1
p = plogis(eta)
n = rep(10, length(p))
# Simulate y wins in n games
y = rbinom(100, n, p)

# convert the (y, n) pairs into scalar proportions y/n
outcome_xgb = y/n
group_data = sample(letters[1:4], length(p), replace = TRUE)

gp_model <- fitGPModel(group_data = group_data, likelihood = "bernoulli_logit", y = outcome_xgb, X = X)

I get the expected error:

Error in gpb.call("GPB_OptimLinRegrCoefCovPar_R", ret = NULL, private$handle, : [GPBoost] [Fatal] Response variable (label) data needs to be 0 or 1 for likelihood of type 'bernoulli_logit'.
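
(As an aside: since n is known in this simulation, one workaround that I believe is statistically equivalent to a binomial model is to expand each (y, n) pair into n Bernoulli 0/1 rows and keep the existing bernoulli_logit likelihood. A rough sketch, which gets wasteful for large n:)

# Hypothetical workaround sketch: expand each (y, n) pair into n rows of
# 0/1 outcomes so the existing 'bernoulli_logit' likelihood applies;
# assumes n is known and inflates the data from 100 to sum(n) rows
idx <- rep(seq_along(y), times = n)
y01 <- unlist(mapply(function(yi, ni) c(rep(1, yi), rep(0, ni - yi)), y, n))
gp_model_expanded <- fitGPModel(group_data = group_data[idx],
                                likelihood = "bernoulli_logit",
                                y = y01, X = X[idx, , drop = FALSE])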

I was hoping it would work similarly to xgboost when using the 'reg:logistic' objective https://datascience.stackexchange.com/questions/10595/difference-between-logistic-regression-and-binary-logistic-regression


library(xgboost)
# ?xgboost
# see the stackexchange link above for 'binary:logistic' vs 'reg:logistic'

# objective = "reg:logistic"
param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2, objective = "reg:logistic")

# convert the pair (y, n) into the scalar proportion y/n
outcome_xgb <- y/n
dtrain <- xgb.DMatrix(X, label = outcome_xgb)
dtest <- xgb.DMatrix(X, label = outcome_xgb)
watchlist <- list(train = dtrain, eval = dtest)

bst <- xgb.train(param, dtrain, nrounds = 2, watchlist)
pred <- predict(bst, dtrain)
head(pred)

This feature would solve a special use case and I think would be helpful in other scenarios as well.

I see the post below and hope this is not a big ask, but I understand if there is hidden complexity that I cannot foresee.

https://github.com/fabsig/GPBoost/issues/7

fabsig commented 2 years ago

@mikejacktzen Thank you for your suggestion.

It seems that this is a useful add-on that can be implemented with a manageable effort. May I ask, what is the approach of 'reg:logistic' in XGBoost? Is it the quasi-likelihood approach of Papke and Wooldridge (1996) (Equation (5))?
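
For reference, my understanding of that quasi-likelihood is simply the Bernoulli log-likelihood evaluated at a fractional response y in [0,1]; a minimal R sketch of the formula (not of any GPBoost internals):

# Bernoulli quasi log-likelihood for a fractional response y in [0,1],
# as I read Papke & Wooldridge (1996), eq. (5); plogis is the logistic cdf
quasi_loglik <- function(y, eta) {
  p <- plogis(eta)
  sum(y * log(p) + (1 - y) * log(1 - p))
}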

mikejacktzen commented 2 years ago

That sounds right. Is something similar to the quasi-likelihood already what lightgbm uses? If so, that's even better, time- and coding-wise.

I'm hoping this request does not have to define a whole new objective, and instead can modify the existing objective.

Perhaps remove some of the safety checks that currently enforce 0/1 and allow [0,1] to be used for the logistic (quasi) likelihood (if it's that easy). Using variants of the logistic loss would be perfect.

I did some more digging on XGBoost; it sounds like it does not make a substantive distinction between 'binary:logistic' and 'reg:logistic' when it comes to the loss objective during training.

https://datascience.stackexchange.com/questions/9802/what-is-the-difference-in-xgboost-binarylogistic-and-reglogistic

As another example, the glmnet package in R allows both 0/1 and [0,1] responses using the same binomial(N=1, p) logistic likelihood internally.
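
For instance (a rough sketch from memory; if I recall the interface correctly, glmnet's binomial family also accepts a two-column matrix of counts rather than only 0/1 labels):

library(glmnet)

# response passed as a two-column counts matrix (failures, successes);
# the second column is treated as the target class
fit <- glmnet(X, cbind(n - y, y), family = "binomial")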

mikejacktzen commented 2 years ago

I think I should have started my hunt with lightgbm examples!

It looks like lightgbm supports cross-entropy, which allows [0,1] labels with a loss similar to the binomial(N=1, p) / quasi-likelihood above:

https://lightgbm.readthedocs.io/en/latest/Parameters.html#core-parameters

Maybe enabling cross-entropy in gpboost would be the easiest route?

# in R

library(lightgbm)

X = matrix(rnorm(2*100), ncol=2)
b = c(2, -2)
eta = X%*%b - 1
p = plogis(eta)
n = rep(10, length(p))
# Simulate y wins in n games
y = rbinom(100, n, p)

# https://github.com/Microsoft/LightGBM/issues/1348

# 0/1 logistic logloss: the "binary" objective requires a 0/1 label,
# so simulate a Bernoulli outcome here
y01 = rbinom(100, 1, p)
train_matrix = lgb.Dataset(data = X, label = y01)
val_matrix = lgb.Dataset(data = X, label = y01)
valid = list(test = val_matrix)

params = list(max_bin = 5,
              learning_rate = 0.001,
              objective = "binary",
              metric = 'binary_logloss')
bst = lgb.train(params = params, data = train_matrix, valids = valid, nrounds = 10)

# cross-entropy: accepts labels anywhere in [0,1]

outcome_prop = y/n
train_matrix = lgb.Dataset(data = X, label = outcome_prop)
val_matrix = lgb.Dataset(data = X, label = outcome_prop)
valid = list(test = val_matrix)

params = list(max_bin = 5,
              learning_rate = 0.001,
              objective = "cross_entropy")
bst = lgb.train(params = params, data = train_matrix, nrounds = 10)

fabsig commented 2 years ago

Thanks for the update. GPBoost cannot simply reuse LightGBM losses, as the computations are very different, but it should still not be too much effort. I plan to implement this once I have time.

fabsig commented 2 years ago

@mikejacktzen: Sorry for the long silence. I finally started to have a look at this. It turns out that things are slightly more complicated than I first assumed. While simply implementing a Bernoulli quasi log-likelihood function, such as what LightGBM calls _cross-entropy_ 🤨, is relatively simple from a software engineering point of view, it is currently unclear to me how to implement this in a sound way for models with random effects without doing something stupid from a statistical point of view. For instance, robust standard errors via a sandwich estimator need to be used when working with a Bernoulli quasi log-likelihood function. See, e.g., this discussion and this one for more details in the GLMM case (yes, the GPBoost algorithm does not come with standard errors for the fixed effects, but the GPBoost library also allows for GLMMs, and it calculates approximate standard errors of (co-)variance parameters). It is currently not clear to me whether, e.g., lme4 even allows for a Bernoulli quasi log-likelihood (not a binomial one) for fractional response data in a GLMM context. Besides, a Bernoulli quasi log-likelihood function is only one of several ways of modeling fractional response data (others include distributions with support on [0,1], and the binomial distribution if the "number of trials" per observation is known).
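
(For readers less familiar with this: by a sandwich estimator I mean, roughly, $\widehat{\operatorname{Var}}(\hat\theta) \approx A^{-1} B A^{-1}$ with $A = -\mathbb{E}\big[\nabla^2_\theta\, q(\theta)\big]$ and $B = \operatorname{Var}\big[\nabla_\theta\, q(\theta)\big]$, where $q$ is the quasi log-likelihood. Since a quasi-likelihood generally does not satisfy the information equality $A = B$, the naive inverse-Hessian standard errors are not valid.)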

Having said this, I have spent relatively little time on the topic. To implement something about which I am confident that it makes sense, I would need to spend more time on this. If, e.g., a collaboration for an article is an option for you, I would be willing to invest time into this. Let me know. Otherwise, this is currently on hold for me. It goes without saying that contributions are very welcome, but they should also come with a sound justification of what (and why) is being done.

mikejacktzen commented 1 year ago

Thanks for taking a look, Fabio. I've come to realize that the things that seem easy are usually the hardest to achieve. I'll follow up offline in an email that ties together some of the feature requests (including this one).