fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

Feature Request: Is it possible to support an offset parameter? #123

Closed ggruenhagen3 closed 10 months ago

ggruenhagen3 commented 11 months ago

Is it possible to support an offset parameter like you can for lme4::glmer? This is an important parameter for differential gene expression analysis, where an offset is used for the known size factors. It would be really great if this could be done in gpboost.

An example of the offset parameter in lme4::glmer: lme4::glmer(count ~ cond + (1|subject), data = df, offset = log(size_factors)) or lme4::glmer(count ~ cond + offset(log(size_factors)) + (1|subject), data = df)

fabsig commented 11 months ago

Thank you for the suggestion!

I will add this. As I understand it, an offset is just a sample-specific constant that is added to the linear predictor (= sum of fixed and random effects). Correct me if I am wrong. lme4 also has the option that "One or more offset terms can be included in the formula", but I guess one offset is enough, since a user can calculate the sum before passing it. I.e., an offset will be a vector whose length equals the number of data points.
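
To make this definition concrete, here is a sketch in generic GLMM notation. The symbols are assumptions introduced for illustration and are not taken from the GPBoost or lme4 documentation: $x_i$ and $z_i$ are the fixed- and random-effects covariates of observation $i$, $\beta$ the fixed-effects coefficients, $b$ the random effects, and $o_i$ the offset.

$$
\eta_i = x_i^\top \beta + z_i^\top b + o_i, \qquad i = 1, \dots, n
$$

For the count-data use case mentioned above, assuming a Poisson likelihood with log link and known size factors $s_i$, setting $o_i = \log(s_i)$ gives

$$
\log \mathbb{E}[y_i \mid b] = \log(s_i) + x_i^\top \beta + z_i^\top b
\quad\Longleftrightarrow\quad
\mathbb{E}[y_i \mid b] = s_i \exp\!\left(x_i^\top \beta + z_i^\top b\right),
$$

so the size factor scales the expected count multiplicatively and no coefficient is estimated for it.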

ggruenhagen3 commented 11 months ago

Awesome, thank you! Yes, that is my understanding too.

ggruenhagen3 commented 10 months ago

Hi @fabsig, is it looking like adding this feature is possible? Or have you run into roadblocks? Thank you so much and have a great day! 😃

fabsig commented 10 months ago

Lots of other work... Will add it soon (hopefully within 1-2 weeks). Thanks for your patience.

fabsig commented 10 months ago

The offset feature is now implemented and on GitHub (not yet on CRAN). You can pass an offset via the fixed_effects parameter of the fit function of GLMMs. For instance in R:

gp_model <- fitGPModel(group_data = group, likelihood = "bernoulli_probit",
                       y = y, X = X, fixed_effects = offset)

The only caveat is that, currently, if you call the predict function, you need to pass the same (training data) offset again, e.g.:

pred <- predict(gp_model, group_data_pred = group_test, X_pred = X_test, 
                predict_response = FALSE, fixed_effects = offset)

This is, admittedly, not ideal from a user experience point of view, and it could be fixed. But since this feature is likely to be used only rarely and my time is limited, I will leave it like that for now.

If you want to use an offset in the GPBoost algorithm, I suggest you use the init_score argument:

dtrain <- lgb.Dataset(X, label = label)
set_field(dtrain, "init_score", offset)

The latter is functionality that is inherited from LightGBM (I have not tested it).