fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

Using gamma in GPBoost in Python, version 1.2.7.1 (from PyPI) #130

Closed · m-haines closed 4 months ago

m-haines commented 4 months ago

This might be a bug, or more likely a user error. I am attempting to fit a GPBoost model on some heavily right-skewed data using a gamma likelihood, with the intent of comparing the output to a Box-Cox-transformed dataset fitted using a Gaussian likelihood. I have looked through the code in this repository to try to resolve the issue, but without success.

import gpboost as gpb

# Gaussian process model with an exponential covariance function, a gamma
# likelihood, and the Vecchia approximation
gp_model = gpb.GPModel(gp_coords=coords_train.transpose(), cov_function="exponential",
                       likelihood="gamma", gp_approx="vecchia")

# The coordinates are passed via gp_coords, so drop them from the features
x_train = x_train.drop(["Northings", "Eastings"], axis=1)
data_train = gpb.Dataset(x_train, y_train)

params = {'lambda_l2': 1, 'learning_rate': 0.01,
          'max_depth': 3, 'min_data_in_leaf': 20,
          'num_leaves': 2**10, 'verbose': 0}

mod = gpb.train(params=params, train_set=data_train,
                gp_model=gp_model, num_boost_round=247)

Whenever I run the model with these parameters I get the error:

gpboost.basic.GPBoostError: Check failed: aux_pars[0] > 0 at C:\Users\whsigris\Dropbox\HSLU\Projects\MixedBoost\GPBoost\python-package\compile\include\GPBoost/likelihoods.h, line 356

Is this because a gamma likelihood is only supported for a pure Gaussian process / maximum likelihood at this time? I can see options to set the gamma shape (init_aux_pars) in GPModel.fit, but not in gpb.train.

Or (more likely), is there a step I am missing in setting the parameters for gamma, for example the gamma shape? If so, could you please advise how to set the model parameters for gamma? I suspect it involves set_optim_params, or manually setting init_aux_pars on the object, but I am not sure what form the gamma shape parameter takes within set_optim_params or init_aux_pars.

Thank you for any help, and thank you for writing the software!

fabsig commented 4 months ago

Thanks a lot for using GPBoost and for reporting this!

Is this because a gamma likelihood is only supported for a pure Gaussian process / maximum likelihood at this time? I can see options to set the gamma shape (init_aux_pars) in GPModel.fit, but not in gpb.train.

No, gpb.train, i.e., the GPBoost algorithm, can also be used with a gamma likelihood.

Or, (more likely), is there a step I am missing, in setting the parameters for gamma? For example, gamma shape? If I am missing something, please could you advise how to set the model parameters for gamma?

No, you are not missing anything, and you do not need to set anything additional. The gamma shape parameter is learned automatically, starting from an internal, data-dependent initial value.

In conclusion, it seems you are doing everything correctly. This makes me wonder whether this is a numerical overflow and/or a problem with the internal default value for the shape parameter. Could you share the data (private email is also OK) or provide a reproducible example with simulated data?

What you can try is setting the initial value for the shape parameter yourself using the following code before calling gpb.train:

gp_model.set_optim_params(params={"init_aux_pars": [1]})
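
For reference, after calling gpb.train one can check what was actually estimated; a minimal sketch, assuming the gp_model from the example above:

# After training, summary() prints the estimated covariance parameters and,
# for non-Gaussian likelihoods, the auxiliary parameters (here: the gamma shape)
gp_model.summary()
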
m-haines commented 4 months ago

Thank you for the information on setting the shape parameter; that is very helpful.

However, the error was entirely my own (as I suspected) and, in hindsight, obvious. As soon as you mentioned numerical overflow, it struck me that I may not have handled zero values correctly, and sure enough, that appears to be the issue.

Thanks again for your prompt reply.

fabsig commented 4 months ago

Glad to hear. GPBoost checks for 0 or negative values in the response variable when using a gamma likelihood, but I assume in your case they were very small but strictly positive.
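
For anyone hitting the same check failure, a quick screen of the response before fitting can catch both exact zeros and values that are strictly positive but small enough to cause numerical trouble. A minimal sketch, assuming y_train is a NumPy array (the 1e-10 threshold is an arbitrary illustration, not a GPBoost constant):

import numpy as np

# The gamma distribution has strictly positive support: count exact
# zeros/negatives as well as values dangerously close to zero
n_nonpos = int(np.sum(y_train <= 0))
n_tiny = int(np.sum((y_train > 0) & (y_train < 1e-10)))
print(f"min(y) = {y_train.min():.3e}, non-positive: {n_nonpos}, near-zero: {n_tiny}")
# If any are flagged, drop or rescale those observations (remembering to
# filter the corresponding rows of the features and coordinates as well)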

m-haines commented 4 months ago

Just another point for anyone else who might one day find this: don't use a gamma distribution if zero values are important in your dataset. Choose an alternative instead, such as negative binomial or Poisson.

Very roughly: if your variance is close to your mean, Poisson might be better; if your variance is much higher than the mean, try negative binomial, as it has an additional dispersion parameter to account for the excess variance (see the sketch below). If anyone has anything else to add, please do comment, as I am always keen to learn.
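
As an illustration of that rule of thumb, one can compare the sample variance to the mean; a minimal sketch (the 1.5 cutoff is an arbitrary choice for illustration, and the likelihood names assume a gpboost version that supports them):

import numpy as np

# Poisson assumes the variance roughly equals the mean; a ratio well
# above 1 (overdispersion) points towards negative binomial
mean, var = float(np.mean(y_train)), float(np.var(y_train, ddof=1))
print(f"mean = {mean:.3f}, variance = {var:.3f}, ratio = {var / mean:.2f}")
likelihood = "poisson" if var / mean <= 1.5 else "negative_binomial"  # illustrative cutoff
print(f"suggested likelihood: {likelihood}")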

m-haines commented 4 months ago

Glad to hear. GPBoost checks for 0 or negative values in the response variable when using a gamma likelihood, but I assume in your case they were very small but strictly positive.

Yes, it was exactly that case I had forgotten to check for. I should have known better, as I have encountered the same floating-point issue and positive-definiteness problem in the past.