fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models
Other
574 stars 46 forks source link

Offset term in Poisson regression #142

Closed hayato-n closed 5 months ago

hayato-n commented 5 months ago

Hi, I tried to estimate Poisson GLMM with an offset term, and I suspect that the offset term may not work in estimation.

I conducted an easy simulation as follows with gpboost=1.5.0.1:

import numpy as np
import gpboost as gpb
import matplotlib.pyplot as plt

print(gpb.__version__)

n = 100
a, b = 2, 1
rng = np.random.default_rng(123)

exposure = rng.choice([10, 20, 30], size=n)
x = rng.normal(size=n)
mu = np.exp(a + b * x) * exposure
y = rng.poisson(mu)
X = np.stack([np.ones(n), x]).T

model = gpb.GPModel(group_data=np.ones(n, dtype=int), likelihood="poisson")
model.fit(y=y, X=X, offset=np.log(exposure))
print(model.summary())

pred1 = model.predict(X_pred=X, group_data_pred=np.ones(n, dtype=int))
pred2 = model.predict(
    X_pred=X, group_data_pred=np.ones(n, dtype=int), offset_pred=np.log(exposure)
)

plt.axis("equal")
plt.scatter(y, pred1["mu"], label="1", s=2)
plt.scatter(y, pred2["mu"], label="2", s=2)
plt.scatter(y, pred2["mu"] / exposure, label="2 / exposure", s=2)
plt.legend()
plt.show()

I expected that pred2 would become an appropriate prediction because its offset term is specified in the same manner as model.fit. However, pred1 and pred2 / exposure provided accurate predictions.

Do I misunderstand how I specify the offset term?

fabsig commented 5 months ago

Thanks a lot for using GPBoost and for reporting this issue!

There are two offsets: one for training and one for prediction. I.e., one that is added to the linear predictor during estimation and one that is added to the linear predictor for prediction. The current implementation for handling the offset for training is such that this needs to be passed again when calling the predict function (in addition to a potential offset_pred for the prediction points). This is arguably not very user-friendly. I thought that "nobody" was going to use this feature anyway. Turns out I was wrong.

I have now fixed this that the offset (that was passed during estimation) is saved internally. It is on GitHub but not yet on PyPI / CRAN. What you can do in the meantime is just passing the offset again when you call the predict function:

pred2 = model.predict(
    X_pred=X, group_data_pred=np.ones(n, dtype=int), 
    offset=np.log(exposure), offset_pred=np.log(exposure)
)

I hope this helps.

hayato-n commented 5 months ago

Thank you for your quick response, @fabsig ! The specification you showed worked.

offset term is really useful for social observation data, such as economic data or official statistics. It is because the exposures toward risks are not uniform in usual. For example, when we use burglary count data, the number of houses are not uniform. Hence we want to control it with offset term. Thus, I was really grad to find that this package supports offset term.

fabsig commented 5 months ago

Glad to hear that it worked. And thanks for the explanation about the use of the offset term!