Closed: toshiakiasakura closed this issue 2 years ago
Thank you for using GPBoost!
When you define a grouped random effects model with GPModel(), you need to provide the categorical grouping variable(s) themselves, not the "dummified" (one-hot encoded) design matrix.
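To illustrate the distinction, here is a minimal sketch with a hypothetical toy column (using plain pandas rather than the GPBoost API): GPModel(group_data=...) expects the raw categorical column, whereas a "dummified" design matrix is what pd.get_dummies (or patsy) would produce from it.

```python
import pandas as pd

# Hypothetical toy grouping column, analogous to data["LOCATION"]:
df = pd.DataFrame({"LOCATION": ["A", "A", "B", "C", "B"]})

# This is what GPModel(group_data=...) expects: the raw categorical column.
group = df["LOCATION"]

# This is the "dummified" design matrix, which should NOT be passed as group_data:
dummies = pd.get_dummies(df["LOCATION"])
print(group.nunique())   # 3 distinct groups
print(dummies.shape)     # (5, 3): one indicator column per group
```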
The following code gives approximately the same results as lme4:
import gpboost as gpb
from patsy import dmatrix
import statsmodels.api as sm
data = sm.datasets.get_rdataset("grouseticks", "lme4").data
group = data["LOCATION"]
model = gpb.GPModel(group_data=group, likelihood="poisson")
y = data["TICKS"]
X = dmatrix("C(YEAR)", data=data)
model.fit(y=y, X=X, params={"std_dev":True, "optimizer_cov": "nelder_mead"})
model.summary()
Output:
Covariance parameters:
LOCATION
Param. 1.646439
Linear regression coefficients:
Covariate_1 Covariate_2 Covariate_3
Param. 0.648816 0.943754 -1.436536
Std. dev. 0.182666 0.085208 0.123423
Note that I used "optimizer_cov": "nelder_mead" in the fit function instead of the default gradient descent optimizer, as the Nelder-Mead optimizer converges to a slightly lower negative log-likelihood (which appears to be quite flat).
AIC and BIC values are currently not yet displayed. I will mark this issue with an enhancement label and include this as a feature in a future release.
With the following code, you can calculate the approximate negative log-likelihood, from which you can readily calculate the AIC / BIC yourself:
cov_pars = model.get_cov_pars(format_pandas=False)
fixed_effects = X.dot(model.get_coef(format_pandas=False)[0])
model.neg_log_likelihood(y=y, cov_pars=cov_pars, fixed_effects=fixed_effects)
# Out[48]: 1149.5229837344111
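For completeness, a minimal sketch of that AIC/BIC arithmetic from the value above. The parameter count k = 4 (three fixed-effect coefficients plus one variance parameter) and n = 403 observations are assumptions based on the model above:

```python
import math

nll = 1149.5229837344111  # negative log-likelihood from the call above
k = 4                     # assumed: 3 regression coefficients + 1 variance parameter
n = 403                   # number of observations in the grouseticks data

aic = 2 * k + 2 * nll            # AIC = 2k - 2*log-lik
bic = k * math.log(n) + 2 * nll  # BIC = k*ln(n) - 2*log-lik
print(round(aic, 2), round(bic, 2))
```

With the log-likelihood from the version 0.7.7 summary further below (-1149.14448), the same two formulas reproduce its reported AIC of 2306.28896 and BIC of 2322.2847.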
Note that GPBoost uses a Laplace approximation, whereas you use adaptive Gauss-Hermite quadrature in your code for lme4. This is the reason for the differences in the log-likelihood.
Thank you for your rapid replies. However, the code proposed by you does not produce the same results.
I run the following code.
import gpboost as gpb
from patsy import dmatrix
import statsmodels.api as sm
data = sm.datasets.get_rdataset("grouseticks", "lme4").data
group = data["LOCATION"]
model = gpb.GPModel(group_data=group, likelihood="poisson")
y = data["TICKS"]
X = dmatrix("C(YEAR)", data=data)
model.fit(y=y, X=X, params={"std_dev":True, "optimizer_cov": "nelder_mead"})
model.summary()
and the result is
Covariance parameters:
LOCATION
Param. 1.901947
Linear regression coefficients:
Covariate_1 Covariate_2 Covariate_3
Param. 0.597959 0.982669 -1.375673
Std. dev. 0.194383 0.086156 0.123315
It seems that, in my environment, the optimization is not performed well...
Also, when I tried to check the likelihood, I could not use model.neg_log_likelihood because of the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [2], in <cell line: 3>()
1 cov_pars = model.get_cov_pars(format_pandas=False)
2 fixed_effects = X.dot(model.get_coef(format_pandas=False)[0])
----> 3 model.neg_log_likelihood(y=y, cov_pars=cov_pars, fixed_effects=fixed_effects)
TypeError: neg_log_likelihood() got an unexpected keyword argument 'fixed_effects'
My package environment is as follows:
%load_ext watermark
%watermark -n -u -v -iv -w -p graphviz
Python implementation: CPython
Python version : 3.9.7
IPython version : 8.2.0
graphviz: not installed
statsmodels: 0.13.2
gpboost : 0.7.6.2
Watermark: 2.3.1
I reinstalled the package and now I am also getting the same results as you. However, if I make the convergence criterion stricter ("delta_rel_conv": 1e-12), I can reproduce the results I reported above. The likelihood is very flat, and I assume that these differences are due to rounding errors (although I am not 100% sure about it).
Further, starting from version 0.7.7, AIC and BIC values are included in the summary, and the summary looks nicer. Note that the neg_log_likelihood function did not support the argument fixed_effects prior to version 0.7.7. But you no longer need it, as the log-likelihood is now reported in the summary. If you want the log-likelihood directly, you can now also call get_current_neg_log_likelihood().
With version 0.7.7, running the following code
import gpboost as gpb
from patsy import dmatrix
import statsmodels.api as sm
data = sm.datasets.get_rdataset("grouseticks", "lme4").data
group = data["LOCATION"]
y = data["TICKS"]
X = dmatrix("C(YEAR)", data=data)
model = gpb.GPModel(group_data=group, likelihood="poisson")
model.fit(y=y, X=X, params={"std_dev":True, "optimizer_cov": "nelder_mead",
"delta_rel_conv": 1e-12})
model.summary()
gives the following output:
===================================================
Model summary:
Log-lik AIC BIC
-1149.14448 2306.28896 2322.284706
Nb. observations: 403
Nb. groups: LOCATION
63
---------------------------------------------------
Covariance parameters (random effects):
Param.
LOCATION 1.6464
---------------------------------------------------
Linear regression coefficients (fixed effects):
Param. Std. dev. z value P(>|z|)
Covariate_1 0.6488 0.1827 3.5519 0.0004
Covariate_2 0.9437 0.0852 11.0758 0.0000
Covariate_3 -1.4365 0.1234 -11.6391 0.0000
===================================================
I got the same results!!! Thanks for your quick improvements!!!
@toshiakiasakura:
With the new version 0.7.8, the optimization now also works correctly with the default settings. In addition, the variable names are now correctly shown for patsy DataMatrix objects.
With version 0.7.8, running the following code
model = gpb.GPModel(group_data=group, likelihood="poisson")
model.fit(y=y, X=X, params={"std_dev":True})
model.summary()
gives the following output:
===================================================
Model summary:
Log-lik AIC BIC
-1149.15 2306.31 2322.3
Nb. observations: 403
Nb. groups: 63 (LOCATION)
---------------------------------------------------
Covariance parameters (random effects):
Param.
LOCATION 1.6367
---------------------------------------------------
Linear regression coefficients (fixed effects):
Param. Std. dev. z value P(>|z|)
Intercept 0.6730 0.1822 3.6944 0.0002
C(YEAR)[T.96] 0.9438 0.0852 11.0781 0.0000
C(YEAR)[T.97] -1.4351 0.1233 -11.6347 0.0000
===================================================
FWIW:
Wow! I am very surprised that you found my article, even though it is written in Japanese. I am very pleased with the improvements to this package, and I will update my article to include the latest information. Thanks a lot!
I tried to compare packages with R's lme4 package. I checked the grouseticks model in R like this:
The result was:
However, when I implemented the same model with GPBoost in Python,
I obtained this:
I hope that AIC and BIC will be implemented here, but at present the standard deviation for the Intercept (Covariate_1) seems to be missing. Why is there a discrepancy between lme4 and GPBoost?