fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

Standard deviation for the Intercept is not calculated when a GLMM is applied #68

Closed toshiakiasakura closed 2 years ago

toshiakiasakura commented 2 years ago

I tried to compare this package with R's lme4 package. I fit the grouseticks model in R like this:

library(lme4)
fit_GLMM <- glmer(TICKS ~ YEAR + (1 | LOCATION), family = "poisson", data = grouseticks, nAGQ = 25)

The result was:

> summary(fit_GLMM)

Generalized linear mixed model fit by maximum likelihood
  (Adaptive Gauss-Hermite Quadrature, nAGQ = 25) [glmerMod]
 Family: poisson  ( log )
Formula: TICKS ~ YEAR + (1 | LOCATION)
   Data: grouseticks

     AIC      BIC   logLik deviance df.resid 
  1374.8   1390.8   -683.4   1366.8      399 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-6.6401 -0.8847 -0.4704  0.8549  5.5755 

Random effects:
 Groups   Name        Variance Std.Dev.
 LOCATION (Intercept) 1.662    1.289   
Number of obs: 403, groups:  LOCATION, 63

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.64907    0.18549   3.499 0.000467 ***
YEAR96       0.94352    0.08529  11.063  < 2e-16 ***
YEAR97      -1.43669    0.12362 -11.622  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
       (Intr) YEAR96
YEAR96 -0.293       
YEAR97 -0.238  0.521

However, when I implemented the same model with GPBoost in Python,

import gpboost as gpb
from patsy import dmatrix
import statsmodels.api as sm

data = sm.datasets.get_rdataset("grouseticks", "lme4").data
group = dmatrix("C(LOCATION)", data=data)  # dummy-coded design matrix
model = gpb.GPModel(group_data=group, likelihood="poisson")
y = data["TICKS"]
X = dmatrix("C(YEAR)", data=data)
model.fit(y=y, X=X, params={"std_dev":True})
model.summary()

I obtained the following output:

Covariance parameters: 
         Group_1   Group_2   Group_3   Group_4  Group_5   Group_6  Group_7  \
Param.  2.225663  0.008534  1.804679  0.955073  0.04956  0.000844   2.7722   

         Group_8   Group_9  Group_10  ...   Group_54  Group_55  Group_56  \
Param.  0.024175  0.284696  0.067653  ...  76.499758  0.489269  1.625301   

        Group_57  Group_58   Group_59  Group_60  Group_61  Group_62  Group_63  
Param.  0.334557  21.34461  76.499758  0.181756  3.314848  1.879863  2.402065  

[1 rows x 63 columns]
Linear regression coefficients: 
           Covariate_1  Covariate_2  Covariate_3
Param.        1.910628     0.982417    -1.457926
Std. dev.          NaN     0.074574     0.115848

I hope that AIC and BIC will be implemented here, but at present I wonder why the standard deviation for the Intercept (Covariate_1) is missing. Why is there a discrepancy between lme4 and GPBoost?

fabsig commented 2 years ago

Thank you for using GPBoost!

When you define a grouped random effects model with GPModel(), you need to provide the categorical variable(s), not the dummy-coded design matrix.

The following code gives approximately the same results as lme4:

import gpboost as gpb
from patsy import dmatrix
import statsmodels.api as sm

data = sm.datasets.get_rdataset("grouseticks", "lme4").data
group = data["LOCATION"]  # raw categorical column, not a dummy-coded matrix
model = gpb.GPModel(group_data=group, likelihood="poisson")
y = data["TICKS"]
X = dmatrix("C(YEAR)", data=data)
model.fit(y=y, X=X, params={"std_dev":True, "optimizer_cov": "nelder_mead"})
model.summary()

Output:

Covariance parameters: 
        LOCATION
Param.  1.646439
Linear regression coefficients: 
           Covariate_1  Covariate_2  Covariate_3
Param.        0.648816     0.943754    -1.436536
Std. dev.     0.182666     0.085208     0.123423

Note that I used "optimizer_cov": "nelder_mead" in the fit function instead of the default gradient descent optimizer, as the Nelder-Mead optimizer converges to a slightly lower negative log-likelihood (the likelihood appears to be quite flat).

AIC and BIC values are not yet displayed. I will mark this issue with an enhancement label and include this feature in a future release.

With the following code, you can calculate the approximate negative log-likelihood, from which you can readily calculate the AIC / BIC yourself:

cov_pars = model.get_cov_pars(format_pandas=False)  # estimated variance parameter(s)
fixed_effects = X.dot(model.get_coef(format_pandas=False)[0])  # fixed-effects part X @ beta_hat
model.neg_log_likelihood(y=y, cov_pars=cov_pars, fixed_effects=fixed_effects)
# Out[48]: 1149.5229837344111
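
For example, a minimal sketch of that computation (counting k = 3 regression coefficients + 1 random-effects variance):

import numpy as np

# sketch: AIC / BIC from the (approximate) negative log-likelihood
nll = model.neg_log_likelihood(y=y, cov_pars=cov_pars, fixed_effects=fixed_effects)
k = 4            # 3 fixed-effects coefficients + 1 random-effects variance
n = len(y)       # 403 observations
aic = 2 * k + 2 * nll
bic = k * np.log(n) + 2 * nll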

Note that GPBoost uses a Laplace approximation, whereas your lme4 code uses adaptive Gauss-Hermite quadrature (nAGQ = 1 in glmer would correspond to the Laplace approximation). This is the reason for the differences in the log-likelihood.

toshiakiasakura commented 2 years ago

Thank you for your rapid reply. However, the code you proposed does not produce the same results for me.

I ran the following code:

import gpboost as gpb
from patsy import dmatrix
import statsmodels.api as sm

data = sm.datasets.get_rdataset("grouseticks", "lme4").data
group = data["LOCATION"]
model = gpb.GPModel(group_data=group, likelihood="poisson")
y = data["TICKS"]
X = dmatrix("C(YEAR)", data=data)
model.fit(y=y, X=X, params={"std_dev":True, "optimizer_cov": "nelder_mead"})
model.summary()

and the result is:

Covariance parameters: 
        LOCATION
Param.  1.901947
Linear regression coefficients: 
           Covariate_1  Covariate_2  Covariate_3
Param.        0.597959     0.982669    -1.375673
Std. dev.     0.194383     0.086156     0.123315

It seems that, in my environment, the optimization is not performed well... Also, when I tried to check the likelihood, I could not use model.neg_log_likelihood because of the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [2], in <cell line: 3>()
      1 cov_pars = model.get_cov_pars(format_pandas=False)
      2 fixed_effects = X.dot(model.get_coef(format_pandas=False)[0])
----> 3 model.neg_log_likelihood(y=y, cov_pars=cov_pars, fixed_effects=fixed_effects)

TypeError: neg_log_likelihood() got an unexpected keyword argument 'fixed_effects'

My package environment is as follows:

%load_ext watermark
%watermark -n -u -v -iv -w -p graphviz

Python implementation: CPython
Python version       : 3.9.7
IPython version      : 8.2.0

graphviz: not installed

statsmodels: 0.13.2
gpboost    : 0.7.6.2

Watermark: 2.3.1

fabsig commented 2 years ago

I reinstalled the package and now I am also getting the same results as you. However, if I make the convergence criterion stricter ("delta_rel_conv": 1e-12), I can reproduce the results I reported above. The likelihood is very flat, and I assume these differences are due to rounding errors (although I am not 100% sure about it).

Further, starting from version 0.7.7, AIC and BIC values are included in the summary, and the summary looks nicer. Note that prior to version 0.7.7, the neg_log_likelihood function did not support the fixed_effects argument, which explains the TypeError above. You no longer need it, though, as the log-likelihood is now reported in the summary. If you want the log-likelihood directly, you can also call get_current_neg_log_likelihood().
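
For example (a minimal sketch, assuming a model fitted as above with version 0.7.7 or later):

# negative log-likelihood at the current parameter estimates (GPBoost >= 0.7.7)
nll = model.get_current_neg_log_likelihood()
print(nll)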

With version 0.7.7, running the following code

import gpboost as gpb
from patsy import dmatrix
import statsmodels.api as sm

data = sm.datasets.get_rdataset("grouseticks", "lme4").data
group = data["LOCATION"]
y = data["TICKS"]
X = dmatrix("C(YEAR)", data=data)
model = gpb.GPModel(group_data=group, likelihood="poisson")
model.fit(y=y, X=X, params={"std_dev":True, "optimizer_cov": "nelder_mead", 
                            "delta_rel_conv": 1e-12})
model.summary()

gives the following output:

===================================================
Model summary:
    Log-lik        AIC         BIC
-1149.14448 2306.28896 2322.284706
Nb. observations: 403
Nb. groups:  LOCATION
       63
---------------------------------------------------
Covariance parameters (random effects):
          Param.
LOCATION  1.6464
---------------------------------------------------
Linear regression coefficients (fixed effects):
             Param.  Std. dev.  z value  P(>|z|)
Covariate_1  0.6488     0.1827   3.5519   0.0004
Covariate_2  0.9437     0.0852  11.0758   0.0000
Covariate_3 -1.4365     0.1234 -11.6391   0.0000
===================================================
toshiakiasakura commented 2 years ago

I got the same results!!! Thanks for your quick improvements!!!

fabsig commented 2 years ago

@toshiakiasakura:

With the new version 0.7.8, the optimization now also works correctly with the default settings. In addition, variable names are now correctly shown for patsy DesignMatrix objects.

With version 0.7.8, running the following code

# group, y, and X as defined in the snippets above
model = gpb.GPModel(group_data=group, likelihood="poisson")
model.fit(y=y, X=X, params={"std_dev":True})
model.summary()

gives the following output:

===================================================
Model summary:
 Log-lik     AIC    BIC
-1149.15 2306.31 2322.3
Nb. observations: 403
Nb. groups: 63 (LOCATION)
---------------------------------------------------
Covariance parameters (random effects):
          Param.
LOCATION  1.6367
---------------------------------------------------
Linear regression coefficients (fixed effects):
               Param.  Std. dev.  z value  P(>|z|)
Intercept      0.6730     0.1822   3.6944   0.0002
C(YEAR)[T.96]  0.9438     0.0852  11.0781   0.0000
C(YEAR)[T.97] -1.4351     0.1233 -11.6347   0.0000
===================================================
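
As an aside, these coefficient names come from the patsy design information; a small sketch of how to inspect them (using the X from the snippets above):

# patsy design matrices carry their column names in design_info;
# these are the names shown in the coefficient table above
print(X.design_info.column_names)
# ['Intercept', 'C(YEAR)[T.96]', 'C(YEAR)[T.97]']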

FWIW:

toshiakiasakura commented 2 years ago

Wow! I am very surprised that you found my article, even though it is written in Japanese. I am very pleased with the improvements to this package, and I will update my article to include the latest information. Thanks a lot!