Quantco / glum

High performance Python GLMs with all the features!
https://glum.readthedocs.io/

Question: Differences in VC matrix #823

Closed mayer79 closed 1 month ago

mayer79 commented 1 month ago

The variance-covariance estimates (and the standard errors derived from their diagonal) can differ by quite some margin (10-20%) from those of statsmodels, see the example below. Is this as expected?

If I am not mistaken, glum uses n/(n-p-1) times the inverse of the Fisher information matrix. I am not sure what statsmodels is doing, but it seems to be identical to R's glm(), which uses "dispersion * (X'X)^-1", where the inverse is calculated from the R part of the QR decomposition and the dispersion depends on the family. For the normal family, it is sum(residuals^2) / (n - p - 1).
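
To see what that formula gives, here is a minimal hand-rolled sketch for the Gaussian case on the data below (variable names are mine; it assumes an intercept column is prepended):

import numpy as np

# Classical (nonrobust) covariance for a Gaussian GLM:
# dispersion * (X'X)^-1, with dispersion = sum(residuals^2) / (n - p - 1),
# where p counts the features excluding the intercept.
n = 100
rng = np.random.default_rng(0)
X = rng.standard_normal((n, 5))
y = (X[:, 0] > 0) + rng.uniform(size=n)

Xc = np.column_stack([np.ones(n), X])         # prepend an intercept column
beta = np.linalg.lstsq(Xc, y, rcond=None)[0]  # Gaussian GLM fit = OLS
resid = y - Xc @ beta
dispersion = resid @ resid / (n - X.shape[1] - 1)
vc = dispersion * np.linalg.inv(Xc.T @ Xc)
se = np.sqrt(np.diag(vc))                     # should match R / statsmodels below

Multiplying vc by n / (n - p - 1) should then reproduce glum's nonrobust numbers, as the discussion below confirms.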

Glum

import numpy as np
from glum import GeneralizedLinearRegressor
import statsmodels.api as sm

# Simulate a Gaussian response with one informative feature
n = 100
rng = np.random.default_rng(0)
X = rng.standard_normal((n, 5))
y = (X[:, 0] > 0) + rng.uniform(size=n)

# Glum: unpenalized fit (alpha=0), then store and print the VC matrix
model = GeneralizedLinearRegressor(alpha=0).fit(X, y)
_ = model.covariance_matrix(X=X, y=y, store_covariance_matrix=True)
model.coef_table()

[screenshot: output of model.coef_table()]

Statsmodels

# Statsmodels
model2 = sm.GLM(y, sm.add_constant(X)).fit()
model2.summary()

[screenshot: statsmodels GLM summary table]
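
The R check below reads the simulated data from parquet; the export step is not shown in the original, but a plausible version from the same Python session (file and column names are assumptions chosen to match the R output) is:

import numpy as np
import pandas as pd

# Recreate the data from the snippet above (same seed) and export it;
# requires pyarrow. File and column names are assumed, not from the original.
n = 100
rng = np.random.default_rng(0)
X = rng.standard_normal((n, 5))
y = (X[:, 0] > 0) + rng.uniform(size=n)

pd.DataFrame(X, columns=[f"X{i}" for i in range(X.shape[1])]).to_parquet("X.parquet")
pd.DataFrame({"y": y}).to_parquet("y.parquet")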

R

library(arrow)

X <- read_parquet("X.parquet")
y <- read_parquet("y.parquet")

fit <- glm(y ~ ., data = data.frame(y = y[[1]], X))
summary(fit)

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.99812    0.04283  23.304  < 2e-16 ***
X0           0.46044    0.04666   9.867 3.52e-16 ***
X1           0.02333    0.04118   0.567   0.5724    
X2           0.09017    0.04593   1.963   0.0526 .  
X3          -0.02790    0.03703  -0.753   0.4530    
X4          -0.05887    0.03817  -1.543   0.1263    

Versions:

lbittarello commented 1 month ago

Thanks for the issue and for the MRE!

Glum and Statsmodels produce the same standard errors, except for a correction for the degrees of freedom in the model. Glum uses n / (n - k - 1), where k is the number of features in the model, whereas Statsmodels doesn't correct homoscedastic standard errors at all. To verify this:

import glum
import numpy as np
import statsmodels.api as sm

n = 100
rng = np.random.default_rng(0)

X = rng.standard_normal((n, 5))
y = (X[:, 0] > 0) + rng.uniform(size=n)

# glum's degrees-of-freedom correction: n / (n - k - 1), with k features
dof_corr = len(y) / (len(y) - X.shape[1] - 1)

glum_se = (
    glum.GeneralizedLinearRegressor(alpha=0)
    .fit(X, y)
    .std_errors(X=X, y=y, robust=False)
)

sm_se = sm.GLM(y, sm.add_constant(X)).fit(cov_type="nonrobust").bse

# identical once the statsmodels SEs are scaled by sqrt(dof_corr)
np.testing.assert_array_almost_equal(glum_se, sm_se * np.sqrt(dof_corr))

The standard errors are equal up to fifteen digits. :)

mayer79 commented 1 month ago

You are faster than the wind, thanks! I have also added the results from R's glm().

If I want to match the result of statsmodels, I'd thus need to multiply the VC matrix by 1 / dof_corr. Nice!

mayer79 commented 1 month ago

@lbittarello I am probably making a very stupid mistake, but why are these standard errors again different from statsmodels?

VC = model.covariance_matrix(X=X, y=y, store_covariance_matrix=True)
np.sqrt(VC.diagonal() / dof_corr)

# array([0.0434066, 0.0405607, 0.03731663, 0.04683188, 0.03192106, 0.03436726])

lbittarello commented 1 month ago

Glum computes robust standard errors by default. You should get the same SEs with

VC = model.covariance_matrix(X=X, y=y, store_covariance_matrix=True, robust=False)
np.sqrt(VC.diagonal() / dof_corr)
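
For intuition about the gap: robust standard errors come from a White-type sandwich estimator rather than from dispersion * (X'X)^-1. A minimal HC0-style sketch for the Gaussian case (an illustration only; glum's actual robust estimator may apply further small-sample corrections):

import numpy as np
import statsmodels.api as sm

n = 100
rng = np.random.default_rng(0)
X = rng.standard_normal((n, 5))
y = (X[:, 0] > 0) + rng.uniform(size=n)

Xc = sm.add_constant(X)
beta = np.linalg.lstsq(Xc, y, rcond=None)[0]  # Gaussian GLM fit = OLS
resid = y - Xc @ beta

bread = np.linalg.inv(Xc.T @ Xc)
meat = Xc.T @ (Xc * resid[:, None] ** 2)      # sum_i e_i^2 * x_i x_i'
vc_hc0 = bread @ meat @ bread                 # heteroskedasticity-robust VC
se_hc0 = np.sqrt(np.diag(vc_hc0))
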
mayer79 commented 1 month ago

Ahaaaa, thanks again!

lbittarello commented 1 month ago

I'm closing this issue as resolved, but feel free to reopen if you have more questions!