bashtage / linearmodels

Additional linear models including instrumental variable and panel data models that are missing from statsmodels.
https://bashtage.github.io/linearmodels/
University of Illinois/NCSA Open Source License

How to fit LIML with 1 endogenous variable, 1 instrument, including an intercept #596

Closed mlondschien closed 7 months ago

mlondschien commented 7 months ago

Minimal reproducible example:

Python 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 18:08:41) [Clang 15.0.7 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.14.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np
   ...: import statsmodels.api as sm
   ...: from linearmodels.iv import IV2SLS, IVLIML
   ...: 
   ...: n = 100
   ...: 
   ...: np.random.seed(0)
   ...: 
   ...: Z = np.random.normal(size=(n, 1))
   ...: X = np.random.normal(size=(n, 1))
   ...: beta = np.random.normal(size=1)
   ...: y = X @ beta + np.random.normal(scale=1, size=n) + 1
   ...: 
   ...: IVLIML(y, None, X, Z).fit().params
Out[1]: 
endog    0.221107
Name: parameter, dtype: float64

In [2]: IVLIML(y, None, sm.add_constant(X), Z).fit().params

[ ... ]

ValueError: The number of instruments (1) must be at least as large as the number of endogenous regressors (2).

In [3]: IVLIML(y, None, sm.add_constant(X), sm.add_constant(Z)).fit().params

[ ... ]

ValueError: Unable to estimate kappa. This is most likely occurs if the instrument matrix is rank deficient. The error raised when computing kappa was:

Eigenvalues did not converge

In [4]:  IVLIML(y, sm.add_constant(np.zeros((n, 0))), X, Z).fit().params
Out[4]: 
exog     0.920155
endog   -0.232140
Name: parameter, dtype: float64

Is this expected behaviour? How do I specify whether the first or second stage has an intercept?

In

In [3]: IVLIML(y, None, sm.add_constant(X), sm.add_constant(Z)).fit().params

the kappa parameter would be 1 as there are as many instruments as endogenous variables.
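The claim that kappa equals 1 in the just-identified case can be checked numerically. The following is a minimal sketch using only NumPy, assuming the textbook definition of the LIML kappa as the smallest eigenvalue of (Y' M_exog Y) relative to (Y' M_all Y), with the constant treated as the only exogenous regressor. The simulated data and all variable names are illustrative assumptions, not taken from linearmodels.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data (illustrative): one instrument, one endogenous regressor.
instr = rng.normal(size=(n, 1))
u = rng.normal(size=(n, 1))                      # structural error
endog = 0.5 * instr + 0.5 * u + rng.normal(size=(n, 1))
y = 1.0 + endog + u
const = np.ones((n, 1))                          # the only exogenous regressor


def annihilator(a):
    """Return I - a (a'a)^{-1} a', which maps a vector to its residual on a."""
    return np.eye(a.shape[0]) - a @ np.linalg.pinv(a)


Y = np.hstack([y, endog])                        # dependent and endogenous
M_exog = annihilator(const)                      # partials out exog only
M_all = annihilator(np.hstack([const, instr]))   # partials out exog and instrument

A = Y.T @ M_exog @ Y
B = Y.T @ M_all @ Y

# Smallest eigenvalue of B^{-1} A (assumed textbook definition of kappa).
kappa = np.linalg.eigvals(np.linalg.solve(B, A)).real.min()
print(f"kappa = {kappa:.6f}")
```

With one instrument and one endogenous regressor there is always a direction in which the two quadratic forms coincide, so the ratio attains its lower bound of 1.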

bashtage commented 7 months ago

You shouldn't have a constant in the endogenous variables. Constants are always exogenous.

import numpy as np
from linearmodels.iv import IVLIML

x,e,z = np.random.multivariate_normal([0,0,0],[[1,0.5,0.5],[0.5,1,0],[0.5,0,1]],size=(1,500)).T
y = 1 + x + e
const = np.ones_like(x)

IVLIML(y,const,x,z).fit()
Out[1]:
                          IV-LIML Estimation Summary
==============================================================================
Dep. Variable:              dependent   R-squared:                      0.6537
Estimator:                    IV-LIML   Adj. R-squared:                 0.6530
No. Observations:                 500   F-statistic:                    81.456
Date:                Tue, Apr 23 2024   P-value (F-stat)                0.0000
Time:                        08:27:28   Distribution:                  chi2(1)
Cov. Estimator:                robust

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
exog           0.9909     0.0453     21.853     0.0000      0.9021      1.0798
endog          0.9744     0.1080     9.0253     0.0000      0.7628      1.1860
==============================================================================

Endogenous: endog
Instruments: instruments
Robust Covariance (Heteroskedastic)
Debiased: False
Kappa: 1.000
IVResults, id: 0x158414c7d10

bashtage commented 7 months ago

Closing as answered, but feel free to comment if not clear.

mlondschien commented 7 months ago

Thanks.

How do I specify whether the first stage has a constant or not?

bashtage commented 7 months ago

Exogenous variables are always included in the first stage.

bashtage commented 7 months ago

In shorthand, the first stage is W = X + Z, where W are the endogenous variables, X are the exogenous variables and Z are the instruments. If you don't want a constant in the 1st stage, just use

IVLIML(y,None,x,z)

This worked in your code above.

mlondschien commented 7 months ago

Thanks. But this gives the wrong result, as there is a constant in the second stage. How do I fit a model with a constant in the second stage but none in the first? I see this is a bit pedantic.
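For reference, the estimator being asked about can be written out by hand. This is a minimal NumPy sketch, not a linearmodels feature: the first stage regresses the endogenous variable on the instrument alone, and the second stage regresses y on a constant and the first-stage fit. The data-generating process (mean-zero instrument, no intercept in the true first stage) and all names are assumptions made for illustration, and standard errors taken from a naive second-stage OLS would not be valid.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated data (illustrative): the true first stage has no intercept.
instr = rng.normal(size=(n, 1))
u = rng.normal(size=(n, 1))                      # structural error
endog = 0.5 * instr + 0.5 * u + rng.normal(size=(n, 1))
y = 1.0 + endog + u
const = np.ones((n, 1))

# First stage WITHOUT a constant: endog on the instrument only.
pi_hat, *_ = np.linalg.lstsq(instr, endog, rcond=None)
endog_hat = instr @ pi_hat

# Second stage WITH a constant: y on [const, endog_hat].
params, *_ = np.linalg.lstsq(np.hstack([const, endog_hat]), y, rcond=None)
intercept, slope = params.ravel()
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

Under these assumptions both estimates should land near the true value of 1; if the instrument or the endogenous variable had a nonzero mean, dropping the first-stage constant would change the fit.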

bashtage commented 7 months ago

What you are trying to do is statistically unsound. Why do you think you need to do this?

When you include the constant as exogenous (which it is), then the two models fit are

W = X + Z
Y = X + W-hat

W-hat is a linear combination of X and Z, but X is still there in the second stage.

This is identical to fitting

X + W = X + Z
Y = X-hat + W-hat

In this contrived example, X-hat is trivially equal to X since the best exogenous predictor of X is just X.
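The equivalence described above can be sketched with plain NumPy. This is an illustrative two-stage calculation by hand, assuming a simulated data set with a constant as the only exogenous regressor; the helper `project` and all variable names are hypothetical and not part of linearmodels.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data (illustrative).
z = rng.normal(size=(n, 1))                      # instrument Z
u = rng.normal(size=(n, 1))                      # structural error
w = 0.5 * z + 0.5 * u + rng.normal(size=(n, 1))  # endogenous W
y = 1.0 + w + u
x = np.ones((n, 1))                              # exogenous X (a constant)


def project(target, basis):
    """Fitted values from an OLS regression of target on basis."""
    coef, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return basis @ coef


first_stage = np.hstack([x, z])                  # first stage uses X and Z
x_hat = project(x, first_stage)                  # X is predicted perfectly
w_hat = project(w, first_stage)

# Second stage with X itself versus with X-hat.
beta_x, *_ = np.linalg.lstsq(np.hstack([x, w_hat]), y, rcond=None)
beta_x_hat, *_ = np.linalg.lstsq(np.hstack([x_hat, w_hat]), y, rcond=None)

print(np.allclose(x_hat, x), np.allclose(beta_x, beta_x_hat))
```

Because X lies in the column space of the first-stage regressors, its projection reproduces X, and the two second-stage regressions use the same design matrix up to floating-point error.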

mlondschien commented 7 months ago

I've thought and enquired about this a bit more. Here are two relevant references:

Stata FAQ StackExchange

The Stata FAQ states that in a triangular simultaneous equation model as in IV, not including all exogenous regressors in the first stage is unbiased but not supported by ivregress. The Stackexchange answer states that excluding exogenous regressors $X$ in the first stage $Z \rightarrow T$ leads to bias in the causal parameter $T \rightarrow Y$ if there is a link (confounding) $Z \leftrightarrow X$ and $T \leftrightarrow X$.

That is, there are settings where it is statistically sound to exclude exogenous regressors from the first stage, given prior (domain) knowledge. The setting where the exogenous regressor is an intercept is one such setting.

bashtage commented 7 months ago

The Stata FAQ states that in a triangular simultaneous equation model as in IV, not including all exogenous regressors in the first stage is unbiased but not supported by ivregress.

This is correct, but it is statistically unsound to do this. The reason is that using only the instruments leads to a larger error variance than when both the instruments and the exogenous regressors are included, and to a lower R2 in the first stage. This in turn leads to a second stage fit that is less correlated with the true effect, and so to larger standard errors.
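The first-stage part of this argument can be illustrated with a short NumPy sketch. It only shows that the residual sum of squares cannot go up when the constant is added to the first stage, since the two regressions are nested; it says nothing by itself about the efficiency of the second-stage estimate. The simulated data, in which the endogenous variable has a nonzero mean, and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated first stage (illustrative): the endogenous variable has mean 2.
instr = rng.normal(size=(n, 1))
endog = 2.0 + 0.5 * instr + rng.normal(size=(n, 1))
const = np.ones((n, 1))


def rss(target, regressors):
    """Residual sum of squares from an OLS regression of target on regressors."""
    coef, *_ = np.linalg.lstsq(regressors, target, rcond=None)
    resid = target - regressors @ coef
    return float(resid.T @ resid)


rss_instr_only = rss(endog, instr)                     # instrument only
rss_full = rss(endog, np.hstack([const, instr]))       # constant and instrument

print(f"instrument only: {rss_instr_only:.1f}, with constant: {rss_full:.1f}")
```

How large the gap is depends on how much of the endogenous variable the omitted exogenous regressor explains; here the nonzero mean makes it large by construction.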

The Stackexchange answer states that excluding exogenous regressors $X$ in the first stage $Z \rightarrow T$ leads to bias in the causal parameter $T \rightarrow Y$ if there is a link (confounding) $Z \leftrightarrow X$ and $T \leftrightarrow X$.

I don't think this supports excluding them. The conclusion is that they either need to be included or can optionally be excluded (under additional orthogonality assumptions, which coincides with the triangular explanation from Stata, but you might as well include them).

In general, if you are in a case where including exogenous regressors leads to bias (really inconsistency, since we don't have unbiasedness in IV in general), then it becomes a different problem (one of limiting bias, rather than having valid IV estimates).

mlondschien commented 7 months ago

The reason is that using only the instruments leads to a larger error variance than when both the instruments and the exogenous regressors are included

Do you mean the error variance of the estimate of the causal parameter? If so, would you happen to have a reference for this statement?

I don't think this supports excluding them.

This does not make a statement about whether excluding them is better than including them, yes.

In general, if you are in a case where including exogenous regressors leads to bias (really inconsistency, since we don't have unbiasedness in IV in general), then it becomes a different problem (one of limiting bias, rather than having valid IV estimates).

To my understanding, if the regressors are exogenous, including them should never lead to inconsistency, correct? Are there settings where in / excluding them in the first stage improves / worsens (asymptotic) efficiency of the estimator (assuming consistency)? There are settings where excluding exogenous regressors from the first stage does improve asymptotic efficiency.

mlondschien commented 4 months ago

Angrist and Krueger (1995), Tables IV - VI, columns (4), (8) are examples where exogenous variables (ageq and ageqsq) are included in the regression, but not as an instrument. See last lines of https://economics.mit.edu/sites/default/files/inline-files/QOB%20Table%20IV.do