Closed by mlondschien 7 months ago
You shouldn't have a constant in the endogenous. Constants are always exogenous.
import numpy as np
from linearmodels.iv import IVLIML

# x is endogenous (correlated with the error e); z is a valid instrument
x, e, z = np.random.multivariate_normal(
    [0, 0, 0], [[1, 0.5, 0.5], [0.5, 1, 0], [0.5, 0, 1]], size=500
).T
y = 1 + x + e
const = np.ones_like(x)
# constant as exogenous, x as endogenous, z as instrument
IVLIML(y, const, x, z).fit()
Out[1]:
                          IV-LIML Estimation Summary
==============================================================================
Dep. Variable:              dependent   R-squared:                      0.6537
Estimator:                    IV-LIML   Adj. R-squared:                 0.6530
No. Observations:                 500   F-statistic:                    81.456
Date:                Tue, Apr 23 2024   P-value (F-stat)                0.0000
Time:                        08:27:28   Distribution:                  chi2(1)
Cov. Estimator:                robust

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
exog           0.9909     0.0453     21.853     0.0000      0.9021      1.0798
endog          0.9744     0.1080     9.0253     0.0000      0.7628      1.1860
==============================================================================

Endogenous: endog
Instruments: instruments
Robust Covariance (Heteroskedastic)
Debiased: False
Kappa: 1.000
IVResults, id: 0x158414c7d10
Closing as answered, but feel free to comment if anything is not clear.
Thanks.
How do I specify whether the first stage has a constant or not?
Exogenous variables are always included in the first stage.
In shorthand, the first stage is W = X + Z, where W are the endogenous variables, X the exogenous variables, and Z the instruments. If you don't want a constant in the first stage, just use
IVLIML(y,None,x,z)
This worked in your code above.
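A minimal numpy sketch (made-up data, independent of linearmodels) of why the two calls differ: with the constant passed as exogenous, the first stage regresses the endogenous variable on [1, z], whereas with exog=None it regresses on z alone.

```python
import numpy as np

# Made-up data: the endogenous variable has a nonzero mean, so the
# presence or absence of the constant in the first stage is visible.
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
x = 2.0 + 0.5 * z + rng.normal(size=n)

# First stage WITH the constant treated as exogenous: x on [1, z]
A_with = np.column_stack([np.ones(n), z])
xhat_with = A_with @ np.linalg.lstsq(A_with, x, rcond=None)[0]

# First stage WITHOUT a constant (the exog=None case): x on z alone
A_without = z[:, None]
xhat_without = A_without @ np.linalg.lstsq(A_without, x, rcond=None)[0]

print(np.allclose(xhat_with, xhat_without))  # False: the fits differ
```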
Thanks. But this gives the wrong result, since there is a constant in the second stage of my model. How do I fit a model with a constant in the second stage but none in the first? I realize this is a bit pedantic.
What you are trying to do is statistically unsound. Why do you think you need to do this?
When you include the constant as exogenous (which it is), then the two models fit are
W = X + Z
Y = X + W-hat
W-hat is a linear combination of X and Z, but X is still there in the second stage.
This is identical to fitting
X + W = X + Z
Y = X-hat + W-hat
In this contrived example, X-hat is trivially equal to X, since the best exogenous predictor of X is just X.
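This can be checked directly with a small numpy sketch (made-up data): projecting the exogenous regressor, here the constant, onto [X, Z] returns it unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
z = rng.normal(size=n)
const = np.ones(n)  # the exogenous regressor X in this contrived example

# Regress X on [X, Z]: since X lies in the column space, the fit is exact
A = np.column_stack([const, z])
coef = np.linalg.lstsq(A, const, rcond=None)[0]
const_hat = A @ coef

print(np.allclose(const_hat, const))  # True: X-hat is trivially X
```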
I've thought and enquired about this a bit more. Here are two relevant references:
- The Stata FAQ states that in a triangular simultaneous equation model as in IV, not including all exogenous regressors in the first stage is unbiased but not supported by ivregress.
- The Stackexchange answer states that excluding exogenous regressors $X$ in the first stage $Z \rightarrow T$ leads to bias in the causal parameter $T \rightarrow Y$ if there is a link (confounding) $Z \leftrightarrow X$ and $T \leftrightarrow X$.
That is, there are settings where it is statistically sound to exclude exogenous regressors from the first stage, given prior (domain) knowledge. The setting where the exogenous regressor is an intercept is one such setting.
> The Stata FAQ states that in a triangular simultaneous equation model as in IV, not including all exogenous regressors in the first stage is unbiased but not supported by ivregress.
This is correct, but it is statistically unsound to do this. The reason is that using only the instruments leads to a larger error variance than when both the instruments and the exogenous regressors are included, and hence to a lower R2 in the first stage. This in turn leads to a second-stage fit that is less correlated with the true effect, and so to larger standard errors.
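A quick numpy sketch (made-up data) of the nesting argument behind this: the first stage that includes the exogenous regressor can never have a larger residual sum of squares than the one using the instruments alone.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
z = rng.normal(size=n)
x = 1.0 + 0.5 * z + rng.normal(size=n)  # first-stage dependent variable

A_full = np.column_stack([np.ones(n), z])  # instruments + exogenous (constant)
A_iv = z[:, None]                          # instruments only

rss_full = np.sum((x - A_full @ np.linalg.lstsq(A_full, x, rcond=None)[0]) ** 2)
rss_iv = np.sum((x - A_iv @ np.linalg.lstsq(A_iv, x, rcond=None)[0]) ** 2)

# Nested models: adding the exogenous regressor can only lower the RSS,
# i.e. raise the first-stage R2.
print(rss_full <= rss_iv)  # True
```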
> The Stackexchange answer states that excluding exogenous regressors $X$ in the first stage leads to bias in the causal parameter if there is a link (confounding) $Z \leftrightarrow X$ and $T \leftrightarrow X$.
I don't think this supports excluding them. The conclusion is that they either need to be included or can optionally be excluded (under additional orthogonality assumptions, which coincides with the triangular explanation from Stata, but you might as well include them).
In general, if you are in a case where including exogenous regressors leads to bias (really inconsistency, since we don't have unbiasedness in IV in general), then it becomes a different problem (one of limiting bias, rather than having valid IV estimates).
> The reason is that using only the instruments leads to a larger error variance than when both the instruments and the exogenous regressors are included
Do you mean the error variance of the estimate of the causal parameter? If so, would you happen to have a reference for this statement?
> I don't think this supports excluding them.
This does not make a statement about whether excluding them is better than including them, yes.
> In general, if you are in a case where including exogenous regressors leads to bias (really inconsistency, since we don't have unbiasedness in IV in general), then it becomes a different problem (one of limiting bias, rather than having valid IV estimates).
To my understanding, if the regressors are exogenous, including them should never lead to inconsistency, correct? Are there settings where including/excluding them in the first stage improves/worsens the (asymptotic) efficiency of the estimator (assuming consistency)? There are settings where excluding exogenous regressors from the first stage does improve asymptotic efficiency.
Angrist and Krueger (1995), Tables IV-VI, columns (4), (8) are examples where exogenous variables (ageq and ageqsq) are included in the regression, but not as an instrument. See the last lines of
https://economics.mit.edu/sites/default/files/inline-files/QOB%20Table%20IV.do
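For illustration only, a numpy sketch of the kind of manual two-stage procedure this corresponds to, on simulated data with hypothetical variable names (w plays the role of an exogenous control like ageq that is kept in the second stage but excluded from the first). Note the point estimates are consistent here only because w is independent of the instrument; the second-stage OLS standard errors would be wrong.

```python
import numpy as np

# Simulated data (all names hypothetical)
rng = np.random.default_rng(3)
n = 1000
z = rng.normal(size=n)                  # instrument
u = rng.normal(size=n)                  # unobserved confounder
w = rng.normal(size=n)                  # exogenous control, independent of z
x = 0.8 * z + u + rng.normal(size=n)    # endogenous regressor
y = 1.0 + 2.0 * x + 0.5 * w + u + rng.normal(size=n)

# First stage: x on [1, z] only -- w is deliberately excluded
A1 = np.column_stack([np.ones(n), z])
xhat = A1 @ np.linalg.lstsq(A1, x, rcond=None)[0]

# Second stage: y on [1, xhat, w]
A2 = np.column_stack([np.ones(n), xhat, w])
beta = np.linalg.lstsq(A2, y, rcond=None)[0]
print(beta)  # roughly [1, 2, 0.5] in large samples, since w is independent of z
```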
> Minimal reproducible example:
>
> Is this expected behaviour? How do I specify whether the first or second stage has an intercept?
In this example, the kappa parameter would be 1 as there are as many instruments as endogenous variables.
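This can be checked without linearmodels. One standard characterization of the LIML kappa is the smallest eigenvalue of (W' M_X W)(W' M_XZ W)^{-1}, with W = [dependent, endogenous] and M the residual-maker matrices; in the just-identified case that eigenvalue equals 1 exactly. A numpy sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.5 * z + u                          # endogenous regressor
y = 1.0 + x + u + rng.normal(size=n)

const = np.ones((n, 1))

def annihilator(A):
    """Residual-maker matrix M = I - A (A'A)^{-1} A'."""
    return np.eye(A.shape[0]) - A @ np.linalg.solve(A.T @ A, A.T)

M_x = annihilator(const)                           # exogenous only
M_xz = annihilator(np.column_stack([const, z]))    # exogenous + instruments

W = np.column_stack([y, x])                        # [dependent, endogenous]
eigs = np.linalg.eigvals(W.T @ M_x @ W @ np.linalg.inv(W.T @ M_xz @ W))
kappa = eigs.real.min()
print(round(kappa, 6))  # 1.0: just-identified, so LIML collapses to 2SLS
```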