bashtage / linearmodels

Additional linear models including instrumental variable and panel data models that are missing from statsmodels.
https://bashtage.github.io/linearmodels/
University of Illinois/NCSA Open Source License

IV2SLS `first_stage` reports "wrong" first stage F-statistic #622

Open mlondschien opened 1 week ago

mlondschien commented 1 week ago

If there are multiple endogenous variables, IV2SLS.first_stage reports the F-statistics from regressing each component of the endogenous variables separately on the instruments (and controls). This is misleading: if the endogenous variables are correlated, the individual F-statistics can be large even though the causal parameter is not well identified.

See the following example:

In [1]: from linearmodels.iv import IV2SLS
   ...: import numpy as np
   ...: 
   ...: rng = np.random.default_rng(0)
   ...: 
   ...: n = 1000
   ...: 
   ...: Z = rng.normal(size=(n, 3))
   ...: 
   ...: H = rng.normal(size=(n, 3))  # confounder
   ...: X = Z @ np.ones((3, 2)) + H @ np.array([[1, 0], [0, -1], [0, 0]])
   ...: y = H @ np.array([1, 1, 0.1])  # beta = 0
   ...: 
   ...: tsls = IV2SLS(y, None, X, Z).fit(cov_type="unadjusted")
   ...: print(tsls.first_stage)
         First Stage Estimation Results         
================================================
                              endog.0    endog.1
------------------------------------------------
R-squared                      0.7256     0.7364
Partial R-squared              0.7256     0.7364
Shea's R-squared               0.0010     0.0010
Partial F-statistic            881.59     931.31
P-value (Partial F-stat)     1.11e-16   1.11e-16
Partial F-stat Distn         F(3,997)   F(3,997)
========================== ========== ==========
instruments.0                  1.0079     0.9792
                             (31.124)   (31.102)
instruments.1                  0.9454     0.9773
                             (28.305)   (30.095)
instruments.2                  0.9673     0.9655
                             (30.639)   (31.453)
------------------------------------------------

T-stats reported in parentheses
T-stats use same covariance type as original model

The individual F-statistics are large, suggesting that Wald-based confidence sets can be trusted. They cannot.

In [2]: tsls
Out[2]: 
                          IV-2SLS Estimation Summary                          
==============================================================================
Dep. Variable:              dependent   R-squared:                      0.9775
Estimator:                    IV-2SLS   Adj. R-squared:                 0.9775
No. Observations:                1000   F-statistic:                    29.778
Date:                Fri, Oct 04 2024   P-value (F-stat)                0.0000
Time:                        16:48:29   Distribution:                  chi2(2)
Cov. Estimator:            unadjusted                                         

                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
endog.0        0.8687     0.1592     5.4558     0.0000      0.5566      1.1807
endog.1       -0.8694     0.1593    -5.4568     0.0000     -1.1817     -0.5571
==============================================================================

Endogenous: endog.0, endog.1
Instruments: instruments.0, instruments.1, instruments.2
Unadjusted Covariance (Homoskedastic)
Debiased: False
IVResults, id: 0x15b26a510

Even though the true parameter is zero, the F-statistic is highly significant at ~30. So are the t-statistics.

In Testing for Weak Instruments in Linear IV Regression (2005), Stock and Yogo suggest using the Cragg and Donald statistic for reduced rank to test for identification. If $P_Z$ is the projection onto the column span of $Z$ and $M_Z$ the projection onto its orthogonal complement, the statistic is $n \cdot \lambda_{\min}\left( (X^T M_Z X)^{-1} X^T P_Z X \right)$. In the case of a single endogenous variable, this is the F-statistic. Otherwise, it takes the correlation between the columns of $\Pi$ in the first stage $X = Z \Pi + V$ into account. In Table 1, they report thresholds for the statistic, similar to the first-stage F-test heuristic based on Staiger and Stock (1997).
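
For concreteness, a minimal numpy sketch of the statistic exactly as written above (the function name cragg_donald_stat is mine, not part of linearmodels or ivmodels):

import numpy as np

def cragg_donald_stat(Z, X):
    """n * lambda_min((X' M_Z X)^{-1} X' P_Z X) for instruments Z and endogenous X."""
    n = Z.shape[0]
    X_proj = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]  # P_Z X
    X_resid = X - X_proj                               # M_Z X
    A = X_proj.T @ X_proj                              # X' P_Z X (P_Z is idempotent)
    B = X_resid.T @ X_resid                            # X' M_Z X (M_Z is idempotent)
    return n * np.linalg.eigvals(np.linalg.solve(B, A)).real.min()

Applied to the Z and X from In [1], this should come out small: both columns of Pi equal (1, 1, 1), so Pi has rank one and the minimum eigenvalue is close to zero, consistent with the rank_test output below.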

In the example above, the Cragg and Donald test statistic is very small, correctly suggesting that Wald-based inference cannot be trusted.

In [3]: from ivmodels.tests import rank_test
   ...:
   ...: statistic, p_value = rank_test(Z, X, fit_intercept=False)
   ...: print(f"{statistic=}, {p_value=}")
statistic=np.float64(0.8939161043879634), p_value=np.float64(0.6395707363012899)
bashtage commented 2 days ago

I agree that these statistics are only sufficient for identification in the case of a single variable. They are still necessary when you have multiple variables, so not useless.

I suppose the case of collinear fitted values of endogenous variables falls closer to the weak IV area, something that I haven't tried to include in this package. The challenge with incorporating the Stock and Yogo test is that it is difficult to use, since there are some key tuning parameters one has to choose when selecting the critical value.