FixedEffects / FixedEffectModels.jl

Fast Estimation of Linear Models with IV and High Dimensional Categorical Variables

Standard errors for IV with multiple endogenous and exogenous variables #150

Closed by eloualiche 3 years ago

eloualiche commented 3 years ago

I have encountered something puzzling while trying to estimate standard errors of a 2SLS regression.

I have included a reproducible example below.

using DataFrames, Random, CategoricalArrays, FixedEffectModels
Random.seed!(107)

N_id   = 2;
N_rows = 10;
df_example = DataFrame()
for i in 1:N_id
  obs = 1:N_rows
  id  = i .* Int.(ones(N_rows))          # group identifier
  Z   = rand(N_rows)                     # instrument
  X   = Z + 0.5 .* rand(N_rows)          # endogenous regressor
  Y   = 3 .* Z + 0.3 .* rand(N_rows)     # outcome
  df_tmp = DataFrame(obs = obs, id = id, Z = Z, X = X, Y = Y)
  global df_example = vcat(df_example, df_tmp)   # global so the loop also works outside the REPL
end

# spread X and Z into one column per id (X1, X2, Z1, Z2)
df_example1 = unstack(select(df_example, :obs, :id, :X), [:obs, :id, :X], :id, :X, renamecols=x->Symbol(:X, Int(x)))
df_example2 = unstack(select(df_example, :obs, :id, :Z), [:obs, :id, :Z], :id, :Z, renamecols=x->Symbol(:Z, Int(x)))
df_example = innerjoin(df_example, select(df_example1, :obs, :id, :X1, :X2), on = [:obs, :id])
df_example = innerjoin(df_example, select(df_example2, :obs, :id, :Z1, :Z2), on = [:obs, :id])

# replace the missings created by unstack with zeros
df_example[ismissing.(df_example.X1), :X1] .= 0.0;
df_example[ismissing.(df_example.X2), :X2] .= 0.0;
df_example[ismissing.(df_example.Z1), :Z1] .= 0.0;
df_example[ismissing.(df_example.Z2), :Z2] .= 0.0;
transform!(df_example, :id => categorical => :id)   # treat id as categorical

Our dataset looks like:

20×9 DataFrame
 Row │ obs    id    Z            X         Y         X1        X2        Z1        Z2
     │ Int64  Cat…  Float64      Float64   Float64   Float64?  Float64?  Float64?  Float64?
─────┼─────────────────────────────────────────────────────────────────────────────────────────
   1 │     1  1     0.392696     0.626766  1.47188   0.626766  0.0       0.392696  0.0
   2 │     2  1     0.760011     1.07027   2.48113   1.07027   0.0       0.760011  0.0
   3 │     3  1     0.923533     1.23964   2.86144   1.23964   0.0       0.923533  0.0
   4 │     4  1     0.490813     0.815537  1.70243   0.815537  0.0       0.490813  0.0
   5 │     5  1     0.617937     0.88543   1.85978   0.88543   0.0       0.617937  0.0
   6 │     6  1     0.556129     0.834239  1.78295   0.834239  0.0       0.556129  0.0
   7 │     7  1     0.710606     0.831722  2.38318   0.831722  0.0       0.710606  0.0
   8 │     8  1     0.902673     1.35225   2.89868   1.35225   0.0       0.902673  0.0
   9 │     9  1     0.322105     0.819778  1.04217   0.819778  0.0       0.322105  0.0
  10 │    10  1     0.313981     0.56277   1.11195   0.56277   0.0       0.313981  0.0
  11 │     1  2     0.943379     1.31866   3.04883   0.0       1.31866   0.0       0.943379
  12 │     2  2     0.663099     0.719519  2.16304   0.0       0.719519  0.0       0.663099
  13 │     3  2     0.298663     0.788106  0.934513  0.0       0.788106  0.0       0.298663
  14 │     4  2     0.766435     1.15822   2.37755   0.0       1.15822   0.0       0.766435
  15 │     5  2     0.182493     0.59385   0.779872  0.0       0.59385   0.0       0.182493
  16 │     6  2     0.877439     1.34755   2.86516   0.0       1.34755   0.0       0.877439
  17 │     7  2     0.0666847    0.482311  0.489531  0.0       0.482311  0.0       0.0666847
  18 │     8  2     0.250345     0.554539  0.881611  0.0       0.554539  0.0       0.250345
  19 │     9  2     0.529188     0.717749  1.81985   0.0       0.717749  0.0       0.529188
  20 │    10  2     0.000215092  0.481304  0.274814  0.0       0.481304  0.0       0.000215092

We are interested in a regression of Y on X1 and X2 with instruments Z1 and Z2. We also consider id fixed effects.
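Concretely, indexing observations by t (obs) and groups by i (id), the specification we have in mind is

Y_{i,t} = \alpha_i + \beta_1 X1_{i,t} + \beta_2 X2_{i,t} + \varepsilon_{i,t}

where X1 and X2 are instrumented by Z1 and Z2 and the \alpha_i are the id fixed effects; this is just the @formula calls below written out.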

For reference, we first run the regression without fixed effects, which is the only specification that seems to work:

julia> reg(df_example, @formula(Y ~ (X1 + X2 ~ Z1 + Z2) ) )
=====================================================================
Number of obs:                 20   Degrees of freedom:             3
R2:                         0.775   R2 Adjusted:                0.749
F Statistic:              37.4684   p-value:                    0.000
First Stage F-stat (KP):  45.6504   First Stage p-val (KP):     0.000
=====================================================================
             Estimate Std.Error  t value Pr(>|t|) Lower 95% Upper 95%
---------------------------------------------------------------------
X1            3.25907  0.381129   8.5511    0.000   2.45496   4.06318
X2            3.18105  0.400632  7.94008    0.000   2.33579   4.02631
(Intercept)  -1.00948  0.336545 -2.99952    0.008  -1.71952 -0.299427
=====================================================================

Then we run the regression with fixed effects, as recommended in the README:

julia> reg(df_example, @formula(Y ~ (X1 + X2 ~ Z1 + Z2) + fe(id) ) )
ERROR: SingularException(1)
Stacktrace:
 [1] ldiv!(D::Diagonal{Float64, Vector{Float64}}, B::Matrix{Float64})
   @ LinearAlgebra /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/diagonal.jl:624
 [2] \(D::Diagonal{Float64, Vector{Float64}}, A::Matrix{Float64})
   @ LinearAlgebra /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/diagonal.jl:631
 [3] \(A::Matrix{Float64}, B::Matrix{Float64})
   @ LinearAlgebra /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/generic.jl:1128
 [4] ranktest!(X::Matrix{Float64}, Z::Matrix{Float64}, Pi::Matrix{Float64}, vcov_method::Vcov.SimpleCovariance, df_small::Int64, df_absorb::Int64)
   @ Vcov ~/.julia/packages/Vcov/ysvs8/src/ranktest.jl:38
 [5] reg(df::Any, formula::FormulaTerm, vcov::CovarianceEstimator; contrasts::Dict, weights::Union{Nothing, Symbol}, save::Union{Bool, Symbol}, method::Symbol, nthreads::Integer, double_precision::Bool, tol::Real, maxiter::Integer, drop_singletons::Bool, progress_bar::Bool, dof_add::Integer, subset::Union{Nothing, AbstractVector{T} where T})
   @ FixedEffectModels ~/.julia/packages/FixedEffectModels/nmx9K/src/fit.jl:436
 [6] reg (repeats 2 times)
   @ ~/.julia/packages/FixedEffectModels/nmx9K/src/fit.jl:58 [inlined]
 [7] top-level scope
   @ REPL[375]:1

Last, we suspect it is related to the demeaning process, given that controlling for id as a plain categorical regressor (rather than absorbing it with fe) does not lead to an error. However, the first-stage F-stat does not get computed.

julia> reg(df_example, @formula(Y ~ (X1 + X2 ~ Z1 + Z2) + id ) )
                               IV Model
=======================================================================
Number of obs:                  20   Degrees of freedom:              4
R2:                          0.787   R2 Adjusted:                 0.747
F Statistic:               24.7636   p-value:                     0.000
First Stage F-stat (KP):       NaN   First Stage p-val (KP):        NaN
=======================================================================
              Estimate Std.Error   t value Pr(>|t|) Lower 95% Upper 95%
-----------------------------------------------------------------------
id: 2        -0.417303  0.722155 -0.577858    0.571   -1.9482    1.1136
X1             2.95074  0.641615   4.59892    0.000   1.59057    4.3109
X2             3.29365  0.470651   6.99808    0.000   2.29592   4.29139
(Intercept)  -0.707435  0.595924  -1.18712    0.253  -1.97074  0.555869
=======================================================================
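To spell out what we mean by the demeaning hypothesis, here is a minimal sketch that demeans every variable within id by hand and then runs the IV regression without fe(id). The _dm column names and the manual demeaning are only for illustration; we have not verified whether this reproduces the singularity.

using Statistics

# demean Y, X1, X2, Z1, Z2 within each id; columns ending in _dm are our own convention
for v in (:Y, :X1, :X2, :Z1, :Z2)
    transform!(groupby(df_example, :id), v => (x -> x .- mean(x)) => Symbol(v, :_dm))
end

# same IV regression as above, on the manually demeaned variables, without fe(id)
reg(df_example, @formula(Y_dm ~ (X1_dm + X2_dm ~ Z1_dm + Z2_dm)))

If this manual version also trips up the rank test, that would point to the demeaned instruments themselves rather than to fe().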

We are trying to figure out how to get the right first-stage F-stat when we have many endogenous regressors and just as many instruments. From the stack trace, the issue might be in Vcov.jl.

Digging further, we also find that Stata's ivreg2 runs into similar trouble computing these statistics.
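For what it is worth, the SingularException above is the generic failure mode of left-dividing by a Diagonal matrix that contains a zero entry, which is presumably what happens inside ranktest! here. A toy illustration, unrelated to our data:

using LinearAlgebra

D = Diagonal([0.0, 1.0])   # a zero on the diagonal, e.g. a zero singular value
B = ones(2, 2)
D \ B                      # throws SingularException, as in the stack trace above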

matthieugomez commented 3 years ago

Is it solved now?

eloualiche commented 3 years ago

Solved.

See https://github.com/matthieugomez/Vcov.jl/pull/5 and https://github.com/matthieugomez/Vcov.jl/issues/4