FixedEffects / FixedEffectModels.jl

Fast Estimation of Linear Models with IV and High Dimensional Categorical Variables

Drop regressors that are collinear with the fixed effects (depending on tolerance for partialling-out) #221

Closed moritzdrechselgrau closed 1 year ago

moritzdrechselgrau commented 1 year ago

With large datasets, multiple fixed effects, and the default tolerance setting of tol = 1e-6, regressors that are collinear with the fixed effects may not be omitted even though they clearly should be.

In Stata's reghdfe, these regressors are dropped because of an additional check that compares the sum of squares of each variable before and after partialling out the fixed effects (for residualized collinear variables, the sum of squares is very close to zero).
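To illustrate the idea behind that check, here is a minimal sketch with a single fixed effect. The helper `demean_by` and the toy data are made up for illustration; this is not the code of reghdfe or FixedEffectModels.jl, both of which handle several fixed effects iteratively.

```julia
using Statistics

# Hypothetical helper: partial out one fixed effect by subtracting group means.
function demean_by(x::AbstractVector{<:Real}, g::AbstractVector)
    out = float.(x)                      # work on a floating-point copy
    for level in unique(g)
        idx = findall(==(level), g)
        out[idx] .-= mean(out[idx])
    end
    return out
end

state = repeat(1:4, inner = 5)           # 4 states, 5 observations each
highstate = Float64.(state .> 2)         # constant within state: collinear with the state FE
price = Float64.(1:20)                   # varies within state

# Share of the sum of squares that survives partialling out the state FE.
ratio(x) = sum(abs2, demean_by(x, state)) / sum(abs2, x)

ratio_highstate = ratio(highstate)       # numerically zero
ratio_price = ratio(price)               # clearly positive
```

With several fixed effects the partialling-out is only approximate, so the ratio for a collinear variable ends up small rather than exactly zero, which is why a threshold is needed.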

Here is a minimal working example using the Cigar.csv data in the repo. The dataset has to be enlarged and extended with two extra variables for the problem to show up.

using DataFrames, CSV, FixedEffectModels, Random, StatsBase

# read the data
df = DataFrame(CSV.File(joinpath(dirname(pathof(FixedEffectModels)), "../dataset/Cigar.csv")))

# create a bigger dataset
sort!(df, [:State, :Year])
nstates = maximum(df.State)
dflarge = copy(df)
for i in 1:100
    dfnew = copy(df)
    dfnew.State .+= i .* nstates
    append!(dflarge, dfnew)
end

# create a dummy variable that is collinear with the State-FE
dflarge.highstate = dflarge.State .< median(dflarge.State)

# create a second 'high-dimensional' categorical variable
Random.seed!(1234)
dflarge.catvar = rand(1:200, nrow(dflarge))

# run the regression with the default setting (tol = 1e-6)
reg(dflarge, @formula(Price ~ highstate + Pop + fe(Year) + fe(catvar) + fe(State)), Vcov.cluster(:State); tol=1e-6)

# run the regression with a lower tolerance (tol = 1e-8)
reg(dflarge, @formula(Price ~ highstate + Pop + fe(Year) + fe(catvar) + fe(State)), Vcov.cluster(:State); tol=1e-8)

Running the regression with the default settings, highstate is not recognized as collinear:

                             Fixed Effect Model
============================================================================
Number of obs:                 139380  Degrees of freedom:                 1
R2:                             0.988  R2 Adjusted:                    0.988
F-Stat:                       549.718  p-value:                        0.000
R2 within:                      0.026  Iterations:                         7
============================================================================
Price     |   Estimate  Std.Error    t value Pr(>|t|)   Lower 95%  Upper 95%
----------------------------------------------------------------------------
highstate |   0.445994    21244.6 2.09933e-5    1.000    -41649.1    41650.0
Pop       | 0.00102457 3.09008e-5    33.1569    0.000 0.000963994 0.00108515
============================================================================

Reducing the tolerance 'fixes' the issue (see the tol = 1e-8 output below) because the function FixedEffectModels.invsym! essentially uses sqrt(eps()) as the tolerance criterion for variables with very small sums of squares, i.e. collinear ones. The more precise the partialling-out, the more likely this function is to detect the collinearity.

                           Fixed Effect Model
=========================================================================
Number of obs:               139380   Degrees of freedom:               1
R2:                           0.988   R2 Adjusted:                  0.988
F-Stat:                     1101.09   p-value:                      0.000
R2 within:                    0.026   Iterations:                       9
=========================================================================
Price     |   Estimate  Std.Error t value Pr(>|t|)   Lower 95%  Upper 95%
-------------------------------------------------------------------------
highstate |        0.0        NaN     NaN      NaN         NaN        NaN
Pop       | 0.00102414 3.08638e-5 33.1827    0.000 0.000963636 0.00108465
=========================================================================

I do not think that simply changing the default tolerance solves this issue. I will shortly submit a PR that implements the procedure of Stata's reghdfe, which is to drop variables for which the sum of squares after residualizing divided by the sum of squares before residualizing is smaller than min(1e-6, tol / 10).
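As a rough sketch of that rule (not the actual PR code), the check could look like the following. The function name `collinear_with_fe` and its arguments are hypothetical; it assumes the per-variable sums of squares before and after residualizing have already been computed.

```julia
# Hypothetical helper, not part of FixedEffectModels.jl.
# Flags variables whose sum of squares (almost) vanishes after partialling out
# the fixed effects, using the threshold min(1e-6, tol / 10) described above.
function collinear_with_fe(ss_before::AbstractVector{<:Real},
                           ss_after::AbstractVector{<:Real};
                           tol::Real = 1e-6)
    length(ss_before) == length(ss_after) ||
        throw(ArgumentError("ss_before and ss_after must have the same length"))
    threshold = min(1e-6, tol / 10)
    # Note: all-zero columns (ss_before == 0) are not treated specially in this sketch.
    return [after / before < threshold for (before, after) in zip(ss_before, ss_after)]
end

# Made-up numbers: the first variable loses almost all of its variation,
# the second keeps roughly half of it.
ss_before = [100.0, 250.0]
ss_after = [1e-7, 120.0]
flags = collinear_with_fe(ss_before, ss_after; tol = 1e-6)
```

Because the threshold scales with tol, the check stays meaningful when the user loosens or tightens the convergence tolerance of the partialling-out.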