With large datasets and multiple fixed effects, and under the default tolerance setting of tol = 1e-6, regressors that are collinear with the fixed effects may not be omitted even though they clearly should be.
In Stata's reghdfe, these regressors are dropped because of an additional check that compares the sum of squares of each variable before and after partialling out the fixed effects (for residualized collinear variables, the sum of squares is very close to zero).
Here is a minimal working example using the Cigar.csv data in the repo, which has to be tweaked a bit (enlarged and given a second categorical variable) to trigger the problem.
using DataFrames, CSV, FixedEffectModels, Random, StatsBase
# read the data
df = DataFrame(CSV.File(joinpath(dirname(pathof(FixedEffectModels)), "../dataset/Cigar.csv")))
# create a bigger dataset
sort!(df, [:State, :Year])
nstates = maximum(df.State)
dflarge = copy(df)
for i in 1:100
    dfnew = copy(df)
    # shift the state ids so that every copy adds a new set of states
    dfnew.State .+= i .* nstates
    append!(dflarge, dfnew)
end
# create a dummy variable that is collinear with the State-FE
dflarge.highstate = dflarge.State .< median(dflarge.State)
# create a second 'high-dimensional' categorical variable
Random.seed!(1234)
dflarge.catvar = rand(1:200, nrow(dflarge))
# run the regression with the default setting (tol = 1e-6)
reg(dflarge, @formula(Price ~ highstate + Pop + fe(Year) + fe(catvar) + fe(State)), Vcov.cluster(:State); tol=1e-6)
# run the regression with a lower tolerance (tol = 1e-8)
reg(dflarge, @formula(Price ~ highstate + Pop + fe(Year) + fe(catvar) + fe(State)), Vcov.cluster(:State); tol=1e-8)
Running the regression with the default settings, highstate is not recognized as collinear:
Fixed Effect Model
============================================================================
Number of obs: 139380 Degrees of freedom: 1
R2: 0.988 R2 Adjusted: 0.988
F-Stat: 549.718 p-value: 0.000
R2 within: 0.026 Iterations: 7
============================================================================
Price | Estimate Std.Error t value Pr(>|t|) Lower 95% Upper 95%
----------------------------------------------------------------------------
highstate | 0.445994 21244.6 2.09933e-5 1.000 -41649.1 41650.0
Pop | 0.00102457 3.09008e-5 33.1569 0.000 0.000963994 0.00108515
============================================================================
Reducing the tolerance to 1e-8 'fixes' the issue because the function FixedEffectModels.invsym! essentially uses sqrt(eps()) as the threshold for detecting variables with very small sums of squares, i.e. collinear ones. The more precise the partialling-out, the more likely this function is to detect the collinearity:
Fixed Effect Model
=========================================================================
Number of obs: 139380 Degrees of freedom: 1
R2: 0.988 R2 Adjusted: 0.988
F-Stat: 1101.09 p-value: 0.000
R2 within: 0.026 Iterations: 9
=========================================================================
Price | Estimate Std.Error t value Pr(>|t|) Lower 95% Upper 95%
-------------------------------------------------------------------------
highstate | 0.0 NaN NaN NaN NaN NaN
Pop | 0.00102414 3.08638e-5 33.1827 0.000 0.000963636 0.00108465
=========================================================================
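A rough way to see why the tolerance matters for this threshold is the back-of-the-envelope sketch below. It is not the library's actual code path. It assumes, purely for illustration, that an iterative partialling-out with tolerance tol leaves an error of order tol in each observation of a truly collinear regressor; the helper name leftover_ss is made up for this example.

```julia
# Back-of-the-envelope sketch, not what FixedEffectModels does internally.
# Assumption: each residualized observation of a collinear regressor is off
# from zero by roughly `tol`.
n = 139_380                   # number of observations in the example above
threshold = sqrt(eps())       # the criterion mentioned above, about 1.49e-8

# Hypothetical helper: leftover sum of squares under the assumption above
leftover_ss(tol) = n * tol^2

# tol = 1e-6: leftover is about 1.4e-7, above the threshold, so the
# variable does not look like a zero column
flagged_default = leftover_ss(1e-6) < threshold

# tol = 1e-8: leftover is about 1.4e-11, below the threshold
flagged_tight = leftover_ss(1e-8) < threshold

println((flagged_default, flagged_tight))
```

Under this assumption the outcome depends on both tol and the number of observations, which is consistent with the problem only showing up in large datasets.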
I do not think that simply changing the default tolerance solves this issue. I will shortly submit a PR that implements the procedure of Stata's reghdfe, which is to drop variables for which the ratio of the sum of squares after residualizing to the sum of squares before residualizing is smaller than min(1e-6, tol / 10).
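The proposed check could look roughly like the sketch below. It is only an illustration of the ratio criterion described above, not the code of the PR: the names is_collinear_with_fe and demean_by_group are hypothetical, and the partialling-out is replaced by exact demeaning over a single group variable so that the example needs no packages.

```julia
# Hypothetical helper (not part of FixedEffectModels): flag a regressor as
# collinear with the fixed effects when residualizing removes almost all of
# its sum of squares.
function is_collinear_with_fe(x_before::AbstractVector, x_after::AbstractVector;
                              tol::Real = 1e-6)
    ss_before = sum(abs2, x_before)
    ss_after = sum(abs2, x_after)
    # An all-zero variable has no variation left to explain anything
    ss_before == 0 && return true
    return ss_after / ss_before < min(1e-6, tol / 10)
end

# Toy stand-in for partialling out one fixed effect: subtract group means.
function demean_by_group(x::AbstractVector{<:Real}, g::AbstractVector{<:Integer})
    sums = Dict{Int, Float64}()
    counts = Dict{Int, Int}()
    for (xi, gi) in zip(x, g)
        sums[gi] = get(sums, gi, 0.0) + xi
        counts[gi] = get(counts, gi, 0) + 1
    end
    return [xi - sums[gi] / counts[gi] for (xi, gi) in zip(x, g)]
end

state = repeat(1:10, inner = 30)        # 10 groups with 30 observations each
highstate = Float64.(state .< 5.5)      # constant within each state
pop = Float64.(1:length(state))         # varies within each state

highstate_flag = is_collinear_with_fe(highstate, demean_by_group(highstate, state))
pop_flag = is_collinear_with_fe(pop, demean_by_group(pop, state))
println((highstate_flag, pop_flag))
```

Because the criterion is a ratio, it does not depend on the scale of the variable, and the tol / 10 term ties the threshold to the precision of the partialling-out.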