FixedEffects / FixedEffectModels.jl

Fast Estimation of Linear Models with IV and High Dimensional Categorical Variables

Behavior of DummyCoding in an interaction term #197

Closed caibengbu closed 2 years ago

caibengbu commented 2 years ago

Hello all,

I found DummyCoding very useful in practice, but when it is used in an interaction term, it is automatically converted to FullDummyCoding.

Here is a minimal example:

using DataFrames
using FixedEffectModels
using Random
Random.seed!(1234)

# build an example dataset
N = 10 # 10 individuals
T = 3 # 3 periods
event_time = 2 # make the second period be the event time
id = repeat(1:N, inner=T) # generate id
is_treated = id .< N/2 # individuals with id < N/2 are treated; the rest are controls
time = repeat(1:T, outer=N) .- event_time # generate time
treatment = repeat(rand(N), inner=T) .* is_treated # generate treatment, 0 if obs are controls
outcome = treatment .* (time .> 0) + id + time .+ rand(N*T) # generate outcome; treatment only has an effect after the event time
# the raw values of id and time act as the fixed effects, plus random noise.
df = DataFrame(id = id, time = time, outcome = outcome, treatment = treatment)

# run regression
res = reg(df, @formula(outcome ~ time&treatment + fe(id) + fe(time)); contrasts = Dict(:time => DummyCoding(base=0)))

The result returned is

                               Fixed Effect Model                               
=================================================================================
Number of obs:                       30   Degrees of freedom:                  15
R2:                               0.996   R2 Adjusted:                      0.993
F-Stat:                        0.602379   p-value:                          0.451
R2 within:                        0.074   Iterations:                           2
=================================================================================
outcome              |  Estimate Std.Error   t value Pr(>|t|) Lower 95% Upper 95%
---------------------------------------------------------------------------------
time: -1 & treatment |  0.247485  0.381355  0.648961    0.526 -0.565354   1.06032
time: 0 & treatment  | -0.168612  0.381355 -0.442139    0.665  -0.98145  0.644227
time: 1 & treatment  |       0.0       NaN       NaN      NaN       NaN       NaN
=================================================================================

whereas I expected time: 0 & treatment, the base level, to be the one dropped.

PS: I found a quick fix to this issue, which is to rewrite line 308 as StatsModels.collect_matrix_terms(apply_schema(t.rhs, schema.schema, StatisticalModel))).

https://github.com/FixedEffects/FixedEffectModels.jl/blob/4c1ff7ff9ca9ceba10973476866b1e652d2f0cd2/src/FixedEffectModel.jl#L302-L309

But I am not sure if this will introduce other errors/inconveniences.

matthieugomez commented 2 years ago

Just looking at the formula, there is no way to know that one time category will have to be dropped, since treatment does not appear in the formula on its own. It is therefore a problem of collinearity, and I think that, in this case, it's fine for the dropped coefficient to be chosen arbitrarily.
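
The collinearity can be checked by hand. The following is a minimal sketch on made-up data (the sizes and treatment values are hypothetical, not taken from the example above): the three time-by-treatment columns sum to treatment, which is constant within each individual and therefore lies in the span of the id dummies.

```julia
using LinearAlgebra

# Hypothetical small panel: 4 individuals, 3 periods.
N, T = 4, 3
id = repeat(1:N, inner = T)
time = repeat(1:T, outer = N)
treatment = repeat([0.3, 0.7, 0.0, 0.0], inner = T)  # constant within id

# One dummy column per individual (what fe(id) absorbs).
D_id = Float64.(id .== (1:N)')

# One column per time level, each multiplied by treatment (full dummy coding).
X_int = Float64.(time .== (1:T)') .* treatment

X = hcat(D_id, X_int)

# The interaction columns add up to treatment itself.
sum_matches = vec(sum(X_int, dims = 2)) ≈ treatment

println("columns: ", size(X, 2), ", rank: ", rank(X))
println("interaction columns sum to treatment: ", sum_matches)
```

The rank falls one short of the number of columns, so one interaction coefficient cannot be identified and some column has to be dropped.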

Maybe this gives something closer to what you want:

res = reg(df, @formula(outcome ~ time&treatment + treatment + fe(id) + fe(time)); contrasts = Dict(:time => DummyCoding(base=0)))

caibengbu commented 2 years ago

Great! The solution works well.

I think this is an issue with how categorical variables are handled in formulas in general (I just learned from here that the automatic promotion is actually designed that way), rather than an issue with FixedEffectModels.jl. My initial point was that it would be useful to leave the decision to promote to FullDummyCoding to users.
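
The promotion can be reproduced with StatsModels alone, outside FixedEffectModels.jl. This is a minimal sketch, not the package's internal code path: the toy table and the names y, t, x, and rhs_names are made up for illustration, and it assumes the plain StatisticalModel context applies the same rule as the regression above.

```julia
using StatsModels

# Hypothetical toy data: t plays the role of time, x the role of treatment.
data = (y = collect(1.0:6.0),
        t = repeat([-1, 0, 1], 2),
        x = [0.5, 0.5, 0.5, 0.0, 0.0, 0.0])
hints = Dict(:t => DummyCoding(base = 0))

# Helper (hypothetical name): coefficient names after applying the schema.
function rhs_names(f)
    sch = schema(f, data, hints)
    applied = apply_schema(f, sch, StatsModels.StatisticalModel)
    return StatsModels.coefnames(applied.rhs)
end

names_without = rhs_names(@formula(y ~ t & x))      # x has no main effect
names_with    = rhs_names(@formula(y ~ x + t & x))  # x also enters on its own

println(names_without)
println(names_with)
```

If the rule works as described in this thread, the first list keeps a column for every level of t, including the base level 0, while the second omits the base level.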

Thank you for your reply.

matthieugomez commented 2 years ago

Can I close this?

caibengbu commented 2 years ago

Sure! Thank you again for the advice.