FixedEffects / FixedEffectModels.jl

Fast Estimation of Linear Models with IV and High Dimensional Categorical Variables

Add predict for FixedEffectModel #171

Closed nilshg closed 3 years ago

nilshg commented 3 years ago

Here's a first stab at an implementation for predict for FixedEffectModels. Essentially this leftjoins the fixed effects onto the relevant columns of the data passed to predict, and then sums them to create a vector that is added to the "regular" prediction obtained by multiplying the non-FE columns with their respective coefficients.

I have checked that this works on the original data, as well as a new data set with the same levels. It also provides comparable predictions to a predict call on the model estimated without marking the categorical variables out as fe()s. When a new data set with missing observations is passed the code errors, which appears consistent with what currently happens for predict with a non-FE model.

One difference in behaviour concerns new levels in the fixed effects. For a non-FE model, predict currently errors. With this PR, for a model that has_fe, predictions are returned, with missing in any row where a fixed effect takes a level that was not present in the original data (this is an artefact of leftjoin producing missing in that case).
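
As a rough illustration of the leftjoin behaviour described above (this is not the PR's actual code; fe_table, newdata and the column names are made up for the example), a new level ends up with a missing fixed effect:

```julia
using DataFrames

# Hypothetical estimated fixed effects, one row per level seen in the original data.
fe_table = DataFrame(id = ["a", "b"], fe_id = [1.0, -0.5])

# New data passed to predict; level "c" was not in the original data.
newdata = DataFrame(id = ["a", "b", "c"])

# leftjoin keeps every row of newdata and fills unmatched rows with missing.
joined = leftjoin(newdata, fe_table, on = :id)

# Look values up by key rather than by position, so the sketch does not
# depend on the row order of the join result.
fe_by_id = Dict(zip(joined.id, joined.fe_id))
```

Here fe_by_id["c"] is missing, so any prediction that adds it to the linear part is missing as well.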

Happy to discuss whether this gives a reasonable user experience. Two things I haven't thought about here:

matthieugomez commented 3 years ago

Sorry, there was an issue merging. I'd be happy to add something like this, but you'd need to make sure you handle missings (missings in FEs vs. missings in other variables) and add tests.

Also, I think it'd be better to do something like

df isa AbstractDataFrame || throw("...")
sum(Matrix(leftjoin(select(df, x.fekeys), unique(x.fe), on = x.fekeys, makeunique = true)), dims = 2)

(as well as avoiding creating a vector if there are no fixed effects)

nilshg commented 3 years ago

Happy to try and add some tests in the next few days. Not sure I understand your point about missings: would you expect different behaviour for missing FEs vs. other covariates? Naively I would have thought that the prediction is ŷ = f̂e + β̂₁x₁ + β̂₂x₂ + ..., which gives missing if either the fixed effect is missing (i.e. the predict df has a level that wasn't present in the original df) or any of the xs is missing. Would you be looking for some other behaviour, e.g. setting the FE to 0 or the grand mean or something?
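
To make the formula concrete, here is a minimal sketch in plain Julia. The values in fe_hat and beta and the helper predict_row are hypothetical and not part of the package; the point is only that arithmetic with missing propagates it, so a missing fixed effect or a missing covariate both give a missing prediction.

```julia
# Hypothetical fitted values, for illustration only.
fe_hat = Dict("a" => 1.0, "b" => -0.5)   # estimated fixed effect per level
beta = [2.0, 3.0]                        # coefficients on x1 and x2

# Prediction for one row: fixed effect plus linear part.
# get(..., missing) returns missing for a level not seen during estimation,
# and any arithmetic involving missing yields missing.
function predict_row(level, x)
    fe = ismissing(level) ? missing : get(fe_hat, level, missing)
    return fe + sum(beta .* x)
end

y_known  = predict_row("a", [1.0, 1.0])       # all inputs present
y_newlvl = predict_row("c", [1.0, 1.0])       # level not seen in estimation
y_missx  = predict_row("a", [1.0, missing])   # missing covariate
```

Only y_known is a number; the other two are missing.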

matthieugomez commented 3 years ago

What you're saying is correct. Just check that it gives missing if any of the covariates or fixed effects is missing.