Open jariji opened 1 year ago
Most software (at least that I've used) drops missing observations by default when fitting a regression model. However, those that do typically report how many observations were dropped, which I think is a critical piece of information that's missing from the output printed by StatsModels and other packages. IMO, the default display should include both the number of missing observations that were dropped and the number of observations that were actually used for modeling. Would adding that sufficiently address your concern here?
> Would adding that sufficiently address your concern here?
Without taking a position just yet, I'll expand on a few options I can see so far.
> Most software (at least that I've used) drops missing observations by default when fitting a regression model.
That has been my experience too, though not universally. Pandas and SQL both drop missings for summaries like `mean`, which I don't like. R propagates missings for `mean` but ignores them for `lm`. I'm not sure what Stata and SAS do.
If this approach is taken, including #missing in the output would be useful.
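As a rough illustration of the counts such output could report, here is a minimal sketch using the example data from this thread. It is not GLM's actual display code; the names `n_used` and `n_dropped` are made up for the example.

```julia
using DataFrames

d = DataFrame(x=[1, 2, 3, missing], y=[10, 20, 31, 41])

# Rows with no missing value in the model variables
complete = completecases(d, [:x, :y])

n_used = count(complete)        # observations that would enter the fit
n_dropped = nrow(d) - n_used    # observations dropped because of missing values

println("Observations used: ", n_used, " (", n_dropped, " dropped due to missing values)")
```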
The popular Statistical Rethinking textbook reassures the user of its associated `rethinking` R package that "you can rest assured that `quap`, unlike reckless functions like `lm`, would never silently drop cases", taking a strong stance on that issue.
To manually drop, instead of

```julia
using DataFrames, GLM
d = DataFrame(x=[1,2,3,missing], y=[10,20,31, 41]);
lm(@formula(y~x), d)
```

I would use

```julia
using DataFrames, GLM, StatsModels
d = DataFrame(x=[1,2,3,missing], y=[10,20,31, 41])
f = @formula(y~x)
lm(f, dropmissing(d, StatsModels.termvars(f)))
```

which is admittedly a bit less convenient since it requires `StatsModels` and needs an extra line for binding `f`. The issue of teaching users to do this could be addressed with a simple error message that demonstrates how to do it.
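To make the error-message idea concrete, here is a minimal sketch of a wrapper that refuses to fit when the model variables contain missings and tells the user how to drop them. `strict_lm` is a hypothetical name used only for illustration; nothing like it exists in GLM, and the message wording is just one possibility.

```julia
using DataFrames, GLM, StatsModels

# Hypothetical helper, not part of GLM: error instead of silently dropping rows.
function strict_lm(f::FormulaTerm, d::AbstractDataFrame)
    vars = StatsModels.termvars(f)                   # variables used by the formula
    n_incomplete = count(!, completecases(d, vars))  # rows with at least one missing
    if n_incomplete > 0
        throw(ArgumentError(
            "$n_incomplete row(s) have missing values in $(vars). " *
            "Drop them explicitly with `dropmissing(d, StatsModels.termvars(f))` before fitting."))
    end
    return lm(f, d)
end

d = DataFrame(x=[1, 2, 3, missing], y=[10, 20, 31, 41])
f = @formula(y ~ x)

# strict_lm(f, d) would throw the ArgumentError above; dropping explicitly works.
model = strict_lm(f, dropmissing(d, StatsModels.termvars(f)))
```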
`DataFrames.unstack` uses `allowmissing::Bool=false`, which is another approach to balancing convenience and correctness. Note that the default is `false`, the safe option.
Maybe there is a way to wrap `lm`, or something like `passmissing(lm)(f,d)`. `passmissing` doesn't have exactly the right meaning but maybe there's another function that would.
> I'm not sure what Stata and SAS do.
SAS drops observations with missing values and reports the number dropped. I've never used Stata but a quick google suggests it drops automatically, though I don't know whether it reflects that in the results it displays.
It's true that skipping missing values when fitting models isn't consistent with how we handle missing values elsewhere in the ecosystem. This is because GLM was written before missing values support was added to Julia, and it's a legacy from R.
As @ararslan suggested, I think it would be good to print the number of observations dropped due to the presence of missing values. https://github.com/JuliaStats/GLM.jl/pull/339 should make this possible.
We could take a stricter stance and throw an error in the presence of missing values, but that would be breaking so I'm not sure it's worth it, and we would have to go through a deprecation period where a warning would be thrown anyway. I think the API would have to be `lm(..., skipmissing=true)` for consistency with `skipmissing`, and with the similar argument in DataFrames's `groupby` and StatsBase's `pairwise`.
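For discussion, the stricter variant could behave roughly like the sketch below. This is not GLM's API: `lm_checked` is a hypothetical wrapper, and defaulting the `skipmissing` keyword to `false` is an assumption made here to show the strict behaviour.

```julia
using DataFrames, GLM, StatsModels

# Hypothetical wrapper, not GLM's API: dropping rows is opt-in via `skipmissing`.
function lm_checked(f::FormulaTerm, d::AbstractDataFrame; skipmissing::Bool=false)
    vars = StatsModels.termvars(f)
    complete = completecases(d, vars)
    if !all(complete)
        skipmissing || throw(ArgumentError(
            "missing values found in $(vars); pass `skipmissing=true` to drop them"))
        # Report how many rows are dropped and how many are kept
        @info "Dropping rows with missing values" dropped=count(!, complete) used=count(complete)
        d = dropmissing(d, vars)
    end
    return lm(f, d)
end

d = DataFrame(x=[1, 2, 3, missing], y=[10, 20, 31, 41])
f = @formula(y ~ x)
model = lm_checked(f, d; skipmissing=true)   # fits on the 3 complete rows
```

During a deprecation period the `throw` could be replaced by a warning, as described above.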
The first option isn't contradictory with the second one so we could start with it anyway.
An argument for not automatically dropping missings can be made when the model is weighted. As of now `lm(@formula(y~x), data=df, weights=aweights(df.w))` with missing values in either `y`, `x`, or `w` will throw an error.
One of the things I like most about Julia is that it propagates missing values, encouraging me to think critically about how I handle them in my data. For instance, `sum([1,2,missing])` evaluates to `missing`, not `3`, which tells me I need to be careful and think about why there are missing values and how I should handle them. I might want to drop them, or impute values, or realize that my data cleaning functions are broken and I need to fix them before modeling.

In the case of GLM, missing values are dropped. I would rather the result be `missing`, as it creates a summary of the data just like `sum`. Then I won't have a false impression that I'm using complete data and I'll think more about the meaning of my operations.