Propagation of missing values

jariji commented 1 year ago

One of the things I like most about Julia is that it propagates missing values, encouraging me to think critically about how I handle them in my data. For instance, sum([1,2,missing]) evaluates to missing, not 3, which tells me I need to be careful and think about why there are missing values and how I should handle them. I might want to drop them, or impute values, or realize that my data cleaning functions are broken and I need to fix them before modeling.

In the case of GLM, missing values are dropped. I would rather the result be missing, as it creates a summary of the data just like sum. Then I won't have a false impression that I'm using complete data and I'll think more about the meaning of my operations.

julia> lm(@formula(y~x), (;x=[1,2,3,missing], y=[10,20,31, 41]))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

y ~ 1 + x

Coefficients:
─────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)  -0.666667    0.62361   -1.07    0.4788   -8.59038    7.25704
x            10.5         0.288675  36.37    0.0175    6.83203   14.168
─────────────────────────────────────────────────────────────────────────

ararslan commented 1 year ago

Most software (at least that I've used) drops missing observations by default when fitting a regression model. However, those that do typically report how many observations were dropped, which I think is a critical piece of information that's missing from the output printed by StatsModels and other packages. IMO, the default display should include both the number of missing observations that were dropped and the number of observations that were actually used for modeling. Would adding that sufficiently address your concern here?

jariji commented 1 year ago

Would adding that sufficiently address your concern here?

Without taking a position just yet, I'll expand on a few options I can see so far.

Automatic dropping

Most software (at least that I've used) drops missing observations by default when fitting a regression model.

That has been my experience too, though not universally. Pandas and SQL both drop missings for summaries like mean, which I don't like. R propagates missings for mean but ignores them for lm. I'm not sure what Stata and SAS do.

If this approach is taken, including #missing in the output would be useful.

Manual dropping

The popular Statistical Rethinking textbook reassures the user of its associated rethinking R package that

you can rest assured that quap, unlike reckless functions like lm, would never silently drop cases

taking a strong stance on that issue.

To manually drop, instead of

using DataFrames, GLM
d = DataFrame(x=[1,2,3,missing], y=[10,20,31, 41]);
lm(@formula(y~x), d)

I would use

using DataFrames, GLM, StatsModels
d = DataFrame(x=[1,2,3,missing], y=[10,20,31, 41])
f = @formula(y~x)
lm(f, dropmissing(d, StatsModels.termvars(f)))

which is admittedly a bit less convenient since it requires StatsModels and needs an extra line for binding f. The issue of teaching users to do this could be addressed with a simple error message that demonstrates how to do it.

A boolean flag argument

DataFrames.unstack uses allowmissing::Bool=false, which is another approach to balancing convenience and correctness. Note that the default is false, the safe option.

A functional declarative approach?

Maybe there is a way to wrap lm or something like passmissing(lm)(f,d). passmissing doesn't have exactly the right meaning but maybe there's another function that would.

ararslan commented 1 year ago

I'm not sure what Stata and SAS do.

SAS drops observations with missing values and reports the number dropped. I've never used Stata but a quick google suggests it drops automatically, though I don't know whether it reflects that in the results it displays.

nalimilan commented 1 year ago

It's true that skipping missing values when fitting models isn't consistent with how we handle missing values elsewhere in the ecosystem. This is because GLM was written before missing values support was added to Julia, and it's a legacy from R.

As @ararslan suggested, I think it would be good to print the number of observations dropped due to the presence of missing values. https://github.com/JuliaStats/GLM.jl/pull/339 should make this possible.

We could take a stricter stance and throw an error in the presence of missing values, but that would be breaking so I'm not sure it's worth it, and we would have to go through a deprecation period where a warning would be thrown anyway. I think the API would have to be lm(..., skipmissing=true) for consistency with skipmissing, and with the similar argument in DataFrames's groupby and StatsBase's pairwise.

The first option isn't contradictory with the second one so we could start with it anyway.

gragusa commented 1 year ago

An argument for not automatically dropping missings can be made when the model is weighted. As of now

lm(@formula(y~x), data=df, weights=aweights(df.w))

with missing values in either y, x, or w will throw an error.

JuliaStats / GLM.jl