JuliaStats / GLM.jl

Generalized linear models in Julia
Other
584 stars 114 forks source link

Propagate missing values from data when returning vectors #546

Open nalimilan opened 10 months ago

nalimilan commented 10 months ago

When a model was fitted on data containing missing values which have been dropped, predict(model), residuals(model), fitted(model) and response(model) should return a vector of the same length as the data with missing for corresponding entries. It would make sense to change this before releasing GLM 2.0, even if a more general fix is implemented later in StatsModels (https://github.com/JuliaStats/StatsModels.jl/issues/202).

See https://github.com/JuliaStats/MixedModels.jl/issues/650 for a similar issue in MixedModels.

ararslan commented 10 months ago

Would we do the same for modelmatrix? If not, the dimensions of response and modelmatrix would disagree, which seems pretty awful. If so, crossmodelmatrix is modelmatrix(model)' * modelmatrix(model), which becomes pretty useless if modelmatrix includes missing values...

ararslan commented 10 months ago

This also means that response(model) et al. won't have length nobs(model) (for unweighted models).

bkamins commented 10 months ago

A related issue (if a separate issue should be opened for this please let me know, but I think this should be revolved along with the original issue with a consistent approach):

julia> lm(@formula(x1~lag(x2)), df)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Float64

(the issue is that, if I understand things correctly, lagging is applied after dropping of rows with missing is performed)

More generally e.g.:

julia> f(x) = x > 0.5 ? 1 : missing
f (generic function with 1 method)

julia> lm(@formula(x1~f(x2)), df)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Float64
nalimilan commented 10 months ago

I think that's https://github.com/JuliaStats/StatsModels.jl/issues/202. It's a bit harder to fix.

nalimilan commented 10 months ago

Would we do the same for modelmatrix? If not, the dimensions of response and modelmatrix would disagree, which seems pretty awful. If so, crossmodelmatrix is modelmatrix(model)' * modelmatrix(model), which becomes pretty useless if modelmatrix includes missing values...

Good point. We could say that functions related to the fitting procedure, like modelmatrix and crossmodelmatrix never include missing values. But I admit that's a bit ad hoc, but that's what R does when using na.exclude (i.e. only pad residuals and predicted values with missing values).

bkamins commented 10 months ago

It's a bit harder to fix.

I agree that it is probably harder, but my point was that we should keep also this case in mind when developing a solution (as it might be affected by that requirement).

ararslan commented 10 months ago

functions related to the fitting procedure

That arguably includes response, residuals, and fitted... probably everything but predict. "Related to the fitting procedure" may also depend on the kind of model and wouldn't necessarily be the same between GLM and another statistical modeling package.

IMO the current behavior that these functions return the objects/quantities as actually used by the model is the right one, and the case of including any missing values that were present in the input is better handled separately. I actually think that the existing predict(model, data), where data is the original input that includes missing values, is a sensible solution to getting back fitted values that include missing: you're asking for the model's output when given that input, which is inherently different from what the model used for fitting (in the presence of missing values).