Missing values and weighting

JuliaStats / Statistics.jl

The Statistics stdlib that ships with Julia.

https://juliastats.org/Statistics.jl/dev/

Other

70 stars 40 forks source link

Missing values and weighting #88

Open nalimilan opened 2 years ago

nalimilan commented 2 years ago

We currently have an efficient and consistent solution to skip missing values for unweighted single-argument functions via f(skipmissing(x)). For multiple-argument functions like cor we don't have a great solution yet (https://github.com/JuliaLang/Statistics.jl/pull/34). Another case where we don't have a good solution is weighted functions, which are not currently in Statistics but should be imported from StatsBase (https://github.com/JuliaLang/Statistics.jl/issues/87).

A reasonable solution would be to use f(skipmissing(x), weights=w), with a typical definition being:

function f(s::SkipMissing{<:AbstractVector}; weights::AbstractVector)
    size(s.x) == size(weights) || throw(DimensionMismatch())
    inds= find(!ismissing, s.x)
    f(view(s.x, inds), weights=view(weights, inds))
end

That is, we would assume that weights refer to the original vector so that we skip those corresponding to missing entries. This is admittedly a bit weird in terms of implementation as weights are not wrapped in skipmissing. A wrapper like skipmissing(weighted(x, w)) (inspired by what was proposed at https://github.com/JuliaLang/julia/pull/33310) would be cleaner in that regard. But that would still be quite ad-hoc, as skipmissing currently only accepts collections (and weighted cannot be one since it's not just about multiplying weights and values), and the resulting object would basically be only used for dispatch without implementing any common methods.

The generalization to multiple-argument functions poses the same challenges as cor. For these, the simplest solution would be to use a skipmissing keyword argument, a bit like pairwise. Again, the alternative would be to use wrappers like skipmissing(weighted(w, x, y)).

Overall, the problem is that we have conflicting goals:

be able to skip missing values with functions that don't have any special support for them using f(skipmissing(x))
use a similar syntax for unweighted and weighted functions, e.g. f(skipmissing(x)) vs f(skipmissing(x), weights=w), or f(skipmissing(x)) vs f(skipmissing(weighted(x, w))), or f(x, skipmissing=true) vs f(x, skipmissing=true, weights=w)
use a similar syntax for single- and multiple-argument functions, e.g. f(skipmissing(x)) vs f(skipmissing(x, y)), or f(x, skipmissing=true) vs f(x, y, skipmissing=true)
use a similar syntax for simple functions operating on vectors (like mean) and complex functions operating on whole tables (like fit(MODEL, ..., data=df, weights=w) and which skip missing values by default)

pdeffebach commented 2 years ago

Thanks for your hard work on this. I know what a tricky PR it is,

I really think something like passmissing for vectors is the solution. We don't have to support weights for everything if we have a function wrapper like

passmissing_vec(cor)(x, y)

which passes views of the non-missing entries of x and y. So if possible, something that works well with that kind of solution is most desirable. If we can support keyword arguments (I'm not sure why passmissing does not), then we can have

passmissing_vec(f)(x, weights = w)

that seems like a reasonable solution.

Of course, that doesn't work for something like

fit(MODEL, ..., data=df, weights = w)

but tbh I think that having weights as a separate vector from data is a bad design choice and should be changed in 2.0. If we allowed a column identifier, like a Symbol, then missings could be handled by the model appropriately. Even better, have weights be handled in the @formula (just like clustering, but that's another issue).

I'm glad you didn't propose dispatch based on AbstractWeights. With a call like sum(x, weights(w)), you can't use any sort of function wrapper passmissing_vec(x, ...) to neatly skip missing values. Plus I think this use of dispatch makes code very hard to maintain, if that matters.

As for cov(weighted(x, w)): this approach seems reasonable. We would have to have a whole WeightedStatistics package which re-defines common statistics functions for different AbstractWeighted objects. But then we are back to StatsBase, a whole other statistics package people have to load for small things. I don't share your concern with handling missings, though. We could just do cov(passmissing_vec(weighted)(x, w))

nalimilan commented 2 years ago

Ah, yes, wrapping the function rather than the data is indeed an interesting solution to allow passing weights while skipping missing values. One advantage would be that passmissing_vec(f)(...) should work for almost all functions so it would be consistent and easy to remember for users (though the syntax is a bit unusual).

The downside I see is that passmissing_vec(f)(x) would always work in cases where one can use f(skipmissing(x)), but the former makes one additional pass over the data. This means that users may not get the best performance if they don't carefully check whether they can use skipmissing with a given function (e.g. mean). Maybe that's not a big problem, but still something to keep in mind. (It should be possible to override (::typeof(passmissing_vec(f)))(...) for some specific functions if we want to make it fast but we certainly don't want to do that for many functions; I wonder whether we could automate this by passing skipmissing to functions whose signature allow it...)

I'm glad you didn't propose dispatch based on AbstractWeights. With a call like sum(x, weights(w)), you can't use any sort of function wrapper passmissing_vec(x, ...) to neatly skip missing values. Plus I think this use of dispatch makes code very hard to maintain, if that matters.

Are you sure? I'm not saying we should keep dispatch based on AbstractWeights, but given that it's an AbstractVector it should work well with passmissing_vec, whether it's passed as a positional or a keyword argument.

pdeffebach commented 2 years ago

Are you sure? I'm not saying we should keep dispatch based on AbstractWeights, but given that it's an AbstractVector it should work well with passmissing_vec, whether it's passed as a positional or a keyword argument.

Ah sorry, yes. But currently weights(x) fails if x has missing values. So we would have to change weights first to make it work.

nalimilan commented 2 years ago

We've discussed passmissing_vec with @bkamins and while it sounds promising, we identified a difficulty that deserves some discussion. It turns out that some statistical methods take several vectors whose entries do not correspond to observations: the best example is quantile(x, p).This is problematic as it means passmissing_vec(quantile)(x, p) won't work. There are at least three solutions to this:

Make p a keyword argument instead, and have passmissing_vec only skip missing values for a hardcoded list of keyword arguments, which would include weights.
Use a more complex design with a macro and escape sequences to specify which arguments should not be modified, like @passmissing quantile(x, $p).
Tell people to use a tuple rather than a vector for p.

AFAICT the number of functions in that case in StatsBase is relatively limited:

The most similar are counts(x, levels) and proportions(x, levels). Since they return a vector, we can't use them with passmissing_vecs anyway as the result shouldn't be "spread" to the input vector length.
fit(Histogram, data, edges) is also affected, but fit methods probably need a special treatment anyway, as functions that return a complex object will often want to handle missing values themselves (e.g. so that predict for models fills entries with missing for observations with missing values). Otherwise, edges could become a keyword argument.
autocov(x, lags), autocor(x, lags), crosscov(x, y, lags), crosscor(x, y, lags) and pacf(x, lags) would be misleading if used with passmissing_vecs as the relative positions of values would change when skipping missing values. The only solution to have a special argument to indicate that only complete pairs of observations after applying the lag to x should be used (this is what R does).
partialcor(x, y, Z) is OK when Z is a vector, but not when it's a matrix (same for pacf(X, lags) with a matrix). This raises the more general question of how to handle matrices (and Tables.jl tables), as we don't know whether observations are stored as rows or columns. But this isn't essential IMO, and we could use passmissing_vecs(f, dims=1/2) or passmissingcols/passmissingrows if we really want this.
inverse_rle(vals, lens) wouldn't work with passmissing_vecs either, but I don't think it should

So overall maybe quantile is the most notable exception to this.

Another corner case to consider are in-place functions like zscore!. Passing a view of complete cases to it means that other entries will be left as they are, rather than set to missing. Maybe that's OK but it's worth noting it.

pdeffebach commented 2 years ago

Use a more complex design with a macro and escape sequences to specify which arguments should not be modified, like @passmissing quantile(x, $p)

We could also do passmissing(quantile)(x, Ref(p)), like broadcasting (and the use of $ in @.).

bkamins commented 2 years ago

Let me add one more thing. When do we need something like passmissing_vec?

To answer this note the following:

it is not needed when a single collection is accepted (without weights), as then skipmisisng wrapper is good enough and we already have it;

This means that we need passmissing_vec only if:

we both want to pass a collection and weights
we want to pass more than one collection e.g. in cor(x, y) (with or without weights)
we want to pass something else than a collection that encodes multiple columns (e.g. a matrix)

Now the challenge is that general Tables.jl tables will not be able to support passmissing_vec as the interface does not specify a way to drop observations, so we would be left with with vectors or matrices (for which we should add some kind of dims kwarg as @nalimilan noted).

As an additional note some functions will still need a special treatment of missings, e.g. cor (or in general pairwise) needs it (as it can do whole observation or pairwise missing removal) which cannot be handled just by using a wrapper.

Having said all that I have the following thought:

I think it is useful to have passmissing_vec function for methods that are not missing-aware.
However, I would discuss what downsides we see with making the standard functions missing-aware by adding skipmissing kwarg to them (just like we would add the weights kwarg where function needs to be weighting aware)

@nalimilan - I know I am going back to what we already discussed but I really feel that most of typical users will have a much easier time learning to use skipmissing kwarg. Can we try to put a list of arguments against making skipmissing a kwarg in this thread? (note that having a kwarg is not conflicting with passmissing_vec - as it still could be added to make it easier to handle functions that are not missing aware)

nalimilan commented 2 years ago

@nalimilan - I know I am going back to what we already discussed but I really feel that most of typical users will have a much easier time learning to use skipmissing kwarg. Can we try to put a list of arguments against making skipmissing a kwarg in this thread? (note that having a kwarg is not conflicting with passmissing_vec - as it still could be added to make it easier to handle functions that are not missing aware)

I think the main argument against a skipmissing keyword argument is that it requires specific support in each function. That might be OK if it was needed only for multiple-argument functions, but if we want to support weighting that means adding that argument to basically all functions in Statistics (not all of them currently support weights, but ideally they would at some point). And even then, many packages may not accept adding such an argument (e.g. it would probably not be accepted for sum in Base), so users would have to know how to use different syntaxes depending on functions.

Conversely, one of the advantages of passmissing_vec is that we can easily try that approach by adding a function to Missings.jl, and if it doesn't work we can implement another solution, and just deprecate it or let people ignore it.

bkamins commented 2 years ago

Let me ask in #data and #statistics.

pdeffebach commented 2 years ago

Now the challenge is that general Tables.jl tables will not be able to support passmissing_vec as the interface does not specify a way to drop observations, so we would be left with with vectors or matrices (for which we should add some kind of dims kwarg as @nalimilan noted).

What statistics functions work on Tables.jl objects which don't already have an internal handling of missing, though? Do you mean something like fit(model, df; weights = ...)?

bkamins commented 2 years ago

What statistics functions work on Tables.jl objects

My comment is that in general there is an expectation to potentially allow Tables.jl objects in the functions provided (and the discussion how to do it has not happened yet).

adkabo commented 2 years ago

It's great to see Julia's evolution in the space of missingness handling, and the thoughtful discussions that underlie them. I have a thought I'd like to get some of your expert feedback on, and I'm hoping this is the right place for it.

I consider among Julia's biggest data-analytic advantages over its competitors that Julia functions respect missingness rather than hiding it.

But as @nalimilan writes,

use a similar syntax for simple functions operating on vectors (like mean) and complex functions operating on whole tables (like fit(MODEL, ..., data=df, weights=w) and which skip missing values by default)

there is a whole class of important functions that ignore missing data. When I'm using these functions, I'm always nervous I'll make a mistake about which rows I'm using.

In my applications, values are not "missing completely at random", and assuming they are will lead me to incorrect analytic conclusions. An analysis that drops rows missing e.g. income values may overlook a significant relationship if the missingness is correlated with other features of the data.

Furthermore, even if the top-line results are fine, I like to attend to missingness even in intermediate steps like model selection. As Richard McElreath puts it in Statistical Rethinking,

By removing the cases with missing values at the start, we accomplish the most important step in model comparison: Compared models must be fit to exactly the same observations. If you don’t remove the incomplete cases before you begin fitting models, you risk comparing models fit to different numbers of observations. This is a serious risk, because R’s more automated model fitting routines, like lm, will automatically and silently drop incomplete cases for you. If one model uses a predictor that has missing values, while another does not, then each model will be fit to a different number of cases. The model fit to fewer observations will almost always have a better deviance and AIC/DIC/WAIC value, because it has been asked to predict less.

Getting missing values in results or error messages will alert me that I need to be concerned about missingness in my dataset.

There are many possibilities for how to be explicit about it when that is desired, e.g. fit(m, dropmissing(df)), fitnonmissing(m, df), or -- ideally -- handling missingness in a separate step prior to fitting, and propagating missings or raising errors otherwise. But I would feel more confident in my statistics if my main fitting API were missing-safe.

I would love to hear how you think about this issue and how I might operate under the next statistics API.

nalimilan commented 2 years ago

Yes we should probably not drop rows with missing values silently when fitting models as it's inconsistent with the rest of Julia. fit(m, dropmissing(df)) is problematic as it will drop lots of rows if df contains columns that are not used in any of the fitted models, though. I'd be more inclined to have a skipmissing::Bool keyword argument. We should also provide user-friendly APIs to fit multiple models in a single cal, e.g. by passing a vector of formulas. Stata's nestreg is an interesting precedent.

(Note that none of this is defined in Statistics. That's really a separate issue IMO since it cannot be handled using passmissing_vec as discussed above.)

bkamins commented 2 years ago

Yes, the issue is that how and which missings should be dropped is a analytical procedure specific information and has to be handled internally by it.

But it has a consequence for a general design that maybe we in the end need to accept that there will be a skipmissing kwarg in many functions and the passmissing_vec will be just an alternative, in particular for functions that do not have this kwarg implemented.

pdeffebach commented 2 years ago

In an ideal world, this goes in the model. The ideal model is something like

formula = @formula(y ~ x | fe = absorb(x, t) | cluster(state) | dropmissing(all) | weight(w))
fit(formula, df)

But that's something for the very long term and dedicated funding.