nalimilan opened this issue 2 years ago
Thanks for your hard work on this. I know what a tricky PR it is. I really think something like `passmissing` for vectors is the solution. We don't have to support weights for everything if we have a function wrapper like `passmissing_vec(cor)(x, y)` which passes views of the non-missing entries of `x` and `y`. So if possible, something that works well with that kind of solution is most desirable. If we can support keyword arguments (I'm not sure why `passmissing` does not), then we can have `passmissing_vec(f)(x, weights = w)`, which seems like a reasonable solution.
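For concreteness, here is a minimal sketch of what such a `passmissing_vec` wrapper could look like. This is hypothetical — no such function exists in Missings.jl yet, and a full version would also need to subset a `weights` keyword argument as proposed above:

```julia
using Statistics

# Hypothetical sketch: wrap `f` so it is called on views of the entries
# where every vector argument is non-missing.
function passmissing_vec(f)
    return function (vecs::AbstractVector...; kwargs...)
        # indices at which all arguments have non-missing values
        inds = findall(i -> all(v -> !ismissing(v[i]), vecs), eachindex(vecs...))
        return f((view(v, inds) for v in vecs)...; kwargs...)
    end
end

x = [1.0, missing, 3.0, 4.0]
y = [2.0, 5.0, missing, 8.0]
passmissing_vec(cor)(x, y)  # cor of the complete pairs (1.0, 2.0) and (4.0, 8.0)
```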
Of course, that doesn't work for something like `fit(MODEL, ..., data=df, weights = w)`, but tbh I think that having weights as a separate vector from `data` is a bad design choice and should be changed in 2.0. If we allowed a column identifier, like a `Symbol`, then `missing`s could be handled by the model appropriately. Even better, have weights be handled in the `@formula` (just like clustering, but that's another issue).
I'm glad you didn't propose dispatch based on `AbstractWeights`. With a call like `sum(x, weights(w))`, you can't use any sort of function wrapper `passmissing_vec(x, ...)` to neatly skip missing values. Plus I think this use of dispatch makes code very hard to maintain, if that matters.
As for `cov(weighted(x, w))`: this approach seems reasonable. We would have to have a whole WeightedStatistics package which re-defines common statistics functions for different `AbstractWeighted` objects. But then we are back to StatsBase, a whole other statistics package people have to load for small things. I don't share your concern with handling `missing`s, though. We could just do `cov(passmissing_vec(weighted)(x, w))`.
Ah, yes, wrapping the function rather than the data is indeed an interesting solution to allow passing weights while skipping missing values. One advantage is that `passmissing_vec(f)(...)` should work for almost all functions, so it would be consistent and easy to remember for users (though the syntax is a bit unusual).

The downside I see is that `passmissing_vec(f)(x)` would always work in cases where one can use `f(skipmissing(x))`, but the former makes one additional pass over the data. This means that users may not get the best performance if they don't carefully check whether they can use `skipmissing` with a given function (e.g. `mean`). Maybe that's not a big problem, but still something to keep in mind. (It should be possible to override `(::typeof(passmissing_vec(f)))(...)` for some specific functions if we want to make them fast, but we certainly don't want to do that for many functions; I wonder whether we could automate this by passing `skipmissing` to functions whose signatures allow it...)
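The override idea could work if `passmissing_vec` returned a callable struct instead of a plain closure, so that fast single-pass paths can be added by dispatch. A hypothetical sketch (nothing here exists in any package):

```julia
using Statistics

# Hypothetical: a callable struct makes per-function overrides possible.
struct PassMissingVec{F}
    f::F
end
passmissing_vec(f) = PassMissingVec(f)

# Generic fallback: one pass to find complete cases, then call `f` on views.
function (p::PassMissingVec)(vecs::AbstractVector...; kwargs...)
    inds = findall(i -> all(v -> !ismissing(v[i]), vecs), eachindex(vecs...))
    return p.f((view(v, inds) for v in vecs)...; kwargs...)
end

# Single-pass override for `mean`, which already works with `skipmissing`.
(p::PassMissingVec{typeof(mean)})(x::AbstractVector) = mean(skipmissing(x))
```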
> I'm glad you didn't propose dispatch based on `AbstractWeights`. With a call like `sum(x, weights(w))`, you can't use any sort of function wrapper `passmissing_vec(x, ...)` to neatly skip missing values. Plus I think this use of dispatch makes code very hard to maintain, if that matters.
Are you sure? I'm not saying we should keep dispatch based on `AbstractWeights`, but given that it's an `AbstractVector` it should work well with `passmissing_vec`, whether it's passed as a positional or a keyword argument.
> Are you sure? I'm not saying we should keep dispatch based on `AbstractWeights`, but given that it's an `AbstractVector` it should work well with `passmissing_vec`, whether it's passed as a positional or a keyword argument.
Ah sorry, yes. But currently `weights(x)` fails if `x` has missing values, so we would have to change `weights` first to make it work.
We've discussed `passmissing_vec` with @bkamins and while it sounds promising, we identified a difficulty that deserves some discussion. It turns out that some statistical methods take several vectors whose entries do not correspond to observations: the best example is `quantile(x, p)`. This is problematic as it means `passmissing_vec(quantile)(x, p)` won't work. There are at least three solutions to this:

- Make `p` a keyword argument instead, and have `passmissing_vec` only skip missing values for a hardcoded list of keyword arguments, which would include `weights`.
- Use a more complex design with a macro and escape sequences to specify which arguments should not be modified, like `@passmissing quantile(x, $p)`.
- Special-case particular functions like `quantile` so that arguments such as `p` are passed through unchanged.
AFAICT the number of functions in that case in StatsBase is relatively limited:

- `counts(x, levels)` and `proportions(x, levels)`. Since they return a vector, we can't use them with `passmissing_vec` anyway, as the result shouldn't be "spread" to the input vector length.
- `fit(Histogram, data, edges)` is also affected, but `fit` methods probably need a special treatment anyway, as functions that return a complex object will often want to handle missing values themselves (e.g. so that `predict` for models fills entries with `missing` for observations with missing values). Otherwise, `edges` could become a keyword argument.
- `autocov(x, lags)`, `autocor(x, lags)`, `crosscov(x, y, lags)`, `crosscor(x, y, lags)` and `pacf(x, lags)` would be misleading if used with `passmissing_vec`, as the relative positions of values would change when skipping missing values. The only solution is to have a special argument to indicate that only complete pairs of observations after applying the lag to `x` should be used (this is what R does).
- `partialcor(x, y, Z)` is OK when `Z` is a vector, but not when it's a matrix (same for `pacf(X, lags)` with a matrix). This raises the more general question of how to handle matrices (and Tables.jl tables), as we don't know whether observations are stored as rows or columns. But this isn't essential IMO, and we could use `passmissing_vec(f, dims=1/2)` or `passmissingcols`/`passmissingrows` if we really want this.
- `inverse_rle(vals, lens)` wouldn't work with `passmissing_vec` either, but I don't think it should.

So overall maybe `quantile` is the most notable exception to this.
Another corner case to consider is in-place functions like `zscore!`. Passing a view of complete cases to it means that other entries will be left as they are, rather than set to `missing`. Maybe that's OK, but it's worth noting.
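To illustrate that behavior (using a manual in-place standardization instead of `zscore!` itself, since its methods may not accept vectors with `missing` entries):

```julia
using Statistics

x = [1.0, missing, 3.0, 5.0]
inds = findall(!ismissing, x)
v = view(x, inds)            # view of the complete cases only
μ, σ = mean(v), std(v)
v .= (v .- μ) ./ σ           # standardize in place through the view
# x[2] is still `missing`: entries outside the view are left untouched
```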
> Use a more complex design with a macro and escape sequences to specify which arguments should not be modified, like `@passmissing quantile(x, $p)`

We could also do `passmissing(quantile)(x, Ref(p))`, like broadcasting (and the use of `$` in `@.`).
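A hypothetical sketch of that `Ref` convention (the name `passmissing_ref` is invented for illustration): vector arguments are restricted to complete cases, while `Ref`-wrapped arguments are unwrapped and passed through unchanged, mirroring how `Ref` opts an argument out of broadcasting:

```julia
using Statistics

function passmissing_ref(f)
    return function (args...; kwargs...)
        vecs = [a for a in args if a isa AbstractVector]
        inds = findall(i -> all(v -> !ismissing(v[i]), vecs), eachindex(vecs...))
        # subset vectors to complete cases; unwrap `Ref`-protected arguments
        newargs = map(a -> a isa AbstractVector ? view(a, inds) : a[], args)
        return f(newargs...; kwargs...)
    end
end

passmissing_ref(quantile)([1.0, missing, 3.0, 4.0], Ref(0.5))  # quantile of [1.0, 3.0, 4.0] at p = 0.5
```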
Let me add one more thing. When do we need something like `passmissing_vec`?

To answer this, note the following:

- for functions taking a single vector, the `skipmissing` wrapper is good enough and we already have it.

This means that we need `passmissing_vec` only if:

- the function takes several vectors whose entries correspond to observations, like `cor(x, y)` (with or without weights).

Now the challenge is that general Tables.jl tables will not be able to support `passmissing_vec`, as the interface does not specify a way to drop observations, so we would be left with vectors or matrices (for which we should add some kind of `dims` kwarg as @nalimilan noted).

As an additional note, some functions will still need a special treatment of missings: e.g. `cor` (or in general `pairwise`) needs it (as it can do whole-observation or pairwise missing removal), which cannot be handled just by using a wrapper.
Having said all that, I have the following thought:

- provide the `passmissing_vec` function for methods that are not missing-aware;
- for missing-aware methods, add a `skipmissing` kwarg to them (just like we would add the `weights` kwarg where a function needs to be weighting-aware).

@nalimilan - I know I am going back to what we already discussed, but I really feel that most typical users will have a much easier time learning to use a `skipmissing` kwarg. Can we try to put a list of arguments against making `skipmissing` a kwarg in this thread? (Note that having a kwarg does not conflict with `passmissing_vec` - it could still be added to make it easier to handle functions that are not missing-aware.)
> @nalimilan - I know I am going back to what we already discussed, but I really feel that most typical users will have a much easier time learning to use a `skipmissing` kwarg. Can we try to put a list of arguments against making `skipmissing` a kwarg in this thread? (Note that having a kwarg does not conflict with `passmissing_vec` - it could still be added to make it easier to handle functions that are not missing-aware.)
I think the main argument against a `skipmissing` keyword argument is that it requires specific support in each function. That might be OK if it were needed only for multiple-argument functions, but if we want to support weighting that means adding that argument to basically all functions in Statistics (not all of them currently support weights, but ideally they would at some point). And even then, many packages may not accept adding such an argument (e.g. it would probably not be accepted for `sum` in Base), so users would have to know how to use different syntaxes depending on the function.

Conversely, one of the advantages of `passmissing_vec` is that we can easily try that approach by adding a function to Missings.jl, and if it doesn't work we can implement another solution and just deprecate it or let people ignore it.
Let me ask in #data and #statistics.
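To make the "specific support in each function" point concrete, here is a hypothetical sketch (the name `mycor` is invented) of what a `skipmissing` kwarg entails; this boilerplate would have to be repeated, or generated, for every function that adopts the kwarg:

```julia
using Statistics

# Hypothetical `mycor`: a two-vector function with a `skipmissing` kwarg.
function mycor(x::AbstractVector, y::AbstractVector; skipmissing::Bool=false)
    if skipmissing
        # keep only indices where both vectors have non-missing values
        inds = findall(i -> !ismissing(x[i]) && !ismissing(y[i]), eachindex(x, y))
        x, y = view(x, inds), view(y, inds)
    end
    return cor(x, y)
end

mycor([1.0, missing, 3.0], [2.0, 5.0, 6.0], skipmissing=true)
```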
> Now the challenge is that general Tables.jl tables will not be able to support `passmissing_vec`, as the interface does not specify a way to drop observations, so we would be left with vectors or matrices (for which we should add some kind of `dims` kwarg as @nalimilan noted).
What statistics functions work on Tables.jl objects which don't already have an internal handling of `missing`, though? Do you mean something like `fit(model, df; weights = ...)`?
> What statistics functions work on Tables.jl objects

My comment is that in general there is an expectation to potentially allow Tables.jl objects in the functions provided (and the discussion of how to do it has not happened yet).
It's great to see Julia's evolution in the space of missingness handling, and the thoughtful discussions that underlie them. I have a thought I'd like to get some of your expert feedback on, and I'm hoping this is the right place for it.
I consider it one of Julia's biggest data-analytic advantages over its competitors that Julia functions respect missingness rather than hiding it.
But as @nalimilan writes,
> - use a similar syntax for simple functions operating on vectors (like `mean`) and complex functions operating on whole tables (like `fit(MODEL, ..., data=df, weights=w)`) and which skip missing values by default
there is a whole class of important functions that ignore missing data. When I'm using these functions, I'm always nervous I'll make a mistake about which rows I'm using.
In my applications, values are not "missing completely at random", and assuming they are will lead me to incorrect analytic conclusions. An analysis that drops rows missing e.g. income values may overlook a significant relationship if the missingness is correlated with other features of the data.
Furthermore, even if the top-line results are fine, I like to attend to missingness even in intermediate steps like model selection. As Richard McElreath puts it in Statistical Rethinking,
> By removing the cases with missing values at the start, we accomplish the most important step in model comparison: Compared models must be fit to exactly the same observations. If you don’t remove the incomplete cases before you begin fitting models, you risk comparing models fit to different numbers of observations. This is a serious risk, because R’s more automated model fitting routines, like lm, will automatically and silently drop incomplete cases for you. If one model uses a predictor that has missing values, while another does not, then each model will be fit to a different number of cases. The model fit to fewer observations will almost always have a better deviance and AIC/DIC/WAIC value, because it has been asked to predict less.
Getting missing values in results or error messages will alert me that I need to be concerned about missingness in my dataset.
There are many possibilities for how to be explicit about it when that is desired, e.g. `fit(m, dropmissing(df))`, `fitnonmissing(m, df)`, or -- ideally -- handling missingness in a separate step prior to fitting, and propagating missings or raising errors otherwise. But I would feel more confident in my statistics if my main fitting API were missing-safe.
I would love to hear how you think about this issue and how I might operate under the next statistics API.
Yes, we should probably not drop rows with missing values silently when fitting models, as it's inconsistent with the rest of Julia. `fit(m, dropmissing(df))` is problematic as it will drop lots of rows if `df` contains columns that are not used in any of the fitted models, though. I'd be more inclined to have a `skipmissing::Bool` keyword argument. We should also provide user-friendly APIs to fit multiple models in a single call, e.g. by passing a vector of formulas. Stata's `nestreg` is an interesting precedent.
(Note that none of this is defined in Statistics. That's really a separate issue IMO, since it cannot be handled using `passmissing_vec` as discussed above.)
Yes, the issue is that how and which missings should be dropped is information specific to the analytical procedure and has to be handled internally by it.

But it has a consequence for the general design: maybe in the end we need to accept that there will be a `skipmissing` kwarg in many functions, and `passmissing_vec` will just be an alternative, in particular for functions that do not have this kwarg implemented.
In an ideal world, this goes in the model. The ideal model is something like

```julia
formula = @formula(y ~ x | fe = absorb(x, t) | cluster(state) | dropmissing(all) | weight(w))
fit(formula, df)
```

But that's something for the very long term and dedicated funding.
We currently have an efficient and consistent solution to skip missing values for unweighted single-argument functions via `f(skipmissing(x))`. For multiple-argument functions like `cor` we don't have a great solution yet (https://github.com/JuliaLang/Statistics.jl/pull/34). Another case where we don't have a good solution is weighted functions, which are not currently in Statistics but should be imported from StatsBase (https://github.com/JuliaLang/Statistics.jl/issues/87).

A reasonable solution would be to use `f(skipmissing(x), weights=w)`, with a typical definition subsetting the weights accordingly. That is, we would assume that weights refer to the original vector, so that we skip those corresponding to missing entries. This is admittedly a bit weird in terms of implementation, as weights are not wrapped in `skipmissing`. A wrapper like `skipmissing(weighted(x, w))` (inspired by what was proposed at https://github.com/JuliaLang/julia/pull/33310) would be cleaner in that regard. But that would still be quite ad hoc, as `skipmissing` currently only accepts collections (and `weighted` cannot be one since it's not just about multiplying weights and values), and the resulting object would basically be only used for dispatch without implementing any common methods.

The generalization to multiple-argument functions poses the same challenges as `cor`. For these, the simplest solution would be to use a `skipmissing` keyword argument, a bit like `pairwise`. Again, the alternative would be to use wrappers like `skipmissing(weighted(w, x, y))`.

Overall, the problem is that we have conflicting goals:
- support the existing efficient syntax `f(skipmissing(x))` for the simple unweighted case;
- use a consistent syntax for unweighted and weighted cases, i.e. `f(skipmissing(x))` vs `f(skipmissing(x), weights=w)`, or `f(skipmissing(x))` vs `f(skipmissing(weighted(x, w)))`, or `f(x, skipmissing=true)` vs `f(x, skipmissing=true, weights=w)`;
- use a consistent syntax for single- and multiple-argument cases, i.e. `f(skipmissing(x))` vs `f(skipmissing(x, y))`, or `f(x, skipmissing=true)` vs `f(x, y, skipmissing=true)`;
- use a similar syntax for simple functions operating on vectors (like `mean`) and complex functions operating on whole tables (like `fit(MODEL, ..., data=df, weights=w)`) and which skip missing values by default.
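The "typical definition" mentioned for `f(skipmissing(x), weights=w)` might look like the following sketch. This is hypothetical: `wmean` is an invented stand-in for a weighted `mean` method, and it reaches into `SkipMissing`'s internal field `x` rather than using any public accessor:

```julia
using Statistics

# Hypothetical weighted mean on a SkipMissing wrapper: `weights` indexes
# the original vector, so weights at missing positions are dropped too.
function wmean(itr::Base.SkipMissing{<:AbstractVector}; weights::AbstractVector)
    x = itr.x                        # parent vector (internal field)
    inds = findall(!ismissing, x)
    xv, wv = view(x, inds), view(weights, inds)
    return sum(xv .* wv) / sum(wv)
end

wmean(skipmissing([1.0, missing, 3.0]); weights = [1.0, 2.0, 3.0])  # (1*1 + 3*3) / (1 + 3) = 2.5
```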