JuliaStats / StatsModels.jl

Specifying, fitting, and evaluating statistical models in Julia
248 stars 31 forks source link

Unified framework for dealing with Nulls? #17

Open mkborregaard opened 7 years ago

mkborregaard commented 7 years ago

Currently, lm seems to deal with NAs in DataFrames by ignoring the rows where they appear:

df = DataFrame(a = randn(10), b = randn(10));
df[:a][2] = NA
mod = lm(a~b, df)
mod.model.pp.X

This is great, but is there a general syntax for this behaviour? I'm asking, among other things, because I want to add a keyword for this behaviour to plotting with DataFrames in StatPlots.

In R, this is a complete mess. mean uses the keyword na.rm = T, lm uses na.action = "na.omit", cor uses `use = "complete.obs", which is just the sort of thing that guarantees that half your coding time in R is spent reading man pages also after 10 years of use.

nalimilan commented 7 years ago

For functions like mean, for now has to do mean(dropnull(x)), which is consistent but makes a copy. A keyword argument skipnull=true (like for reduce in NullableArrays) or skipna=true (like in DataArrays) would work, but we could also imagine a more generic system where a dropnull or skipnull function would return a wrapper iterator skipping nulls.

Anyway it's not clear whether this system would be the same with statistical models. If the question is as simple as for mean, the same keyword argument could be used. But maybe it's more complex. R's na.action system allows four choices: skip nulls (default), skip nulls but use null in the predict output for corresponding rows, keep nulls (i.e. no-op), throw and error if there is a null. I'm not convinced this is needed:

So overall, maybe we just need to standardize a keyword argument name, like skipnull? We should review what other software does.

EDIT: Forgot to address the case of pairwise functions like cor and cov, where more choices are possible. I think we've discussed them in the past, need to find the link.

mkborregaard commented 7 years ago

Yes, sounds very sensible. Ah, and what's the current policy for predict? I believe I should not have to worry about it for https://github.com/JuliaStats/DataFrames.jl/pull/1160.

nalimilan commented 7 years ago

I forgot to address the case of pairwise functions like cor and cov, where more choices are possible: skip all rows with at least in null in one of the columns (cor(dt[completecases(dt), :])), or only skip those with a null in one of the two variables considered for each computation. This has been discussed before on julia-stats. I would say these functions could still take the skipnull=true argument for consistency with others, but possibly combined with another argument like pairwise_complete=true to consider only the relevant columns when skipping.

But this actually has really little or no relation with modeling functions. :-)

Regarding predict, AFAIK the current policy is simple: since only complete observations are kept in the model matrix, no missing values can appear in the output of predict. I think it would be useful to store the list of removed cases so that a missing value that can be returned for them when needed. Could be as simple as storing a list of indices (not sure whether storing a second model matrix with nulls where information is unavailable would be needed for some applications).

mkborregaard commented 7 years ago

Cool. Yes, e.g. df = DataFrame(x = x, y = y); mod = lm(y~x, df); plot(df, :y, predict(mod)) would be nice to be able to do even in the presence of NAs (of course you can already do plot(df, :y, predict(mod, df)) so maybe not the most illustrative example).

mkborregaard commented 7 years ago

It is a bit tricky perhaps to get the skipnull = true and pairwise_complete=true combination to be intuitive (would setting pairwise_complete=true lead to fewer or more observations to be included? what if skipnull = false?). Perhaps allowing skipnull to take true, false and, where applicable, :pairwise_complete would be an option (with :none an alias for false and :all an alias for true? Plots.jl uses this kind of logic in many places but maybe it would appear too magical.

nalimilan commented 7 years ago

Yes, there's still the option of allowing both Bool and Symbol. Or we could find a better name for the argument. Anyway, better discuss this in the right packages (DataArrays and NullableArrays).

andreasnoack commented 7 years ago

There is some iterator functionality for handling this in DataArrays. I don't know if it practical but I'd love to see some alternative way to specify this that isn't a keyword argument. Keyword arguments are so R like. I don't have a specific proposal though. Just wanted to mentioned it.

nalimilan commented 7 years ago

For reduction functions like mean, an iterator can easily be used. For pairwise functions like cor, it would be tricky to handle the "pairwise missing deletion" since the result cannot be a data table-like structure: deletion needs to happen inside the function itself. Though for this case (which might not be a so frequent need), we could have a generic pairwise function (like the one in Distances.jl) which would take the arguments we talked about. That way at least we wouldn't need to reimplement them several times. Standard incomplete case deletion could be done by users via cor(dropnull(dt)) or cor(dt[complete_cases(dt),:]).

mkborregaard commented 7 years ago

I actually recently put a generic pairwise function in VectorizedRoutines.jl if that's relevant https://github.com/ChrisRackauckas/VectorizedRoutines.jl/pull/9/files . You'd be welcome to copy over any elements of it (given that I'm pretty sure you don't want a dependency). I got a lot of useful feedback on the PR from Steven Johnson, so the worst elements should have been ironed out.

nalimilan commented 7 years ago

Interesting, but that pairwise function operates on individual elements of a vector, not on columns or rows. Looks like it could be generalized by passing the dimension to slice as the third argument. pairwise(f, a, 2) would behave like pairwise(dist, a) from Distances.jl, and the two functions could be merged (with a definition in StatsBase?).

mkborregaard commented 7 years ago

That is a good idea - the function has always been intended to be extended to multiple dimensions. Which design do you think would be preferable - one where f always returns a scalar and the return type is a Matrix or one that allows return types of any dimension N and concatenates it to an N+2 Array? I think the first, just making sure.

nalimilan commented 7 years ago

I would start assuming a scalar. We can always make it even more general if that turns out to be useful.