JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

Skipping missing values more easily #2314

Open nalimilan opened 4 years ago

nalimilan commented 4 years ago

It seems that dealing with missing values is one of the most painful issues we have, which is at odds with the otherwise powerful and convenient DataFrames API. Having to write things like filter(:col => x -> coalesce(x > 1, false), df) or combine(gd, :col => (x -> sum(skipmissing(x)))) isn't ideal. One proposal to alleviate this is https://github.com/JuliaData/DataFrames.jl/issues/2258: add a skipmissing argument to functions like filter, select, transform and combine to unify the way one can skip missing values, instead of having to use different syntaxes which are hard to grasp for newcomers and make the code harder to read.
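To make the pain point concrete, here is a minimal sketch on a plain vector standing in for a column (Base Julia only; `col` is made-up illustrative data, not taken from this thread). It shows why the two workarounds above are needed: the reduction propagates missing, and the comparison returns missing, which filter cannot use as a Bool.

```julia
# A plain vector stands in for a data frame column (illustrative data).
col = [1, missing, 3]

# Reductions propagate missing values unless they are skipped explicitly.
total_propagated = sum(col)            # missing
total_skipped = sum(skipmissing(col))  # sums only the non-missing entries

# `missing > 1` evaluates to `missing`, so an unguarded predicate makes
# `filter` throw; coalesce(..., false) turns that case into "drop the row".
unguarded_throws = try
    filter(x -> x > 1, col)
    false
catch
    true
end
kept = filter(x -> coalesce(x > 1, false), col)
```

The same two patterns are what the data frame calls above wrap around a column.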

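That would be one step towards being more user-friendly, but one would still have to repeat skipmissing=true all the time when dealing with missing values. I figured two solutions could be considered to improve this:

  • Have DataFramesMeta simplify the handling of missing values. This could be via a dedicated e.g. @linqskipmissing macro or a statement like skipmissing within a @linq block that would automatically pass skipmissing=true to all subsequent operations in a chain or block. This wouldn't really help with operations outside such blocks though.
  • Have a field in DataFrame objects that would store the default value to use for the skipmissing argument. By default it would be false, so that you get the current behavior, which ensures safety via propagation of missing values. But when you know you are working with a data set with missing values, you would be able to call skipmissing!(df, true) once and then avoid repeating it.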

Somewhat similar discussions happened a long time ago (but at the array rather than the data frame level) at https://github.com/JuliaStats/DataArrays.jl/issues/39. I think it's fair to say that we now have enough experience to make a decision. One argument against implementing this at the DataFrame level is that it will have no effect on operations applied directly to column vectors, like sum(df.col). But that's better than nothing.

Cc: @bkamins, @matthieugomez, @pdeffebach, @mkborregaard

pdeffebach commented 4 years ago

  • Have DataFramesMeta simplify the handling of missing values. This could be via a dedicated e.g. @linqskipmissing macro or a statement like skipmissing within a @linq block that would automatically pass skipmissing=true to all subsequent operations in a chain or block. This wouldn't really help with operations outside such blocks though.

This is a good idea. However, I also like skipmissing = true at the level of a transform call, or even at the level of an individual argument, because it's explicit.

Perhaps DataFramesMeta could provide a block-level skipmissing option as well as a macro like

@transform(df, @sm y = :x1 .+ mean(:x2))

where the @sm macro does the necessary transformation described in #2258

function wrapper(fun, x, y)
    sx, sy = Missings.skipmissings(x, y) # need to be on Missings master
    sub_out = fun(sx, sy)
    full_out = Vector{Union{eltype(sub_out), Missing}}(missing, length(x))
    full_out[eachindex(sx)] .= sub_out # eachindex(sx) returns indices of complete cases

    return full_out
end

  • Have a field in DataFrame objects that would store the default value to use for the skipmissing argument. By default it would be false, so that you get the current behavior, which ensures safety via propagation of missing values. But when you know you are working with a data set with missing values, you would be able to call skipmissing!(df, true) once and then avoid repeating it.

I'm not a fan of this idea since the behavior would depend on global state that could be set a long way from where the transform call is. It seems like it could make debugging a pain.

matthieugomez commented 4 years ago

It's great you're thinking about how to make working with missing values easier! I 100% agree.

A macro may be good.

I'm not sure I'm convinced by an option at the level of the dataframe. It sounds complicated to keep track of it. For instance, won't people be confused that stuff like df = merge(df, df_using) does not retain the option?

A third possibility would be to allow users to change the default option for transform, etc, say by writing SKIPMISSING = true at the start of a script.

c42f commented 4 years ago

To throw in a much more crazy idea: could we use contextual dispatch to override the behavior of all reductions and comparisons within a whole expression so that they are "missing-permissive"?

I've long been frustrated with the difficulty of working with missing given that it infects all downstream operations and wished it worked differently in Base. (I acknowledge my frustration could be misguided — perhaps it's saved me from some horrible bugs without knowing it :-) )

nalimilan commented 4 years ago

Perhaps DataFramesMeta could provide a block-level skipmissing option as well as a macro like

@transform(df, @sm y = :x1 .+ mean(:x2))

where the @sm macro does the necessary transformation described in #2258

@pdeffebach Yes that was more or less what I had in mind. Though if you have to repeat this for each call it's not much better than passing skipmissing=true. Being able to apply it to a chain of operations would already be more useful.

I'm not a fan of this idea since the behavior would depend on global state that could be set a long way from where the transform call is. It seems like it could make debugging a pain.

@pdeffebach Yes. OTOH it's not so different from e.g. creating a column: if you get an error because it doesn't exist or it's incorrect you have to find where it's been defined, which can be quite far from where the error happens.

I'm not sure I'm convinced by an option at the level of the dataframe. It sounds complicated to keep track of it. For instance, won't people be confused that stuff like df = merge(df, df_using) does not retain the option?

A third possibility would be to allow users to change the default option for transform, etc, say by writing SKIPMISSING = true at the start of a script.

@matthieugomez Yes, losing the option after transformations could be annoying. Though it could be propagated across some operations which always preserve missing values: in these cases there's little point in forcing you to repeat that you know there are missing values. Where safety checks matter is when you could believe you got rid of missing values and for some reason it's not the case. But I admit that this option would decrease safety (even if not propagated automatically) as you could pass a data frame to a function which isn't prepared to handle missing values and it would silently skip them.

A global setting would only aggravate these issues IMHO since it would affect completely unrelated operations, possibly in packages, some of which may rely on the implicit skipmissing=false.

To throw in a much more crazy idea: could we use contextual dispatch to override the behavior of all reductions and comparisons within a whole expression so that they are "missing-permissive"?

@c42f My suggestion about DataFramesMeta is a kind of limited way to change the behavior of a whole block. Using Cassette.jl (which I guess is what you mean by "contextual dispatch"?) would indeed be a more general approach and it has been mentioned several times. Actually I just tried that and it turns out to be very simple to make it propagate missing values:

using Cassette, Missings

Cassette.@context PassMissingCtx

Cassette.overdub(ctx::PassMissingCtx, f, args...) = passmissing(f)(args...)

# do not lift special functions which already handle missing
for f in (:ismissing, :(==), :isequal, :(===), :&, :|, :⊻)
    @eval begin
        Cassette.overdub(ctx::PassMissingCtx, ::typeof($f), args...) = $f(args...)
    end
end

f(x) = x > 0 ? log(x) : -Inf

julia> Cassette.@overdub PassMissingCtx() f(missing)
missing

julia> Cassette.@overdub PassMissingCtx() f(1)
0.0

julia> Cassette.@overdub PassMissingCtx() ismissing(missing)
true

julia> Cassette.@overdub PassMissingCtx() missing | true
true
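To spell out what passmissing(f) contributes in the overdub rule above, here is a Base-only sketch. The names lift_missing and log_or_neginf are hypothetical and exist only for this example; lift_missing imitates the documented behavior of Missings.passmissing (return missing if any positional argument is missing) and is not its actual implementation.

```julia
# Hypothetical stand-in for Missings.passmissing, using Base only:
# return `missing` as soon as any positional argument is missing.
lift_missing(f) = (args...) -> any(ismissing, args) ? missing : f(args...)

# Same shape as `f` in the example above. Called directly on `missing` it
# fails, because `missing > 0` is not a Bool and cannot drive the `?:` test.
log_or_neginf(x) = x > 0 ? log(x) : -Inf
lifted_log = lift_missing(log_or_neginf)

on_missing = lifted_log(missing)  # the lifted function returns missing
on_one = lifted_log(1)            # falls through to log(1)

# Lifting every function is too blunt for functions that must see `missing`:
# a lifted `ismissing` can no longer answer `true`.
lifted_ismissing_result = lift_missing(ismissing)(missing)
```

The last line is the reason the loop above leaves ismissing, == and the logical operators unlifted.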

Skipping missing values in reductions will be a little harder, but it's doable if we only want to handle a known list of functions. For example this quick hack works:

Cassette.overdub(ctx::PassMissingCtx, ::typeof(sum), x) = sum(skipmissing(x))

julia> x = [1, missing];

julia> Cassette.@overdub PassMissingCtx() sum(x)
1

Maybe this kind of thing could be made simpler to use by providing a macro like @passmissing as a shorthand for Cassette.@overdub PassMissingCtx(). But it's important to measure all the implications of this approach: since it will affect all function calls deep in the code that you didn't write yourself, it will have lots of unintended side effects. For example, Cassette.@overdub PassMissingCtx() [missing] gives missing, which will break lots of package code. A safer approach would be to only apply passmissing to a whitelist of functions for which it makes sense (scalar functions, mainly), like DataValues does -- with the drawback that the list is kind of arbitrary.

In the end, maybe Cassette is too powerful for what we actually need in the context of DataFrames. In practice with select/transform/combine it makes sense to only apply passmissing to the top-level functions, rather than recursively. And for reductions, passing views of complete rows as proposed at https://github.com/JuliaData/DataFrames.jl/issues/2258 should be enough, when combined with a convenient DataFramesMeta syntax.

c42f commented 4 years ago

My suggestion about DataFramesMeta is a kind of limited way to change the behavior of a whole block. Using Cassette.jl (which I guess is what you mean by "contextual dispatch"?) would indeed be a more general approach and it has been mentioned several times. Actually I just tried that and it turns out to be very simple to make it propagate missing values:

Very cool. IIUC people have started to use other libraries like IRTools, which plug into the compiler in a similar way to Cassette, but yes, that's roughly what I had in mind. One downside is that it's a pretty heavyweight tool to deploy for something like missing.

since it will affect all function calls deep in the code that you didn't write yourself, it will have lots of unintended side effects

I think this is the bigger problem; it's not clear how deep to recurse and it could definitely have unintended consequences. It's very similar to the problem with floating-point rounding modes, where having the rounding mode as dynamic state really doesn't work well because it infects other calculations which weren't programmed to deal with a different rounding mode.

So I agree it does seem much safer and more sensible to have a macro which lowers only the syntax within the immediate expression to be more permissive with missing. Actually I wonder whether you could do something more like broadcast lowering, where all function calls within the expression are lifted in a certain way such that dispatch can be customized, so, e.g., @linq f(x,y,z) becomes linq_missing(f, x, y, z).

In terms of your examples,

@linq filter(:col => (x -> x > 1), df)
# means
linq_missing(filter, :col => (x -> linq_missing(>, x, 1)), df)

@linq combine(gd, :col => (x -> sum(x)))
# means
linq_missing(combine, gd, :col => (x -> linq_missing(sum, x)))

Something like this has very regular lowering rules and gives a measure of extensibility for user defined functions.
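A rough sketch of how such a dispatch hook could look, using plain vectors instead of data frames. Only the name linq_missing comes from the comment above; the three methods, the choice of which functions to special-case, and the data are assumptions made for illustration, and no @linq macro is implemented here (the lowered calls are written out by hand).

```julia
# Default method: call the function unchanged.
linq_missing(f, args...) = f(args...)
# Comparison: a missing result counts as false (assumed rule for this sketch).
linq_missing(::typeof(>), x, y) = coalesce(x > y, false)
# Reduction: skip missing values (assumed rule for this sketch).
linq_missing(::typeof(sum), x) = sum(skipmissing(x))

col = [1, missing, 3]  # illustrative stand-in for a column

# Hand-written equivalent of lowering `filter(x -> x > 1, col)`:
kept = linq_missing(filter, x -> linq_missing(>, x, 1), col)
# Hand-written equivalent of lowering `sum(col)`:
total = linq_missing(sum, col)
```

A user-defined function would opt in by adding its own linq_missing method, which is the extensibility mentioned above.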

matthieugomez commented 4 years ago

One alternative is that filter/transform/combine could always skip missing (i.e. kwarg skipmissing = true is the default).

bkamins commented 4 years ago

Can you please help me understand this proposal in detail? What should be the outcome of the following operations if we added a skipmissing option? (Below I show what we get now; I would like to understand what we should get with skipmissing=true.)

julia> df = DataFrame(x=[1,missing,2])
3×1 DataFrame
│ Row │ x       │
│     │ Int64?  │
├─────┼─────────┤
│ 1   │ 1       │
│ 2   │ missing │
│ 3   │ 2       │

julia> select(df, :x)
3×1 DataFrame
│ Row │ x       │
│     │ Int64?  │
├─────┼─────────┤
│ 1   │ 1       │
│ 2   │ missing │
│ 3   │ 2       │

julia> select(df, :x => identity)
3×1 DataFrame
│ Row │ x_identity │
│     │ Int64?     │
├─────┼────────────┤
│ 1   │ 1          │
│ 2   │ missing    │
│ 3   │ 2          │

julia> select(df, :x => ByRow(identity))
3×1 DataFrame
│ Row │ x_identity │
│     │ Int64?     │
├─────┼────────────┤
│ 1   │ 1          │
│ 2   │ missing    │
│ 3   │ 2          │

julia> combine(df, :x)
3×1 DataFrame
│ Row │ x       │
│     │ Int64?  │
├─────┼─────────┤
│ 1   │ 1       │
│ 2   │ missing │
│ 3   │ 2       │

julia> combine(df, :x => identity)
3×1 DataFrame
│ Row │ x_identity │
│     │ Int64?     │
├─────┼────────────┤
│ 1   │ 1          │
│ 2   │ missing    │
│ 3   │ 2          │

julia> combine(df, :x => ByRow(identity))
3×1 DataFrame
│ Row │ x_identity │
│     │ Int64?     │
├─────┼────────────┤
│ 1   │ 1          │
│ 2   │ missing    │
│ 3   │ 2          │

nalimilan commented 4 years ago

AFAICT this question regards #2258, better keep matters separate as discussions are already tricky enough.

matthieugomez commented 4 years ago

With select, skipmissing would apply the function to the rows for which none of the arguments is missing, return the output on these rows, and return missing otherwise (to keep the same number of rows as the original data frame). So skipmissing = true would not change anything in your example. Similarly for transform.

With combine, skipmissing would apply the functions only to rows for which none of the arguments is missing.

julia> combine(df, :x)
2×1 DataFrame
│ Row │ x       │
│     │ Int64?  │
├─────┼─────────┤
│ 1   │ 1       │
│ 2   │ 2       │

julia> combine(df, :x => identity)
2×1 DataFrame
│ Row │ x_identity │
│     │ Int64?     │
├─────┼────────────┤
│ 1   │ 1          │
│ 2   │ 2          │

julia> combine(df, :x => ByRow(identity))
2×1 DataFrame
│ Row │ x_identity │
│     │ Int64?     │
├─────┼────────────┤
│ 1   │ 1          │
│ 2   │ 2          │

That being said, I am not sure why combine(df, :x) or combine(df, :x => ByRow(.)) are accepted; maybe it would be clearer to disallow them?

bkamins commented 4 years ago

Ah - sorry - I thought this issue superseded that one. So actually this issue should be on hold until we resolve #2258, because it only asks to make #2258 simpler - right?

matthieugomez commented 4 years ago

Could we think about making filter/transform/combine/select have the kwarg skipmissing = true by default? It's not unheard of: that's what Stata and pandas (for combine) do. Also, data.table and dplyr automatically skip missing values in their versions of filter.

bkamins commented 4 years ago

I would not have a problem with this. We already have it in groupby so it would be consistent. So I assume you want:

  1. for filter wrap a predicate p in x -> coalesce(p(x), false)
  2. for transform/combine/select we have two decisions:
    • do we pass views or skipmissing wrappers? (given the second question probably views, it is going to be a bit slow, but I think skipmissing=true does not have to be super fast as it is a convenience wrapper)
    • what do we do if multiple columns are passed - skip rows that have at least one missing? (this is what groupby does)

nalimilan commented 4 years ago

Regarding 2, note that as discussed at https://github.com/JuliaData/DataFrames.jl/issues/2258 we have to pass views for select and transform as we need to be able to reassign values to the non-missing rows in the input. When multiple columns are passed, we should only keep complete observations, otherwise something like [:x, :y] => + wouldn't work due to different lengths.
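The "keep complete observations, then reassign to the non-missing rows" behavior can be sketched on plain vectors as follows. apply_complete_cases is a hypothetical helper written for this illustration and is not part of DataFrames.jl; for brevity it copies the complete rows instead of passing views as discussed above.

```julia
# Hypothetical helper (not a DataFrames.jl function): apply `fun` to the
# complete cases of the given columns and write the result back to those rows.
function apply_complete_cases(fun, cols::AbstractVector...)
    n = length(first(cols))
    all(c -> length(c) == n, cols) ||
        throw(DimensionMismatch("all columns must have the same length"))
    # Indices of rows where no column is missing ("complete observations").
    keep = findall(i -> !any(c -> ismissing(c[i]), cols), 1:n)
    # Copies with a non-missing element type (the thread proposes views).
    sub = map(c -> collect(skipmissing(view(c, keep))), cols)
    res = fun(sub...)
    res isa AbstractVector && length(res) == length(keep) ||
        throw(ArgumentError("fun must return one value per complete row"))
    # Rows that were skipped stay missing in the output.
    out = Vector{Union{Missing, eltype(res)}}(missing, n)
    out[keep] = res
    return out
end

x = [1, missing, 3]
y = [10, 20, missing]
summed = apply_complete_cases(+, x, y)     # only row 1 is complete
running = apply_complete_cases(cumsum, x)  # rows 1 and 3 are complete
```

Because both columns are restricted to the same rows before fun is called, a two-argument function like + always receives vectors of equal length.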

nalimilan commented 4 years ago

I should have noted that before thinking about making it the default, we should first implement this keyword argument and see how it goes.

matthieugomez commented 4 years ago

@nalimilan Yes, starting with a kwarg and seeing how it goes is the right way. The only reason I mentioned the default value is that 1.0 may mean that this kind of thing can't be changed later on. I hope that's not the case.

@bkamins I agree with 1 and 2.1 (views). I think for the case of [:x, :y] => + it should skip rows with a missing value in either column (as @nalimilan points out), but for [:x, :y] .=> mean, or for :x => mean, :y => mean, it should skip missing values in x and y separately.
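The difference between the two cases can be seen on plain vectors. This is only a sketch of the intended semantics with made-up data; x and y stand in for the columns, and nothing here calls DataFrames.

```julia
using Statistics

x = [1, missing, 3]
y = [10, 20, missing]

# [:x, :y] .=> mean: each column skips only its own missing values.
mean_x = mean(skipmissing(x))  # uses rows 1 and 3
mean_y = mean(skipmissing(y))  # uses rows 1 and 2

# [:x, :y] => +: only rows that are complete in both columns are kept.
complete = .!(ismissing.(x) .| ismissing.(y))
sum_xy = x[complete] .+ y[complete]

# Restricting x to the jointly complete rows would change its mean.
mean_x_complete = mean(x[complete])
```

Here only row 1 is complete in both columns, so a joint rule applied to mean would silently drop the value 3 from x.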

bkamins commented 4 years ago

Yes - for [:x, :y] .=> mean this is separate (which means that transform/select will throw an error in cases when there is no match and a vector is returned).

OK - so it seems we have a consensus here. I will propose a PR.

bkamins commented 4 years ago

Just a maintenance question - when we add skipmissing kwarg should both #2314 and #2258 be closed or something more should be done/kept track of?

matthieugomez commented 4 years ago

That would be awesome, thanks! Just to make sure I understand, what do you mean by "transform/select will throw an error in cases when there is no match and a vector is returned"?

If this happens, https://github.com/JuliaData/DataFrames.jl/issues/2258 should definitely be closed, but this thread should be open if the default value is false, since it may still be cumbersome to add it at every command.

bkamins commented 4 years ago

transform/select will throw an error in cases when there is no match and a vector is returned

I mean that this would still error:

julia> df = DataFrame(a=[1,missing,2], b=[1,missing,missing])
3×2 DataFrame
│ Row │ a       │ b       │
│     │ Int64?  │ Int64?  │
├─────┼─────────┼─────────┤
│ 1   │ 1       │ 1       │
│ 2   │ missing │ missing │
│ 3   │ 2       │ missing │

julia> select(df, :a => collect∘skipmissing)
ERROR: ArgumentError: length 2 of vector returned from function #62 is different from number of rows 3 of the source data frame.

(but I guess this is natural to do it this way)

this thread should be open if the default value is false

OK - we can keep it open then. For 1.0 release the default will be false to be non-breaking.

bkamins commented 4 years ago

That would be awesome, thanks!

Started thinking about it :). The trickiest part will be fast aggregation functions for GroupedDataFrame case (as usual), but also they should be doable.

matthieugomez commented 4 years ago

transform/select will throw an error in cases when there is no match and a vector is returned

I mean that this would still error:

julia> df = DataFrame(a=[1,missing,2], b=[1,missing,missing])
3×2 DataFrame
│ Row │ a       │ b       │
│     │ Int64?  │ Int64?  │
├─────┼─────────┼─────────┤
│ 1   │ 1       │ 1       │
│ 2   │ missing │ missing │
│ 3   │ 2       │ missing │

julia> select(df, :a => collect∘skipmissing)
ERROR: ArgumentError: length 2 of vector returned from function #62 is different from number of rows 3 of the source data frame.

(but I guess this is natural to do it this way)

Just to clarify, I think the following should work:

select(df, :a => collect, skipmissing = true)

and actually also

select(df, :a => collect∘skipmissing , skipmissing = true)

since the function collect∘skipmissing is one to one on the set of values for which a is not missing. In both cases, it just returns the same thing as a.

However, the following should error

combine(df, :a => collect, :b => collect, skipmissing = true)

because the length of (collect∘skipmissing)(a) is different from the length of (collect∘skipmissing)(b).

bkamins commented 4 years ago

Just to clarify, I think the following should work:

select(df, :a => collect, skipmissing = true)

Thank you for commenting on this. So (sorry if I have forgotten some earlier discussions) you want:

select(df, :a => collect , skipmissing = true)

to work but

select(df, :a => collect∘skipmissing)

will fail.

Which means to me that if we want to add such a kwarg it should not be called skipmissing, as I am afraid it would lead to confusion. Also, I would even say that this kwarg should be reserved for select and transform, as in combine you expect a different behaviour.

In particular:

select(df, :a => mean, skipmissing = true)

and

select(df, :a => mean∘skipmissing)

would both work but produce different results.
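A sketch of why the two calls would differ, under the semantics discussed in this thread (the keyword was only a proposal at this point, so the second result is an assumption about the intended behavior). A plain vector stands in for column :a, using the values from the example data frame above.

```julia
using Statistics

a = [1, missing, 2]
m = mean(skipmissing(a))

# select(df, :a => mean∘skipmissing): the scalar is recycled to every row.
every_row = fill(m, length(a))

# select(df, :a => mean, skipmissing = true), as proposed: the result is
# written only to rows where :a is not missing; other rows stay missing.
nonmissing_rows = [ismissing(v) ? missing : m for v in a]
```

Both columns hold the same mean, but the second one keeps a missing in row 2.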

pdeffebach commented 4 years ago

My proposal above about "spreading" missing values seems relevant here.

select(df, :a => collect, skipmissing = true)

This takes :a, applies skipmissing, collects the result, then loops through the indices of :a and fills in the result wherever :a is not missing.

select(df, :a => collect∘skipmissing , skipmissing = true)

This takes :a, applies skipmissing, then applies it again, collects the result, and loops through indices of :a as before.

select(df, :a => mean, skipmissing = true)

Takes :a, applies skipmissing, takes the mean, then loops through and fills in the rows where :a is not missing. Perhaps we can make an exception for scalar values? If a scalar is returned, it gets spread across the entire vector, regardless of missing values? That way it would match

select(df, :a => mean∘skipmissing; skipmissing = false)

EDIT: After playing around with R, now I'm not so sure. I think the biggest annoyance is with filter currently and we should start with that since everyone agrees on that behavior and we are confident it won't result in unpredictable / inconsistent behavior.

matthieugomez commented 4 years ago

Just to clarify, I think the following should work: select(df, :a => collect, skipmissing = true)

Thank you for commenting on this. So (sorry if I have forgotten some earlier discussions) you want:

select(df, :a => collect , skipmissing = true)

to work but

select(df, :a => collect∘skipmissing)

will fail.

Which means to me that if we want to add such a kwarg it should not be called skipmissing, as I am afraid it would lead to confusion. Also, I would even say that this kwarg should be reserved for select and transform, as in combine you expect a different behaviour.

In particular:

select(df, :a => mean, skipmissing = true)

and

select(df, :a => mean∘skipmissing)

would both work but produce different results.

Exactly. Even though I don’t like having a different keyword argument, I see the potential confusion for select/transform.

bkamins commented 4 years ago

And I see the uses of both functionalities :smile:; they should simply have different names. Actually we can have both, i.e. skipmissing doing "hard" skipmissing everywhere and something else which passes missings through, and the name could be passmissing, as passmissing in Missings.jl does exactly this kind of thing.

nalimilan commented 4 years ago

Yes, passmissing would be a more accurate name. OTOH there would be reasons to use skipmissing:

Overall I'm not sure which choice is best. As suggested by @pdeffebach, we could have skipmissing with a special case for functions that return a single scalar, in which case we would fill even rows with only missing values: this is justified because there's no correspondence between input and output rows in that case (like with combine). We could also just throw an error telling to use mean∘skipmissing for now.

nalimilan commented 4 years ago

I think I've found a possible solution. We could add a passmissing=true argument to select and transform (as described above). Defaulting to true would be fine for 99% of use cases and would probably not break much code (if any). It would also be consistent with the approach Julia takes regarding missing values: they either propagate or throw an error, but are never skipped silently by default.

Then we would also add a skipmissing=false argument to combine and filter, consistent with groupby. People would have to pass skipmissing=true or use skipmissing(...) manually. That's a bit annoying, but it's safe, as missing values will never be dropped silently even if they were propagated silently due to the passmissing=true default. Since passmissing=true by default, users will rarely (if ever) have to use that argument, so they won't be confused by the existence of both passmissing and skipmissing.

How does that sound? The main drawback of this approach is the inconsistency between select/transform and combine, but that's probably OK as these are quite different operations anyway.

matthieugomez commented 4 years ago

That’s interesting.

I was leaning on the opposite solution: skipmissing = true default for filter and combine, passmissing = false default for transform.

Skipmissing = true for filter is consistent with other languages (at least Stata/dplyr/data.table), and that is what people want in 100% of cases.

Skipmissing = true for combine is consistent with Stata and pandas. It is also very easy to comprehend.

In contrast, passmissing = true in transform/select would be unique to Julia. It means that stuff like :x => ismissing will give missing instead of true for missing values. I still think it is worth having it but it may be harder to grasp conceptually.

So, in conclusion, my preferred solution would be skipmissing = true for combine and filter, and passmissing = true for select/transform. I get your point about compatibility with the spirit of missing in Base but I don’t see it as that important.

pdeffebach commented 4 years ago

Sorry, what is passmissing? The concept of passmissing only makes sense to me in ByRow, where you could conceivably have missing as an input instead of a vector.

matthieugomez commented 4 years ago

@pdeffebach I was using it for the behavior I was describing for select/transform, whether it is actually called passmissing or skipmissing.

pdeffebach commented 4 years ago

Skipmissing = true for combine is consistent with Stata and pandas. It is also very easy to comprehend.

I'm not sure that's true. The problem Milan is describing is

combine(groupby(df, :a), :x => first, :y => first, skipmissing = true)

In Stata you would just get missings if the first observation is missing. But you would be confident that this was truly the first observation for every group.

Under your proposal, the functions would expand to be

combine(groupby(df, :a), :x => (t -> first(skipmissing(t))), :y => (t -> first(skipmissing(t))))

If :x has missing as the first element, then you are going to get the second element of :x and the first element of :y. This will definitely lead to bugs.
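The pitfall is easy to reproduce on plain vectors standing in for one group's :x and :y columns (made-up values, Base Julia only):

```julia
x = [missing, 2, 3]
y = [10, 20, 30]

# Skipping per column: the two "first" values come from different rows.
first_x = first(skipmissing(x))  # taken from row 2
first_y = first(skipmissing(y))  # taken from row 1

# Without skipping, the result for x is missing, but it does describe row 1.
first_x_raw = first(x)
```

The pair (first_x, first_y) looks like one observation although it mixes rows 2 and 1.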

matthieugomez commented 4 years ago

That’s right, I did not think about that.

Note that, the way I understand it, first in select, with passmissing = true, would suffer from the same problem (i.e. it would give the first non-missing value).

Can you think of other examples that may give surprising results in combine or select/transform? It would be nice to make a list (beyond first and ismissing).

nalimilan commented 4 years ago

Whether skipmissing=true takes into account missing values in all columns or only in those passed to each function is an interesting design decision to make. But AFAICT it's orthogonal to whether we should use passmissing=true and skipmissing=false by default or not. ~And it doesn't concern passmissing=true, which would not apply to combine~ EDIT: it does if we allow broadcasting a scalar result to all non-missing rows.

The main question for choosing the defaults is whether we favor safety or convenience. I'm always torn between these two goals, which is why I tried to propose in the OP various solutions that don't require silently ignoring missing values by default.

Sorry, what is passmissing? The concept of passmissing only makes sense to me in ByRow, where you could conceivably have missing as an input instead of a vector.

@pdeffebach It also works for any function passed to select and transform: we just pass a view of non-missing rows, and copy the resulting vector to the non-missing rows, filling the other entries with missing. This is what you proposed above.

matthieugomez commented 4 years ago

I think @pdeffebach was just disagreeing with my claim that combine with skipmissing = true is what Stata does (while mean ignores missing values, first gets the first element of the column, whether it’s missing or not)

matthieugomez commented 4 years ago

We could also define the kwarg maskmissing, and use it for filter/combine/transform, rather than using a combination of passmissing/skipmissing kwarg.

The definition of maskmissing is that it applies the operation to rows where the columns are non-missing. This explanation is consistent with what we want for filter/combine/transform, so it would allow using the same kwarg for all of these functions (of course, for combine, it is exactly the same as skipmissing).

pdeffebach commented 4 years ago

The thing is, people don't really complain about na.rm = TRUE though, right? I feel like most of the problems we see about missing are really easily fixed by pointing people to skipmissing and passmissing.

combine(gd, :col => (x -> sum(skipmissing(x))))

Is really ugly, yes, but in the future you will be able to write

using DataFramesMeta
@combine(gd, y = sum(skipmissing(:col)))

which is not that bad.

matthieugomez commented 4 years ago

For me the problem is that for now it requires different strategies for filter (which requires coalesce) and combine (which requires skipmissing). Another issue with combine is the whole matter of reductions over several vectors with different missing rows, such as weighted mean, correlation, etc.

Finally, transform sometimes requires passmissing, sometimes skipmissing, coalesce for ifelse calls, something else for tryparse, etc. Also, there’s currently no good way to handle functions with vector input and vector output such as cumsum.

pdeffebach commented 4 years ago

That's fair. It would be hard to tackle each of these individually:

  1. filter should apply coalesce by default.
  2. skipmissing = true should apply passmissing to ByRow functions
  3. We need a lazy broadcasted ifelse that skips bad elements
  4. Better tryparse or a whole destring package which would take a lot of work

Then there’s the whole thing about reductions over several vectors with different missing rows such as weighted mean, correlation etc

Also an issue, skipmissings helps weighted mean, but unfortunately cor didn't make it into Statistics.

Okay that's fair. I'm on board with a keyword argument with a variety of behaviors depending on the context.

bkamins commented 4 years ago

I will take a defensive stance here.

Before I state what I think I will explain the principles of my judgement:

  1. DataFrames.jl should provide primitives that are flexible enough to perform what user needs in a performant way
  2. Other packages (like DataFramesMeta.jl or DataFramesMacros.jl) can add convenience wrappers; this does not mean that we do not want to be convenient in DataFrames.jl, but this is not a priority
  3. We follow a "safety first" and "consistency with Julia Base" policy
  4. We do not want to be breaking unless it is really justified

Given this:

  1. filter should apply coalesce by default.

I think we can add skipmissing kwarg with false as default. This is what Base does. If we can convince Julia Base to change the behaviour here then we will sync with that behaviour.

  2. skipmissing = true should apply passmissing to ByRow functions

Do you mean it for select/transform or also for combine? (also see my comment below)

  3. We need a lazy broadcasted ifelse that skips bad elements
  4. Better tryparse or a whole destring package which would take a lot of work

Agreed, but it is out of scope of DataFrames.jl. I would also add x -> coalesce(x, false) to the list with some convenient name, I think coalescefalse is explicit but maybe you judge it to be too long. I would add this to Missings.jl.

but unfortunately cor didn't make it into Statistics.

I think it is an open issue how to do it rather than closed issue as there are many ways to compute cor on a matrix with missings.

(x -> sum(skipmissing(x)))

this is normally written as sum∘skipmissing and it is not that bad I think.
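As a minimal sketch of the composed form in a grouped aggregation (the data frame and the column names g, x and total are made up for illustration, and the DataFrames.jl 1.x API is assumed):

```julia
using DataFrames

# Toy data: one missing value in group "a".
df = DataFrame(g = ["a", "a", "b"], x = [1, missing, 3])

# sum∘skipmissing is the composed function v -> sum(skipmissing(v)).
out = combine(groupby(df, :g), :x => sum∘skipmissing => :total)
```

The composition is applied per group to the whole column vector, so each group's sum ignores its missing entries.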


Now regarding the design: the more I think about it, the more I want to do the following, which gives fine-grained control over what is going on in the function in a single transformation:

For skipmissing

It would not be a kwarg but an extension to the minilanguage we have for select, transform and combine. In the form:

source => fun => destination

if you wrap source in skipmissing then fun in combine, select and transform gets views of passed columns with rows containing missings skipped.

In general the idea is that source => fun∘skipmissing is the same as skipmissing(source) => fun => destination if source is a single column, but we add this extension to allow easy handling of skipping rows in multiple-column scenarios

For filter we do not add anything but just tell users to use coalescefalse∘fun if we agree to add such a function to Missings.jl.

For passmissing

It is not needed. Just tell users to write passmissing(their_function) - this is exactly what you want in ByRow and if I understand things correctly this is a main use case - is this correct?
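For illustration, a small sketch of wrapping a row-wise function with passmissing from Missings.jl (the parse_int helper and the sample strings are hypothetical):

```julia
using Missings  # provides passmissing

# passmissing(f) returns missing when any positional argument is missing,
# and calls f otherwise.
parse_int = passmissing(s -> parse(Int, s))

parsed = parse_int.(["1", missing, "3"])
```

In a transform call the same wrapped function would presumably go inside ByRow, e.g. :col => ByRow(parse_int) => :parsed.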

pdeffebach commented 4 years ago

I think this is a good proposal.

We might not even need skipmissing in the meta-language. We can add a skipmissingsfun wrapper similar to passmissing that wraps all <:AbstractVector inputs in skipmissings. Though since it applies to the source I think a SkipMissing is more logical.

matthieugomez commented 4 years ago

@bkamins I am trying to understand your proposal:

bkamins commented 4 years ago

filter

My comment in the first part was just that I do not want skipmissing=true by default (no matter what syntax we apply).

My proposal is not to change filter at all but define coalescefalse(x) = coalesce(x, false) (the name can be discussed as this one is a bit long) and then just write:

filter(coalescefalse∘fun, df)

The beauty is that it would also work in Julia Base then which is a big benefit.
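A quick sketch of the idea on a plain vector, using only Base. Note that coalescefalse is the name proposed in this thread, not an existing function, so it is defined locally here:

```julia
# Proposed helper (hypothetical; defined locally for this example).
coalescefalse(x) = coalesce(x, false)

v = [1, missing, 3]

# The predicate x -> x > 1 returns missing for the missing element;
# composing with coalescefalse turns that into false, so filter accepts it.
kept = filter(coalescefalse ∘ (x -> x > 1), v)
```

Without the wrapper, the predicate would hand filter a missing where a Bool is required.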

cumsum

I understand you want this:

julia> df = DataFrame(x=[1,2,missing,3])
4×1 DataFrame
│ Row │ x       │
│     │ Int64?  │
├─────┼─────────┤
│ 1   │ 1       │
│ 2   │ 2       │
│ 3   │ missing │
│ 4   │ 3       │

julia> transform(df, :x => (x -> cumsum(coalesce.(x, 0)) + x .* 0) => :y)
4×2 DataFrame
│ Row │ x       │ y       │
│     │ Int64?  │ Int64?  │
├─────┼─────────┼─────────┤
│ 1   │ 1       │ 1       │
│ 2   │ 2       │ 3       │
│ 3   │ missing │ missing │
│ 4   │ 3       │ 6       │

Then I agree - I would not handle this in my proposal. I assume that passmissing is mostly for ByRow as noted above.

The point is that the use cases like cumsum I think are quite rare (maybe I am wrong here).
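For the rare cases where this is needed, one possible shape for a helper is sketched below. The name on_nonmissing is hypothetical, and the sketch assumes a 1-based vector and a function f that returns one value per non-missing input:

```julia
# Apply a vector -> vector function to the non-missing entries only,
# then put the results back in their original positions.
function on_nonmissing(f, x::AbstractVector)
    keep = .!ismissing.(x)                 # Boolean mask of non-missing rows
    vals = f(collect(skipmissing(x)))      # f sees a vector without missings
    out = Vector{Union{Missing, eltype(vals)}}(missing, length(x))
    out[keep] = vals                       # missing rows stay missing
    return out
end

y = on_nonmissing(cumsum, [1, 2, missing, 3])
```

This gives the same column y as the coalesce-based expression above, without the x .* 0 trick.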


The general idea is to try using function composition and higher order functions rather than keyword arguments as this seems more flexible for the future.


We can add a skipmissingsfun wrapper similar to passmissing that wraps all <:AbstractVector inputs in skipmissings.

I was thinking about it, and initially judged that this is not possible as fun in source => fun => destination can get three things:

But now as I think about it, it is actually not a problem, as this higher-order function could just handle these three cases separately in if-else blocks. It is better than my original proposal as it is more composable: select/transform/combine do not even need to know what it does, and this is good because orthogonality in design makes it much easier to maintain in the long term. The question is what name could be used here. The issue is that skipmissing is defined in Julia Base, and adding skipmissing(::Base.Callable) would be type piracy (even though that method is currently invalid in Base). I am not sure what other name would be good, though; maybe dropmissing?
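A rough sketch of such a higher-order function, covering only the case where fun receives one or more column vectors as positional arguments. The name skiprows is a placeholder chosen for this example (the naming question is open), and wmean is a toy weighted mean:

```julia
# Hypothetical wrapper: not part of DataFrames.jl, Missings.jl or Base.
function skiprows(f)
    return function (cols::AbstractVector...)
        # Keep only the row indices where no column holds a missing value.
        keep = [i for i in eachindex(cols...) if !any(c -> ismissing(c[i]), cols)]
        # Pass views of the complete rows to the wrapped function.
        return f((view(c, keep) for c in cols)...)
    end
end

x = [1.0, missing, 3.0, 4.0]
w = [0.5, 1.0, missing, 2.0]
wmean(x, w) = sum(x .* w) / sum(w)

result = skiprows(wmean)(x, w)   # uses rows 1 and 4 only
```

Because the wrapper is an ordinary function, select/transform/combine would not need to know about it: it could be written as [:x, :w] => skiprows(wmean) => :out.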

nilshg commented 4 years ago

@bkamins pointed me here, but unfortunately this thread is a bit longer than I've currently got time for. The reason I was pointed here was that I asked for comments on a proposal that touches on the topic of this thread: I am actually mostly in favour of a more cumbersome, force-the-user-to-think-carefully-and-be-explicit approach for the most part, my main gripe with missing is when I filter a DataFrame (i.e. use ==, >, <). In those cases, the additional verbosity required to get things to work is quite large, and to me at least the possibility of accidentally doing something unintended is low - I've never encountered a use case where in doing df.col .> 5 I wanted rows for which col is missing to be included in the result.

With these two considerations in mind, in my own code I started defining:

⋘(x, y) = (ismissing(x) | ismissing(y)) ? false : (x < y)
⋙(x, y) = (ismissing(x) | ismissing(y)) ? false : (x > y)
⩸(x, y) = (ismissing(x) | ismissing(y)) ? false : (x == y)

I clearly didn't spend much time choosing the operators, I just wanted infix operators that look similar to but different from the regular comparison operators (one issue with my current selection is that their names are completely different: \verymuchless, \ggg and \equivDD aren't exactly easy to memorize as a group of operators...)

In any case this relatively simple case for me solves the main usability issue I encounter in day-to-day work.

matthieugomez commented 4 years ago

@bkamins you mentioned there might be a way to have ifelse, filter work with missing values in base. You really think so? Maybe that’s one way to go.

bkamins commented 4 years ago

Maybe that’s one way to go.

I meant that this issue could be opened in Julia Base and discussed there. I feel that if it got support, it would be in the 2.0 release at the earliest. Therefore I think we should have a solution for now in DataFrames.jl that does not rely on Julia Base.

(still, if you feel it is worth discussing, please open an issue on Julia Base; I just recommend being patient, as core devs tend to be conservative and require quite a lot of justification for changes)

EDIT: Actually, given our discussions, apart from ifelse and filter the third big thing is getindex with a selector of Union{Missing, Bool} eltype.

nalimilan commented 4 years ago

I agree that it would also be OK to keep DataFrames as a relatively low-level package which doesn't offer the most convenient syntax, as long as it provides the building blocks to do anything. Anyway it's clear that the "mini-language" will never be as intuitive as DataFramesMeta's macros. So maybe we should try to see at the same time how we want to make working with missing values easy in DataFramesMeta, and see what we need in DataFrames to allow implementing it.

Adding skipmissing(f::Callable) to Base sounds doable. I'm less sure about filter and getindex skipping missing values, as this goes against what happens everywhere else (missing values are never dropped silently). But we don't need that for DataFrames/DataFramesMeta AFAICT.

The difficulty with finding a good solution in DataFramesMeta is that the equivalent implementations like dplyr rely on the fact that functions are vectorized and/or propagate missing values in totally ad-hoc way. For example, R/dplyr allows mutate(df, y = isna(x) ? "0" : as.character(x)), where as.character propagates missing values, but isna does not. R/dplyr also allows mutate(df, y = x - mean(x)) since - is vectorized, but we require .- in Julia -- which is OK here but can get quite unwieldy with complex commands. The fact that Julia is much more systematic is a strength for advanced users, but it can also be a barrier for more basic use, as R usually does "the right thing" (but when it doesn't it's a pain).

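Maybe we could handle this difficulty by having all macros in DataFramesMeta operate row-wise by default, and require a special syntax to indicate that a column should be accessed as a vector. For example, @transform(df, y = x - mean($x)). Then it seems to me that the most intuitive behavior is this:

  • all columns accessed as vectors (with $) are wrapped in a view of non-missing values by default (i.e. skipmissing=true)
  • if some columns are accessed row-wise (no $), set passmissing=true (since we are in a case equivalent to ByRow)
  • if a function applied to a column vector (accessed using $) returns a vector (with its length equal to the number of non-missing rows) rather than a scalar, set passmissing=true

AFAICT this would do "the right thing" in the following situations:

  • @transform(df, y = mean($x)): assign to all rows (including those with missing values) the mean of non-missing values
  • @transform(df, y = x - mean($x)): subtract from each non-missing row the mean of non-missing values
  • @transform(df, y = cumsum($x)): assign to rows with non-missing values the cumulative sum of values for these rows
  • @transform(df, y = s - cumsum($x)): subtract from each non-missing row the cumulative sum of values for these rows

Cases that wouldn't do "the right thing" by default are uses of ismissing (as missing values would always be skipped by default). We could decide to be smart and automatically set passmissing=false or skipmissing=false when an expression contains ismissing, or just throw an error with an informative message. A trickier case is logical operators, which implement three-valued logic: with passmissing=true, true | missing would be assigned missing since | would not actually be called. Maybe that's not a big deal.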
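The three-valued logic point can be checked directly. The sketch below uses passmissing from Missings.jl as a stand-in for what a passmissing=true option would do; that option itself is only a proposal here:

```julia
using Missings: passmissing

# Base implements three-valued logic: true | missing is known to be true.
plain = true | missing

# With passmissing, | is never called once an argument is missing.
wrapped = passmissing(|)(true, missing)

# When the result really is unknown, both approaches agree.
unknown = false | missing
```

So the two behaviours differ only where three-valued logic can resolve the result despite the missing operand.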

matthieugomez commented 4 years ago

I was hoping that DataFramesMeta would just be DataFrames + a macro to create Pairs expressions (e.g. DataFramesMacros), but maybe that's the wrong way to look at it. My concern is that having a barebone DataFrames effectively creates 3 different syntaxes: the simple df.a syntax, the transform syntax, and the @transform syntax.

I need to think more about these proposals. In the meantime, I just wanted to point out that Stata has two different commands for row-wise vs. column-wise transforms (gen vs. egen). That is something to think about if it would simplify things: e.g. there could be a row-wise version of transform called, say, alter.

On Wed, Aug 26, 2020 at 9:41 AM Milan Bouchet-Valat < notifications@github.com> wrote:

I agree that it would also be OK to keep DataFrames as a relatively low-level package which doesn't offer the most convenient syntax, as long as it provides the building blocks to do anything. Anyway it's clear that the "mini-language" will never be as intuitive as DataFramesMeta's macros. So maybe we should try to see at the same time how we want to make working with missing values easy in DataFramesMeta, and see what we need in DataFrames to allow implementing it.

Adding skipmissing(f::Callable) to Base sounds doable. I'm less sure about filter and getindex skipping missing values, as this goes against what happens everywhere else (missing values are never dropped silently). But we don't need that for DataFrames/DataFramesMeta AFAICT.

The difficulty with finding a good solution in DataFramesMeta is that the equivalent implementations like dplyr rely on the fact that functions are vectorized and/or propagate missing values in totally ad-hoc way. For example, R/dplyr allows mutate(df, y = isna(x) ? "0" : as.character(x)), where as.character propagates missing values, but isna does not. R/dplyr also allows mutate(df, y = x - mean(x)) since - is vectorized, but we require .- in Julia -- which is OK here but can get quite unwieldy with complex commands. The fact that Julia is much more systematic is a strength for advanced users, but it can also be a barrier for more basic use, as R usually does "the right thing" (but when it doesn't it's a pain).

Maybe we could handle this difficulty by having all macros in DataFramesMeta operate row-wise by default, and require a special syntax to indicate that a column should be accessed as a vector. For example, @transform(df, y = x - mean($x)). Then it seems to me that the most intuitive behavior is this:

  • all columns accessed as vectors (with $) are wrapped in a view of non-missing values by default (i.e. skipmissing=true)
  • if some columns are accessed row-wise (no $), set passmissing=true (since we are in a case equivalent to ByRow)
  • if a function applied to a column vector (accessed using $) returns a vector (with its length equal to the number of non-missing rows) rather than a scalar, set passmissing=true

AFAICT this would do "the right thing" in the following situations:

  • @transform(df, y = mean($x)): assign to all rows (including those with missing values) the mean of non-missing values
  • @transform(df, y = x - mean($x)): subtract from each non-missing row the mean of non-missing values
  • @transform(df, y = cumsum($x)): assign to rows with non-missing values the cumulative sum of values for these rows
  • @transform(df, y = s - cumsum($x)): subtract from each non-missing row the cumulative sum of values for these rows

Cases that wouldn't do "the right thing" by default are uses of ismissing (as missing values would always be skipped by default). We could decide to be smart and automatically set passmissing=false or skipmissing=false when an expression contains ismissing, or just throw an error with an informative message. A trickier case is logical operators, which implement three-valued logic: with passmissing=true, true | missing would be assigned missing since | would not actually be called. Maybe that's not a big deal.

bkamins commented 4 years ago

@nalimilan - thank you for commenting on this. I guess it is best to move the discussion to DataFramesMeta.jl. My questions would be twofold:

@matthieugomez - I agree that we could have DataFramesMacros.jl (when it matures) be just an addition to DataFrames.jl and this should work. But for DataFramesMeta.jl I accept it adds new things (just as e.g. queryverse adds) - the user simply has to choose what one wants to use.

However, I feel that with proper composition/higher order functions we can get quite far. E.g. if we find short forms of what @nilshg proposes we can already solve most of the problems. I was thinking about it and even defining normal functions with names lt, leq, gt, geq, eq defined like:

lt(x,y) = coalesce(x<y, false)

would be super convenient. I would propose to have them in Missings.jl. We would take hold of 5 very short names (bad thing) but would get flexibility in a place where it is much needed, also I think writing lt(x,y) is not that much worse than x < y. If we had these five functions I think that mostly we do not need coalescefalse, as in other cases user can just write x -> coalesce(x, false) since they will be rare anyway.

EDIT: if we went for this also neq
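Spelled out, the full set might look like the sketch below. These are the names suggested in this comment; none of them exist in Missings.jl at the time of writing, so they are defined locally, and the sample vectors are made up:

```julia
# Comparison helpers that treat a missing result as false (proposed names).
lt(x, y)  = coalesce(x < y, false)
leq(x, y) = coalesce(x <= y, false)
gt(x, y)  = coalesce(x > y, false)
geq(x, y) = coalesce(x >= y, false)
eq(x, y)  = coalesce(x == y, false)
neq(x, y) = coalesce(x != y, false)

a = [1, missing, 3]
b = [2, 2, missing]

mask = lt.(a, b)      # plain Bool results, safe to use as a row selector
selected = a[mask]
```

Since the broadcasted result contains only Bool values, it can be used for indexing or passed to filter without further coalescing.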

pdeffebach commented 4 years ago

I agree with @matthieugomez, especially with upcoming multithreading in transform I think it would be very hard to maintain feature parity if the two packages diverged too much.

I think @nalimilan's proposal is good, but is not functionally different from a DropMissing([:x1, :x2]) => fun syntax that could go in the data frames mini-language.

I do agree with Milan that ByRow being the default would solve a lot of these problems.

However, I feel that with proper composition/higher order functions we can get quite far. E.g. if we find short forms of what @nilshg proposes we can already solve most of the problems. I was thinking about it and even defining normal functions with names lt, leq, gt, geq, eq defined like:

I think a very smart macro could fix this easier.

@missingfalse x > y

bkamins commented 4 years ago

I think a very smart macro could fix this easier.

@missingfalse x > y

this is clearly doable as it is the same as:

coalesce(x>y, false)

but just a bit longer to write :smile:
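A possible sketch of such a macro follows. It is hypothetical (no package discussed here defines @missingfalse), and this version broadcasts coalesce so that it also covers the vectorized comparisons mentioned in this thread:

```julia
# Hypothetical macro: wraps an expression so that missing results become false.
# coalesce is broadcast so that both scalar and elementwise comparisons work.
macro missingfalse(ex)
    return :(coalesce.($(esc(ex)), false))
end

x = missing
scalar_result = @missingfalse(x > 1)

col = [1, missing, 3]
vector_result = @missingfalse(col .> 1)
```

The macro only saves typing the coalesce call; it does not rewrite the comparison itself.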

My point (and this is how I understood what @nilshg wanted) is to have something that is easy to type. And I feel that writing e.g. lt.(df.col1, df.col2) is not much worse than df.col1 .< df.col2 and much better than @missingfalse df.col1 .< df.col2 or coalesce.(df.col1 .< df.col2, false).