JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/
Other
1.74k stars 367 forks source link

Skipping missing values more easily #2314

Open nalimilan opened 4 years ago

nalimilan commented 4 years ago

It seems that dealing with missing values is one of the most painful issues we have, which goes against the very powerful and convenient DataFrames API. Having to write things like filter(:col => x -> coalesce(x > 1, false), df) or combine(gd, :col => (x -> sum(skipmissing(x))) isn't ideal. One proposal to alleviate this is https://github.com/JuliaData/DataFrames.jl/issues/2258: add a skipmissing argument to functions like filter, select, transform and combine to unify the way one can skip missing values, instead of having to use different syntaxes which are hard to grasp for newcomers and make the code more complex to read.

That would be one step towards being more user-friendly, but one would still have to repeat skipmissing=true all the time when dealing with missing values. I figured two solutions could be considered to improve this:

Somewhat similar discussions have happened a long time ago (but at the array rather than the data frame level) at https://github.com/JuliaStats/DataArrays.jl/issues/39. I think it's fair to say that we know have enough experience now to make a decision. One argument against implementing this at the DataFrame level is that it will have no effect on operations applied directly to column vectors, like sum(df.col). But that's better than nothing.

Cc: @bkamins, @matthieugomez, @pdeffebach, @mkborregaard

nalimilan commented 4 years ago

I just wanted to point out that Stata has two different commands for row-wise vs, column-wise transforms (gen v.s. egen). That is something to think about if that would simplify thing, e.g. there could be a row version of transform, called, say, alter.

@matthieugomez Indeed. My proposal is basically a way to combine Stata's gen and egen in a single expression, using $ to distinguish them. Anyway I think I would prefer specifying something like ByRow in @transform/@select (whatever the exact syntax) than a separate function, as it makes it easier to understand the API than adding functions with totally different names.

Not allowing mixing column-wise and row-wise operation like in my proposal would certainly be simpler. The main drawback would be that one would have to use broadcasted operations, e.g. x .- mean(x) rather than x - mean($x). Maybe that's OK but that increases the complexity.

* how would you turn-off this behaviour in DataFramesMeta.jl if it is not wanted

@bkamins I think you could just pass passmissing and skipmissing arguments to the macro, a bit like we discussed for functions.

* in `s - cumsum($x)` I was not clear what would happen in the scenario `s=[missing, missing, 1]` and `x=[1,2,3]`.

cumsum($x) would become cumsum([1]) == [1], and then we would call cs = cumsum([1]); x -> s - cs on each non-missing row (i.e. the last one with x == 3), filling the first two rows with missing.

However, I feel that with proper composition/higher order functions we can get quite far. E.g. if we find short forms of what @nilshg proposes we can already solve most of the problems. I was thinking about it and even defining normal functions with names lt, leq, gt, geq, eq

I must say I'm not a fan of ad-hoc solutions like that. This would solve some of the issues, but not all of them and people will have to learn both the custom functions and the general mechanism using passmissing. Also we would lose the nice x < y syntax, which is really lousy for a high-level language. A macro like @missingfalse would be better, though it's still a bit verbose for DataFramesMeta I would say.

I agree with @matthieugomez, especially with upcoming multithreading in transform I think it would be very hard to maintain feature parity if the two packages diverged too much.

@pdeffebach Yes that's a concern. Everything DataFramesMeta does should be translatable relatively directly to DataFrames commands so that we don't implement things twice.

I think @nalimilan's proposal is good, but is not functionally different from a DropMissing([:x1, :x2]) => fun syntax that could go in the data frames mini-language.

I do agree with Milan that ByRow being the default would solve a lot of these problems.

Actually I think my proposal is a bit different, as you need to run first column-wise operations and then insert them in the row-wise function. I'm not sure how it could be translated into the select/transform syntax...

bkamins commented 4 years ago

OK - given these comments can you explain why you think:

coalesce(x>y, false)

is bad?

nalimilan commented 4 years ago

I don't think it's bad as a low-level way of doing things, as it always works and is clear about what it does. I just think for DataFramesMeta we need a general solution which allows you to write the code as if there weren't any missing values (possibly adding an argument/macro to enable that behavior), so that you don't need to constantly adjust the code using various strategies which are hard to remember.

bkamins commented 4 years ago

OK - for DataFramesMeta.jl this is fine. But then what for DataFrames.jl?

We started a discussion with adding skipmissing and passmissing keyword arguments. For me - from the discussion - it turns out that other options like coalesce(x>y, false) are not that much longer and they are still clear what is going on as you noted. So is then your suggestion to leave things in DataFrames.jl as they are and just improve the documentation to teach users properly use:

? (maybe just adding the skipmissingfun wrapper - probably with a shorter name - as @pdeffebach proposed as it cannot be handled using the options above when you pass multiple columns in source => fun => destination)

pdeffebach commented 4 years ago

proposed as it cannot be handled using the options above when you pass multiple columns in source => fun => destination)

It can! I wrote a PR for that in Missings that does this.

bkamins commented 4 years ago

I know this PR, but it does not do what we need in DataFrames.jl:

get three things:

  • a list of vectors as positional arguments
  • a NamedTuple of vectors
  • an AbstractDataFrame

and that solution does not handle them all unfortunately (otherwise this is exactly what we need :+1: )

bkamins commented 4 years ago

What we could do is:

  1. add NamedTuple of vectors support in Missings.jl - @pdeffebach => would you agree to add it there
  2. release Missings.jl
  3. add AbstractDataFrame support in DataFrames.jl - as this would not be a type piracy => @pdeffebach => also you are then welcome to add this method (it should be just a view of this data frame removing rows with missings)
bkamins commented 4 years ago

OK, so given also a short discussion with @pdeffebach on Slack:

The SkipMissing wrapper would be supported only in select and transform and if GroupedDataFrame was passed to it it would not employ the fast path (but I guess this is most likely not a use case for fast path anyway - right?).

nalimilan commented 4 years ago

Sounds like a good idea for DataFrames and probably for Tables.jl in general.

For DataFramesMeta, here's a simpler proposal than the one I made above, and which should be easy to implement using DataFrames functions. @transform/@select/@combine would be either row-wise or column-wise, with some syntax like ByRow/ByCol, @byrow/@bycol or byrow=true/bycol=true to switch between these (not sure which one should be the default, that's a separate discussion). In some cases this is a bit annoying as you need to use broadcasting explicitly when operating column-wise, e.g. in @filter(df, x .> mean(x)) or @transform(df, y = x .- mean(x)) but maybe it's not the end of the world as these are not the most common operations.

Missing values would be propagated or skipped by default. For row-wise operation, (the equivalent of) passmissing=true would be the default (and we could throw an error if ismissing is used to help users). For column-wise operation, skipmissing=true would be the default; additionally for @select/@transform, if the function returns a vector (of the same length as its input), the passmissing=true behavior would be used. This will allow "doing the right thing" for both y = mean(x) and y = x .- mean(x). So in practice ByRow(f) and x -> f.(x) would be equivalent, which is an important property. One would be able to pass passmissing and skipmissing manually as arguments to override the default. This would be needed notably when recoding missing values, using ismissing or things like replace(x, missing => 0) (which is actually the main drawback of propagating missing values).

Does that sound reasonable?

nalimilan commented 4 years ago

Independent of this a SkipMissing wrapper of the form SkipMissing(source) => fun => destination can be added that will do the following:

  1. take `source` columns
  2. apply `skipmissings` to them
  3. perform computation and pseudo broadcasting into length of the input after performing `skipmissings`
  4. finally map the result of step 3 into original length of `source` leaving `missing` values in rows that were skipped by `skipmissings`

@bkamins What you describe sounds useful in general, and in particular for my last DataFramesMeta proposal (though note I suggest filling all rows and not just non-missing rows when the function returns a scalar). But I think we should find another name for this, as I would expect SkipMissing(f) to be equivalent to f ∘ skipmissing. So this is unlikely to be accepted in Base. SkipAndFillMissing would be accurate but a bit verbose. SkipFillMissing? Or just FillMissing?

The SkipMissing wrapper would be supported only in select and transform and if GroupedDataFrame was passed to it it would not employ the fast path (but I guess this is most likely not a use case for fast path anyway - right?).

The fast path only supports single-row reductions anyway, right? So that doesn't apply here.

bkamins commented 4 years ago

@nalimilan - thank you for the comments.

I would expect SkipMissing(f) to be equivalent to f ∘ skipmissing

The point is that @pdeffebach proposed to have SkipMissing not on fun but on source, so you would write SkipMissing([:x1, :x2]) => cor => :cor for instance, and this allows us to define it more flexibly (but I agree we could also have a better name if we can think of any)

I suggest filling all rows and not just non-missing rows when the function returns a scalar

I was afraid that it would be confusing that returning a vector and returning a scalar behave differently. But maybe this is OK (on purpose I excluded combine from the list as in that case we would hit many more problems, with select and transform things are easier).

I think we need to decide here which operation is more useful: your proposal or my proposal. I feel that my proposal gives a more consistent design, but maybe the usefulness of your proposal outweighs this shortcoming (but let us deeply discuss this, not to go back to it later - like we have recurring discussions if select should be by row or by whole column by default - as when we make a decision it will be fixed for a long time - 😄).

In particular - given your semantics how would you achieve my result (i.e. fill only non-missing with the aggregate)?

The fast path only supports single-row reductions anyway, right? So that doesn't apply here.

If we accept your proposal (i.e. fill scalars everywhere) then there is no difference. If we accept my proposal (i.e. broadcast scalar only to non-missing slots) then there is a difference between:

select(df, :x , :y => sum ∘ skipmissing)

and

select(df, :x , SkipMissing(:y) => sum)
nalimilan commented 4 years ago

The point is that @pdeffebach proposed to have SkipMissing not on fun but on source, so you would write SkipMissing([:x1, :x2]) => cor => :cor for instance, and this allows us to define it more flexibly (but I agree we could also have a better name if we can think of any)

OK. The problem is, SkipMissing(cols) => fun already has a meaning, i.e. apply fun to columns in vector cols, excluding missing values if it contains some. Granted, it's a quite unlikely usage, but still...

In particular - given your semantics how would you achieve my result (i.e. fill only non-missing with the aggregate)?

In DataFramesMeta, you would just pass passmissing=true explicitly. In DataFrames, you would use skipmissing or skipmissings manually instead of using SkipMissing(cols) => fun/SkipFillMissing(cols) => fun/whatever. In both, you could also return fill(value, length(x)) instead of just value.

I agree this is an important point to settle. I suspect that in practice when doing reductions rows with missing values should get the same value as others. This is the case for example if you want to add a variable containing group means. That's what Stata's egen does AFAICT (right @matthieugomez?). Can we find examples where we would like the opposite behavior (i.e. rows with missings are filled with missing instead of with the returned scalar)?

bkamins commented 4 years ago

SkipMissing(cols)

Ah - I have forgotten that it is in Base, but it is just unexported. Then yes - we should not change the meaning of this name.

In both, you could also return fill(value, length(x)) instead of just value.

Fair point, except that in ByRow you do not have an access to length(x), especially in GroupDataFrame context. (but you can get it using other means anyway so it would be OK)

Can we find examples where we would like the opposite behavior

Thinking of it - you are probably right. As later if you use this reduction with missing in the original column it will become missing anyway.

The only reduction I can think of is:

transform(df, SkipMissing(:x) => x -> true)

where we get true indicator for non-missing, but of course it does not make much sense to do such a transformation.

matthieugomez commented 4 years ago

In Stata, mean of a variables creates a variable equal to the mean on non-missing rows, and missing values otherwise. Edit: no sorry it creases the mean for every rows.

I have also realized that, one issue with passmissing for functions that return vectors (as proposed by @nalimilan), is that lag will not work correctly.

bkamins commented 4 years ago

is that lag will not work correctly.

can you please give an example what you exactly mean?

pdeffebach commented 4 years ago

@matthieugomez I was trying to write a post comparing

gen y = mean(x)
gey y = mean(x) if !missing(x)

But I don't have a Stata installation at the moment. Can you confirm when the mean is different?

I think the proposed behavior is fairly close to idiomatic stata. Enough that you can translate one-for-one, but I want to make sure .

matthieugomez commented 4 years ago

@nalimilan Yes in the second case the mean is only given on rows that are not missing.

matthieugomez commented 4 years ago

@bkamins sorry: I expect lag([1, missing]) to return [missing, 1]. I think with @nalimilan’s proposal and passmissing = true within a transform call, it would return [missing, missing]

pdeffebach commented 4 years ago

sorry: I expect lag([1, missing]) to return [missing, 1]. I think with @nalimilan’s proposal and passmissing = true within a transform call, it would return [missing, missing]

That's fine, though. lag doesn't error with missing values so there is no need to do it (assuming we don't make DropMissing the default, which I don't think we should yet).

lags are hard anyways, you have to remember grouping etc.

nalimilan commented 4 years ago

Yes, lag/lead/diff are another case that would require opting-out explicitly of missing values propagation (if we make it the default). If we think this kind of situation (in addition to ismissing, etc.) is too common or too confusing, we could avoid it by default and require people to use skipmissing=true/ passmissing=true explicitly. Anyway we could first try this system and later see whether it should be the default.

But for simplicity I think it would be better to have a single argument called skipmissing that would do both, since we concluded that there are no major use cases in which passmissing and skipmissing would need to be separate. For row-wise operation, skipmissing somewhat makes sense as a way to say "only apply the operation to non-missing rows (and fill others with missing)". It's a slight abuse but probably OK. For column-wise operation, skipmissing would be exactly what happens to input columns, and the passmissing behavior would just be a consequence of that: if the function returns a single scalar, we can fill all rows, including missing ones (passmissing=false in our terminology), but if it returns a vector of the same length as its input, we can only fill non-missing rows (passmissing=true in our terminology).

bkamins commented 4 years ago

(if we make it the default)

I understand you are talking about DataFramesMeta.jl here - right. In DataFrames.jl I do not think we will make it a default as it would be breaking.

But for simplicity I think it would be better to have a single argument called

Again - I understand you mean it for DataFramesMeta.jl - right? As for data frames, we want to have a wrapper around column selector in source => fun => destination (we need a name as we cannot use SkipMissing).

bkamins commented 4 years ago

Regarding the discussion #2484.

I think that where should have the following specification (I leave out some sanity checks but give a simplified idea of the impementation):

An element of args can only be of the form cols => fun and we transform it to cols => fun => x_i (to make sure we have unique output column names).

Regarding the comment by @nalimilan:

One potential issue is that having where skip missing values but other functions like combine, select and transform not skip them

I do not propose to skip missing values. What I propose is to use isequal not == for testing (and this is a logically valid rule). Actually in my codes I normally use isequal because of this for logical conditions as a rule.

We could add a errror_on_missing (or something similar) kwarg to where to choose if we use isequal or == to do the test (but I would still make it default to false, i.e. use isequal by default)

pdeffebach commented 4 years ago

You need gdf above, as transformations are on the group level when given a grouped data frame above.

I think we can use isequal without addressing transform and select. We aren't skipping missing during the transformation, just when we choose which rows to keep.

pdeffebach commented 4 years ago

I don't think we are ready to address transform and select yet. There hasn't been consensus on behavior. better to implement something in Missings.jl here and give users a chance to see if they like it.

bkamins commented 4 years ago

You need gdf above

fixed, it was a typo

matthieugomez commented 4 years ago

+1. It would also be nice to have a where! version, as well as a view kwarg for where.

Last thing is the name. I like where. The only issue is that it is used for special syntax in Julia. Is that a problem? In particular, I think it creates some issues with using Lazy’s macro @>, but maybe it can be fixed.

On Mon, Oct 19, 2020 at 6:27 AM Bogumił Kamiński notifications@github.com wrote:

You need gdf above

fixed, it was a typo

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaData/DataFrames.jl/issues/2314#issuecomment-712156073, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPPPXPIIASGWALSXHZIEZ3SLQ5FXANCNFSM4OTTLDUA .

bkamins commented 4 years ago

It would also be nice to have a where! version, as well as a view kwarg for where.

Sure - we would mirror that - thank you for raising this, as it is easy to forget.

I like where

I think if we like it (and I guess we do not have a better name) then we should try to make sure that the "ecosystem" works with it. For sure this is not a problem for DataFramesMeta.jl. @pdeffebach - have you experimented with this name in DataFramesMeta.jl.

pdeffebach commented 4 years ago

where works with linq, but the Lazy pipings all work with @where so there hasn't been a conflict.

nalimilan commented 4 years ago

I do not propose to skip missing values. What I propose is to use isequal not == for testing (and this is a logically valid rule). Actually in my codes I normally use isequal because of this for logical conditions as a rule.

We could add a errror_on_missing (or something similar) kwarg to where to choose if we use isequal or == to do the test (but I would still make it default to false, i.e. use isequal by default)

Well I'd say you're playing with words. :-) Selecting rows for with isequal(x, true) is really skipping rows for which x is missing. And I would be inclined to call the argument skipmissing=true instead of errror_on_missing=false (that would be more standard given our terminology).

TBH I'm not really sure I'm opposed to doing that, but I feel it's a bit inconsistent to consider that where shouldn't throw an error in the presence of missing values (when you don't handle them manually), but that combine/select/transform should. Of course it's much simpler and obvious what to do to handle missing values with where, so it's easier and less risky to implement. But in both cases 1) you lose some safety if you didn't expect missing values to be present and they happen to be there, and 2) it's still painful to work with combine/select/transform in the presence of missing values.

matthieugomez commented 4 years ago

Is not it what dplyr does already (ie skip missing in filter but not in mutate)?

On Mon, Oct 19, 2020 at 1:13 PM Milan Bouchet-Valat < notifications@github.com> wrote:

I do not propose to skip missing values. What I propose is to use isequal not == for testing (and this is a logically valid rule). Actually in my codes I normally use isequal because of this for logical conditions as a rule.

We could add a errror_on_missing (or something similar) kwarg to where to choose if we use isequal or == to do the test (but I would still make it default to false, i.e. use isequal by default)

Well I'd say you're playing with words. :-) Selecting rows for with isequal(x, true) is really skipping rows for which x is missing. And I would be inclined to call the argument skipmissing=true instead of errror_on_missing=false (that would be more standard given our terminology).

TBH I'm not really sure I'm opposed to doing that, but I feel it's a bit inconsistent to consider that where shouldn't throw an error in the presence of missing values (when you don't handle them manually), but that combine/select/transform should. Of course it's much simpler and obvious what to do to handle missing values with where, so it's easier and less risky to implement. But in both cases 1) you lose some safety if you didn't expect missing values to be present and they happen to be there, and 2) it's still painful to work with combine/select/transform in the presence of missing values.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaData/DataFrames.jl/issues/2314#issuecomment-712415696, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPPPXNZOOAMIVCDP2VM3TLSLSMX3ANCNFSM4OTTLDUA .

bkamins commented 4 years ago

Well I'd say you're playing with words. :-)

The point is that we are not skipping missing in the sense how e.g. select would skip them. As for select skipmissing means REMOVE FROM THE COLLECTION or IGNORE THEM while in where it is TREAT THEM AS FALSE. And this is the point of my reasoning. This is a different operation that we perform (skipping missings in the sense of select in where would make no sense).

skipmissing=true

For the above reason I have not proposed skipmissing kwarg but some other name (of course I am not attached to the proposed one as it is not nice obviously).

My point is that the decision what to do in select and how to implement where are logically orthogonal although they have very similar end effect.

bkamins commented 4 years ago

Regarding where for GroupedDataFrame we have two options:

  1. it would produce rows in the order of the parent (i.e. it would internally rely on select; in this scenario subsetting groups in GroupedDataFrame is not allowed)
  2. it would produce rows in the order of groups (i.e. it would internally rely on combine; in this scenario subsetting groups in GroupedDataFrame is allowed)

Which one do we prefer?

pdeffebach commented 4 years ago
  1. As it's what I've already implemented in DataFramesMeta. But in general i think its good to not re-order things too much, and this is way closer to the mental model of select then filter.