Open nalimilan opened 4 years ago
- Have DataFramesMeta simplify the handling of missing values. This could be via a dedicated macro, e.g. @linqskipmissing, or a statement like skipmissing within a @linq block that would automatically pass skipmissing=true to all subsequent operations in a chain or block. This wouldn't really help with operations outside such blocks though.
This is a good idea, however I also like skipmissing = true
at the level of a transform
call or even at the level of argument because it's explicit.
Perhaps DataFramesMeta could provide a block-level skipmissing
option as well as a macro like
@transform(df, @sm y = :x1 .+ mean(:x2))
Where @sm
macro, does the necessary transformation described in #2258
function wrapper(fun, x, y)
    sx, sy = Missings.skipmissings(x, y) # need to be on Missings master
    sub_out = fun(sx, sy)
    full_out = Vector{Union{eltype(sub_out), Missing}}(missing, length(x))
    full_out[eachindex(sx)] .= sub_out # eachindex(sx) returns indices of complete cases
    return full_out
end
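The same "apply on complete cases, then spread back" idea can be sketched with Base only. This is an illustration under stated assumptions, not the implementation discussed in #2258: the name apply_on_complete and the sample data are invented for the example, and 1-based vectors are assumed.

```julia
using Statistics

# Hypothetical helper (not part of DataFrames.jl or Missings.jl): apply `fun`
# to the rows where both x and y are non-missing, then spread the result back
# over a full-length vector, leaving `missing` in the incomplete rows.
function apply_on_complete(fun, x::AbstractVector, y::AbstractVector)
    length(x) == length(y) || throw(DimensionMismatch("x and y must have the same length"))
    keep = findall(i -> !ismissing(x[i]) && !ismissing(y[i]), eachindex(x))
    # identity.(...) narrows the element type now that no missing values remain
    sub_out = fun(identity.(x[keep]), identity.(y[keep]))
    full_out = Vector{Union{eltype(sub_out), Missing}}(missing, length(x))
    full_out[keep] .= sub_out
    return full_out
end

x1 = [1, missing, 3, 4]
x2 = [10, 20, missing, 40]
out = apply_on_complete((a, b) -> a .+ mean(b), x1, x2)
```

Only rows 1 and 4 are complete here, so the mean is taken over 10 and 40 and rows 2 and 3 stay missing in the output.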
- Have a field in DataFrame objects that would store the default value to use for the skipmissing argument. By default it would be false, so that you get the current behavior, which ensures safety via propagation of missing values. But when you know you are working with a data set with missing values, you would be able to call skipmissing!(df, true) once and then avoid repeating it.
I'm not a fan of this idea since it's behavior depending on a global state that could be set a long ways from where the transform call is. It seems like it could cause debugging to be a pain.
It's great you're thinking about how to make working with missing values easier! I 100% agree.
A macro may be good.
I'm not sure I'm convinced by an option at the level of the dataframe. It sounds complicated to keep track of it. For instance, won't people be confused that stuff like df = merge(df, df_using)
does not retain the option?
A third possibility would be to allow users to change the default option for transform
, etc, say by writing SKIPMISSING = true
at the start of a script.
To throw in a much more crazy idea: could we use contextual dispatch to override the behavior of all reductions and comparisons within a whole expression so that they are "missing-permissive"?
I've long been frustrated with the difficulty of working with missing
given that it infects all downstream operations and wished it worked differently in Base. (I acknowledge my frustration could be misguided — perhaps it's saved me from some horrible bugs without knowing it :-) )
Perhaps DataFramesMeta could provide a block-level skipmissing option as well as a macro like
@transform(df, @sm y = :x1 .+ mean(:x2))
Where @sm macro, does the necessary transformation described in #2258
@pdeffebach Yes that was more or less what I had in mind. Though if you have to repeat this for each call it's not much better than passing skipmissing=true
. Being able to apply it to a chain of operations would already be more useful.
I'm not a fan of this idea since it's behavior depending on a global state that could be set a long ways from where the transform call is. It seems like it could cause debugging to be a pain.
@pdeffebach Yes. OTOH it's not so different from e.g. creating a column: if you get an error because it doesn't exist or it's incorrect you have to find where it's been defined, which can be quite far from where the error happens.
I'm not sure I'm convinced by an option at the level of the dataframe. It sounds complicated to keep track of it. For instance, won't people be confused that stuff like df = merge(df, df_using) does not retain the option?
A third possibility would be to allow users to change the default option for transform, etc, say by writing SKIPMISSING = true at the start of a script.
@matthieugomez Yes, losing the option after transformations could be annoying. Though it could be propagated across some operations which always preserve missing values: in these cases there's little point in forcing you to repeat that you know there are missing values. Where safety checks matter is when you could believe you got rid of missing values and for some reason it's not the case. But I admit that this option would decrease safety (even if not propagated automatically) as you could pass a data frame to a function which isn't prepared to handle missing values and it would silently skip them.
A global setting would only aggravate these issues IMHO since it would affect completely unrelated operations, possibly in packages, some of which may rely on the implicit skipmissing=false
.
To throw in a much more crazy idea: could we use contextual dispatch to override the behavior of all reductions and comparisons within a whole expression so that they are "missing-permissive"?
@c42f My suggestion about DataFramesMeta is a kind of limited way to change the behavior of a whole block. Using Cassette.jl (which I guess is what you mean by "contextual dispatch"?) would indeed be a more general approach and it has been mentioned several times. Actually I just tried that and it turns out to be very simple to make it propagate missing values:
using Cassette, Missings
Cassette.@context PassMissingCtx
Cassette.overdub(ctx::PassMissingCtx, f, args...) = passmissing(f)(args...)
# do not lift special functions which already handle missing
for f in (:ismissing, :(==), :isequal, :(===), :&, :|, :⊻)
    @eval begin
        Cassette.overdub(ctx::PassMissingCtx, ::typeof($f), args...) = $f(args...)
    end
end
f(x) = x > 0 ? log(x) : -Inf
julia> Cassette.@overdub PassMissingCtx() f(missing)
missing
julia> Cassette.@overdub PassMissingCtx() f(1)
0.0
julia> Cassette.@overdub PassMissingCtx() ismissing(missing)
true
julia> Cassette.@overdub PassMissingCtx() missing | true
true
Skipping missing values in reductions will be a little harder, but it's doable if we only want to handle a known list of functions. For example this quick hack works:
Cassette.overdub(ctx::PassMissingCtx, ::typeof(sum), x) = sum(skipmissing(x))
julia> x = [1, missing];
julia> Cassette.@overdub PassMissingCtx() sum(x)
1
Maybe this kind of thing could be made simpler to use by providing a macro like @passmissing
as a shorthand for Cassette.@overdub PassMissingCtx()
. But it's important to measure all the implications of this approach: since it will affect all function calls deep in the code that you didn't write yourself, it will have lots of unintended side effects. For example, Cassette.@overdub PassMissingCtx() [missing]
gives missing
, which will break lots of package code. A safer approach would be to only apply passmissing
to a whitelist of functions for which it makes sense (scalar functions, mainly), like DataValues does -- with the drawback that the list is kind of arbitrary.
In the end, maybe Cassette is too powerful for what we actually need in the context of DataFrames. In practice with select
/transform
/combine
it makes sense to only apply passmissing
to the top-level functions, rather than recursively. And for reductions, passing views of complete rows as proposed at https://github.com/JuliaData/DataFrames.jl/issues/2258 should be enough, when combined with a convenient DataFramesMeta syntax.
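Lifting only the top-level function, without recursing into the calls it makes, can be sketched without Cassette. The helper below is a hypothetical Base-only stand-in for what passmissing does; the name lift is invented for the example.

```julia
# Hypothetical helper mirroring passmissing: return `missing` as soon as any
# positional argument is missing, otherwise call `f` unchanged.
lift(f) = (args...) -> any(ismissing, args) ? missing : f(args...)

f(x) = x > 0 ? log(x) : -Inf

lifted_f = lift(f)
r1 = lifted_f(missing)   # `f` is never called, so `missing > 0` is never branched on
r2 = lifted_f(1)

# Containers pass through untouched because only the top-level call is lifted
r3 = lift(length)([missing, 1])
```

Unlike the overdubbed version, a vector that merely contains missing values is handed to the function as-is, which avoids the `[missing]` problem mentioned above.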
My suggestion about DataFramesMeta is a kind of limited way to change the behavior of a whole block. Using Cassette.jl (which I guess is what you mean by "contextual dispatch"?) would indeed be a more general approach and it has been mentioned several times. Actually I just tried that and it turns out to be very simple to make it propagate missing values:
Very cool. IIUC people have started to use other libraries like IRTools which plug into the compiler in a similar way to Cassette but are not Cassette itself, but yes, that's roughly what I had in mind. One downside is that it's a pretty heavy weight tool to deploy for something like missing.
since it will affect all function calls deep in the code that you didn't write yourself, it will have lots of unintended side effects
I think this is the bigger problem; it's not clear how deep to recurse and it could definitely have unintended consequences. It's a very similar problem with floating point rounding modes, where having the rounding mode as dynamic state really doesn't work well as it infects other calculations which weren't programmed to deal with a different rounding mode.
So I agree it does seem much safer and more sensible to have a macro which lowers only the syntax within the immediate expression to be more permissive with missing
. Actually I wonder whether you could do something more like broadcast lowering where all function calls within the expression are lifted in a certain way such that dispatch can be customized, so, eg, @linq f(x,y,z)
becomes linq_missing(f, x, y, z)
.
In terms of your examples,
@linq filter(:col =>(x -> x > 1), df)
# means
linq_missing(filter, :col => (x -> linq_missing(>, x, 1)), df)
@linq combine(gd, :col => (x -> sum(x)))
# means
linq_missing(combine, gd, :col => (x -> linq_missing(sum, x)))
Something like this has very regular lowering rules and gives a measure of extensibility for user defined functions.
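A rough sketch of how such a lowering target could use dispatch, with the lowering done by hand instead of by a macro. The function name linq_missing comes from the comment above; the specific methods are illustrative choices, not an existing API.

```julia
# Hypothetical lowering target: the default method is a plain call, and
# specific methods customize how individual functions treat missing values.
linq_missing(f, args...) = f(args...)
linq_missing(::typeof(>), x, y) = coalesce(x > y, false)   # comparison: missing counts as false
linq_missing(::typeof(sum), x) = sum(skipmissing(x))       # reduction: skip missing values

col = [0, missing, 2, 3]

# Hand-lowered form of `filter(x -> x > 1, col)`
kept = linq_missing(filter, x -> linq_missing(>, x, 1), col)

# Hand-lowered form of `sum(col)`
total = linq_missing(sum, col)
```

User-defined functions would opt in by adding their own linq_missing methods, which is the extensibility point mentioned above.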
One alternative is that filter/transform/combine could always skip missing (i.e. kwarg skipmissing = true
is the default).
Can you please help me understanding this proposal in detail? What should be the outcome of the following operations if we added skipmissing
option (now I am showing what we get now and would like to understand what we should get if we add skipmissing=true
):
julia> df = DataFrame(x=[1,missing,2])
3×1 DataFrame
│ Row │ x │
│ │ Int64? │
├─────┼─────────┤
│ 1 │ 1 │
│ 2 │ missing │
│ 3 │ 2 │
julia> select(df, :x)
3×1 DataFrame
│ Row │ x │
│ │ Int64? │
├─────┼─────────┤
│ 1 │ 1 │
│ 2 │ missing │
│ 3 │ 2 │
julia> select(df, :x => identity)
3×1 DataFrame
│ Row │ x_identity │
│ │ Int64? │
├─────┼────────────┤
│ 1 │ 1 │
│ 2 │ missing │
│ 3 │ 2 │
julia> select(df, :x => ByRow(identity))
3×1 DataFrame
│ Row │ x_identity │
│ │ Int64? │
├─────┼────────────┤
│ 1 │ 1 │
│ 2 │ missing │
│ 3 │ 2 │
julia> combine(df, :x)
3×1 DataFrame
│ Row │ x │
│ │ Int64? │
├─────┼─────────┤
│ 1 │ 1 │
│ 2 │ missing │
│ 3 │ 2 │
julia> combine(df, :x => identity)
3×1 DataFrame
│ Row │ x_identity │
│ │ Int64? │
├─────┼────────────┤
│ 1 │ 1 │
│ 2 │ missing │
│ 3 │ 2 │
julia> combine(df, :x => ByRow(identity))
3×1 DataFrame
│ Row │ x_identity │
│ │ Int64? │
├─────┼────────────┤
│ 1 │ 1 │
│ 2 │ missing │
│ 3 │ 2 │
AFAICT this question regards #2258, better keep matters separate as discussions are already tricky enough.
With select, skipmissing would apply the functions on rows for which none of the arguments is missing, return the output on these rows, and missing otherwise (to maintain the same number of rows as the original dataframe). So skipmissing = true would not change anything in your example. Similarly for transform.
With combine, skipmissing would apply the functions on rows for which none of the arguments is missing.
julia> combine(df, :x)
2×1 DataFrame
│ Row │ x │
│ │ Int64? │
├─────┼─────────┤
│ 1 │ 1 │
│ 2 │ 2 │
julia> combine(df, :x => identity)
2×1 DataFrame
│ Row │ x_identity │
│ │ Int64? │
├─────┼────────────┤
│ 1 │ 1 │
│ 2 │ 2 │
julia> combine(df, :x => ByRow(identity))
2×1 DataFrame
│ Row │ x_identity │
│ │ Int64? │
├─────┼────────────┤
│ 1 │ 1 │
│ 2 │ 2 │
That being said, I am not sure why combine(df, :x) or combine(df, :x => ByRow(.)) are accepted, and maybe it would be clearer to disallow them?
Ah - sorry - I thought this issue superseded that one. So actually this issue should be on hold till we resolve #2258, because it only asks to make #2258 simpler - right?
Could we think about making filter/transform/combine/select have the kwarg skipmissing = true by default? It's not unheard of: that's what Stata and pandas (for combine) do. Also, data.table and dplyr automatically skip missing in their versions of filter.
I would not have a problem with this. We already have it in groupby so it would be consistent. So I assume you want:
1. filter: wrap a predicate p in x -> coalesce(p(x), false)
2. transform/combine/select: we have two decisions:
   1. views or skipmissing wrappers? (given the second question probably views, it is going to be a bit slow, but I think skipmissing=true does not have to be super fast as it is a convenience wrapper)
   2. missing? (this is what groupby does)
Regarding 2, note that as discussed at https://github.com/JuliaData/DataFrames.jl/issues/2258 we have to pass views for select and transform as we need to be able to reassign values to the non-missing rows in the input. When multiple columns are passed, we should only keep complete observations, otherwise something like [:x, :y] => + wouldn't work due to different lengths.
I should have noted that before thinking about making it the default, we should first implement this keyword argument and see how it goes.
@nalimilan Yes, starting with a kwarg and seeing how it goes is the right way. The only reason I was mentioning the default value is that 1.0 may mean that this kind of stuff won't be able to change later on — I hope it's not the case.
@bkamins I agree with 1 and 2.1 (views). I think for the case of [:x, :y] => +
it should skip missing on both (as @nalimilan points out), but for [:x, :y] .=> mean
, or for :x => mean, :y => mean
, it should skipmissing on x
and y
separately.
Yes - for [:x, :y] .=> mean
this is separate (which means that transform
/select
will throw an error in cases when there is no match and a vector is returned).
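The difference between the two cases can be seen on plain vectors. This is only a Base-level sketch of the semantics being agreed on, with made-up data; it does not use the proposed keyword.

```julia
using Statistics

a = [1, missing, 3]
b = [missing, 2, 5]

# `[:a, :b] => +` style: the function sees both columns at once, so only rows
# that are complete in both can be used (here just row 3).
complete = .!ismissing.(a) .& .!ismissing.(b)
rowsum = a[complete] .+ b[complete]

# `[:a, :b] .=> mean` style: each column is reduced on its own, so missing
# values are skipped independently per column.
mean_a = mean(skipmissing(a))
mean_b = mean(skipmissing(b))
```

Note that mean_a uses rows 1 and 3 while mean_b uses rows 2 and 3, which is why this per-column skipping cannot be used when a function needs aligned inputs.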
OK - so it seems we have a consensus here. I will propose a PR.
Just a maintenance question - when we add skipmissing
kwarg should both #2314 and #2258 be closed or something more should be done/kept track of?
That would be awesome, thanks! Just to make sure I understand, what do you mean by "transform/select will throw an error in cases when there is no match and a vector is returned"?
If this happens, https://github.com/JuliaData/DataFrames.jl/issues/2258 should definitely be closed, but this thread should be open if the default value is false, since it may still be cumbersome to add it at every command.
transform/select will throw an error in cases when there is no match and a vector is returned
I mean that this would still error:
julia> df = DataFrame(a=[1,missing,2], b=[1,missing,missing])
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ 1 │ 1 │
│ 2 │ missing │ missing │
│ 3 │ 2 │ missing │
julia> select(df, :a => collect∘skipmissing)
ERROR: ArgumentError: length 2 of vector returned from function #62 is different from number of rows 3 of the source data frame.
(but I guess this is natural to do it this way)
this thread should be open if the default value is false
OK - we can keep it open then. For 1.0 release the default will be false to be non-breaking.
That would be awesome, thanks!
Started thinking about it :). The trickiest part will be fast aggregation functions for GroupedDataFrame
case (as usual), but also they should be doable.
transform/select will throw an error in cases when there is no match and a vector is returned
Just to clarify, I think the following should work:
select(df, :a => collect, skipmissing = true)
and actually also
select(df, :a => collect∘skipmissing , skipmissing = true)
since the function collect∘skipmissing is one to one on the set of values for which a is not missing. In both cases, it just returns the same thing as a
.
However, the following should error
combine(df, :a => collect, :b => collect, skipmissing = true)
because the length of collect∘skipmissing(a)
is different from the length of collect∘skipmissing(b)
Just to clarify, I think the following should work:
select(df, :a => collect, skipmissing = true)
Thank you for commenting on this, as (sorry if I have forgotten some earlier discussions) you want to:
select(df, :a => collect , skipmissing = true)
to work but
select(df, :a => collect∘skipmissing)
will fail.
Which means to me that if we want to add such a kwarg it should not be called skipmissing
as I am afraid it would lead to a confusion. Also I would even say that this kwarg should be reserved for select
and transform
as in combine
you expect a different behaviour.
In particular:
select(df, :a => mean, skipmissing = true)
and
select(df, :a => mean∘skipmissing)
would both work but produce different results.
My proposal above about "spreading" missing values seems relevant here.
select(df, :a => collect, skipmissing = true)
This takes :a, applies skipmissing, collects the result, then loops through indices of :a and, when not missing, fills it in.
select(df, :a => collect∘skipmissing , skipmissing = true)
This takes :a
, applies skipmissing
, then applies it again, collect
s the result, and loops through indices of :a
as before.
select(df, :a => mean, skipmissing = true)
Takes :a
, applies skipmissing
, takes the mean, then loops through and fills in indices of :a
where indices of :a
are not missing. Perhaps we can make an exception for scalar values? If a scalar is returned, it gets spread across the entire vector, regardless of missing values? That way it would match
select(df, :a => mean∘skipmissing; skipmissing = false)
EDIT: After playing around with R, now I'm not so sure. I think the biggest annoyance is with filter
currently and we should start with that since everyone agrees on that behavior and we are confident it won't result in unpredictable / inconsistent behavior.
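The scalar question raised in this comment can be made concrete with a small Base-only sketch. Everything here is hypothetical: apply_skipping and its fill_all switch are invented to show the two candidate behaviors side by side and are not a proposed API.

```julia
using Statistics

# Apply `fun` to the non-missing entries of `a` and write the result back.
# A vector result must line up with the non-missing rows; a scalar result is
# spread either over the non-missing rows or, with fill_all=true, over all rows.
function apply_skipping(fun, a::AbstractVector; fill_all::Bool=false)
    keep = findall(!ismissing, a)
    res = fun(identity.(a[keep]))
    out = Vector{Union{Missing, eltype(res)}}(missing, length(a))
    if res isa AbstractVector
        length(res) == length(keep) ||
            throw(ArgumentError("result length does not match the number of non-missing rows"))
        out[keep] = res
    elseif fill_all
        fill!(out, res)
    else
        out[keep] .= res
    end
    return out
end

a = [1, missing, 2]
v1 = apply_skipping(collect, a)
v2 = apply_skipping(mean, a)
v3 = apply_skipping(mean, a; fill_all=true)
```

v2 leaves row 2 missing, while v3 matches what mean∘skipmissing gives today when its scalar result is broadcast over the whole column.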
Just to clarify, I think the following should work: select(df, :a => collect, skipmissing = true)
Thank you for commenting on this, as (sorry if I have forgotten some earlier discussions) you want to:
select(df, :a => collect , skipmissing = true)
to work but
select(df, :a => collect∘skipmissing)
will fail.
Which means to me that if we want to add such a kwarg it should not be called skipmissing as I am afraid it would lead to a confusion. Also I would even say that this kwarg should be reserved for select and transform as in combine you expect a different behaviour.
In particular:
select(df, :a => mean, skipmissing = true)
and
select(df, :a => mean∘skipmissing)
would both work but produce different results.
Exactly. Even though I don’t like different keyword argument, I see the potential confusion for select/transform.
And I see the uses of both functionalities :smile:, simply they should have a different name. Actually we can have both, i.e. skipmissing
doing "hard" skipmissing
everywhere and something else which passes missings, and the name could be passmissing
as passmissing
in Missing.jl does exactly this kind of thing.
Yes, passmissing would be a more accurate name. OTOH there would be reasons to use skipmissing:
- For combine, since there's no correspondence between input and output rows, we cannot fill non-missing entries, so the only behavior we can implement is skipmissing, not passmissing -- except for the special-case of ByRow which isn't very useful with combine.
- Using consistent names between combine and select/transform sounds like a good thing, even if the behavior differs a bit.
Overall I'm not sure which choice is best. As suggested by @pdeffebach, we could have skipmissing with a special case for functions that return a single scalar, in which case we would fill even rows with only missing values: this is justified because there's no correspondence between input and output rows in that case (like with combine). We could also just throw an error telling users to use mean∘skipmissing for now.
I think I've found a possible solution. We could add a passmissing=true
argument to select
and transform
(as described above). The default to true
would be fine for 99% of use cases and would probably not break a lot of code (if any). It would also be consistent with the approach Julia takes regarding missing values: they either propagate or throw an error, but are never skipped silently by default.
Then we would also add a skipmissing=false
argument to combine
and filter
, consistent with groupby
. People would have to pass skipmissing=true
or use skipmissing(...)
manually. That's a bit annoying, but it's safe as missing values will never be dropped silently even if they were propagated silently due to the passmissing=true
default. Since passmissing=true
by default, users won't have to use that argument often (or never), so they won't be confused by the existence of both passmissing
and skipmissing
.
How does that sound? The main drawback of this approach is the inconsistency between select
/transform
and combine
, but that's probably OK as these are quite different operations anyway.
That’s interesting.
I was leaning on the opposite solution: skipmissing = true default for filter and combine, passmissing = false default for transform.
Skipmissing = true for filter is consistent with other languages (at least stata/dplyr/data.table), and that is what people want in 100% of cases.
Skipmissing = true for combine is consistent with Stata and pandas. It is also very easy to comprehend.
In contrast, passmissing = true in transform/select would be unique to Julia. It means that stuff like :x => ismissing will give missing instead of true for missing values. I still think it is worth having it but it may be harder to grasp conceptually.
So, in conclusion, my preferred solution would be skipmissing = true for combine and filter, and passmissing = true for select/transform. I get your point about compatibility with the spirit of missing in Base but I don’t see it as that important.
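The ismissing surprise is easy to reproduce outside DataFrames. The lift helper below is a hypothetical Base-only stand-in for passmissing, used only to show the effect on a plain vector.

```julia
# Hypothetical stand-in for passmissing: short-circuit to `missing` when any
# argument is missing, otherwise call `f`.
lift(f) = (args...) -> any(ismissing, args) ? missing : f(args...)

x = [1, missing, 3]
plain  = map(ismissing, x)        # the function gets to see the missing value
lifted = map(lift(ismissing), x)  # the missing row never reaches the function
```

With lifting applied, the second entry is missing instead of true, so a column meant to flag missing rows ends up unable to flag them.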
Sorry what is passmissing
? the concept of passmissing
only makes sense to me in ByRow
where you could conceivably have missing
as an input instead of a vector.
@pdeffebach I was using it for the behavior I was describing for select/transform, whether it is actually called passmissing or skipmissing.
Skipmissing = true for combine is consistent with Stata and pandas. It is also very easy to comprehend.
I'm not sure that's true. The problem Milan is describing is
combine(groupby(df, :a), :x => first, :y => first, skipmissing = true)
In Stata you would just get missing
s if the first observation is missing
. But you would be confident that this was truly the first observation for every group.
Under your proposal, the functions would expand to be
combine(groupby(df, :a), :x => (t -> first(skipmissing(t))), :y => (t -> first(skipmissing(t))))
If :x
has missing
as the first element, then you are going to get the second element of :x
and the first element of :y
. This will definitely lead to bugs.
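The misalignment is visible on two plain vectors. A small sketch with invented data, comparing per-column skipping with picking the first row that is complete in both columns:

```julia
x = [missing, 2, 3]
y = [10, 20, 30]

# Per-column skipping: each column drops its own missing values first,
# so the two results can come from different rows.
per_column = (first(skipmissing(x)), first(skipmissing(y)))

# Alternative: find the first row that is complete in both columns
# and read both values from that same row.
i = findfirst(k -> !ismissing(x[k]) && !ismissing(y[k]), eachindex(x))
first_complete = (x[i], y[i])
```

per_column pairs the value from row 2 of x with the value from row 1 of y, whereas first_complete takes both values from row 2.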
That’s right, I did not think about that.
Note that, the way I understand it, first in select, with passmissing = true, would suffer from the same problem (i.e. it would give the first non-missing value).
Can you think of other examples that may give surprising results in combine or select/transform? It would be nice to do a list (beyond first and ismissing).
Whether skipmissing=true takes into account missings in all columns or only in those passed to each function is an interesting design decision to make. But AFAICT it's orthogonal to whether we should use passmissing=true and skipmissing=false by default or not. ~And it doesn't concern passmissing=true, which would not apply to combine~ EDIT: it does if we allow broadcasting a scalar result to all non-missing rows.
The main question for choosing the defaults is whether we favor safety or convenience. I'm always torn between these two goals, which is why I tried to propose in the OP various solutions that don't require silently ignoring missing values by default.
Sorry what is passmissing? the concept of passmissing only makes sense to me in ByRow where you could conceivably have missing as an input instead of a vector.
@pdeffebach It also works for any function passed to select
and transform
: we just pass a view of non-missing rows, and copy the resulting vector to the non-missing rows, filling with missing
other entries. This is what you proposed above.
I think @pdeffebach was just disagreeing with my claim that combine with skipmissing = true is what Stata does (while mean ignores missing values, first gets the first element of the column, whether it’s missing or not)
We could also define the kwarg maskmissing, and use it for filter/combine/transform, rather than using a combination of passmissing/skipmissing kwarg.
The definition of maskmissing is that it applies the operation on rows where columns are non missing. This explanation is consistent with what we want for filter/combine/transform, and so this would allow to get the same kwarg for all of these functions. (of course, for combine, it is exactly the same as skipmissing).
The thing is, people don't really complain about na.rm = TRUE
though, right? I feel like most of the problems we see about missing
are really easily fixed by pointing people to skipmissing
and passmissing
.
combine(gd, :col => (x -> sum(skipmissing(x))))
Is really ugly, yes, but in the future you will be able to write
using DataFramesMeta
@combine(gd, y = sum(skipmissing(col)))
which is not that bad.
For me the problem is that for now it requires different strategies for filter (which requires coalesce) and combine (which requires skipmissing). Another issue with combine is the whole thing about reductions over several vectors with different missing rows, such as weighted mean, correlation etc.
Finally, transform requires sometimes passmissing, sometimes skipmissing, coalesce for ifelse calls, something for tryparse, etc. Also, there's no good way currently to handle functions with vector input and vector output such as cumsum.
That's fair. It would be hard to tackle each of these individually,
- filter should apply coalesce by default.
- skipmissing = true should apply passmissing to ByRow functions
- We need a lazy broadcasted ifelse that skips bad elements
- Better tryparse or a whole destring package which would take a lot of work
Then there's the whole thing about reduction over several vectors with different missing rows such as weighted mean, correlation etc
Also an issue, skipmissings helps weighted mean, but unfortunately cor didn't make it into Statistics.
Okay that's fair. I'm on board with a keyword argument with a variety of behaviors depending on the context.
I will take a defensive stance here.
Before I state what I think I will explain the principles of my judgement:
Given this:
filter
should apply coalesce by default.
I think we can add skipmissing
kwarg with false
as default. This is what Base does. If we can convince Julia Base to change the behaviour here then we will sync with that behaviour.
skipmissing = true should apply passmissing to ByRow functions
Do you mean it for select
/transform
or also for combine
? (also see my comment below)
- We need a lazy broadcasted ifelse that skips bad elements
- Better tryparse or a whole destring package which would take a lot of work
Agreed, but it is out of scope of DataFrames.jl. I would also add x -> coalesce(x, false)
to the list with some convenient name, I think coalescefalse
is explicit but maybe you judge it to be too long. I would add this to Missings.jl.
but unfortunately cor didn't make it into Statistics.
I think it is an open issue how to do it rather than closed issue as there are many ways to compute cor
on a matrix with missings.
(x -> sum(skipmissing(x)))
this is normally written as sum∘skipmissing
and it is not that bad I think.
Now regarding the design, the more I think about it the more I want to do the following, which gives fine-grained control of what is going on in the function in a single transformation:
skipmissing
would not be a kwarg but it would be an extension to the minilanguage we have for select
, transform
and combine
.
If in the form:
source => fun => destination
if you wrap source
in skipmissing
then fun
in combine
, select
and transform
gets view
s of passed columns with rows containing missing
s skipped.
In general the idea is that source => fun∘skipmissing
is the same as skipmissing(source) => fun => destination
if source
is a single column, but we add this extension to allow easy handling of skipping rows in multiple-column scenarios
For filter we do not add anything but just tell users to use coalescefalse∘fun if we agree to add such a function to Missings.jl.
passmissing
It is not needed. Just tell users to write passmissing(their_function)
- this is exactly what you want in ByRow
and if I understand things correctly this is a main use case - is this correct?
I think this is a good proposal.
We might not even need skipmissing
in the meta-language. We can add a skipmissingsfun
wrapper similar to passmissing
that wraps all <:AbstractVector
inputs in skipmissings
. Though since it applies to the source I think a SkipMissing
is more logical.
@bkamins I am trying to understand your proposal:
filter
My comment in the first part was just that I do not want skipmissing=true
by default (no matter what syntax we apply).
My proposal is not to change filter
at all but define coalescefalse(x) = coalesce(x, false)
(the name can be discussed as this one is a bit long) and then just write:
filter(coalescefalse∘fun, df)
The beauty is that it would also work in Julia Base then which is a big benefit.
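Since this only relies on function composition, it can be tried on a plain vector with Base's filter. The name coalescefalse is taken from the proposal above and is not an existing Missings.jl function; the predicate and data are made up.

```julia
# Proposed helper: turn a `missing` predicate result into `false`
coalescefalse(x) = coalesce(x, false)

v = [0, missing, 2, 3]
isbig(x) = x > 1

kept = filter(coalescefalse ∘ isbig, v)

# Without the wrapper the predicate returns `missing` for the second element
# and filter throws because it needs a Bool.
threw = try
    filter(isbig, v)
    false
catch
    true
end
```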
cumsum
I understand you want this:
julia> df = DataFrame(x=[1,2,missing,3])
4×1 DataFrame
│ Row │ x │
│ │ Int64? │
├─────┼─────────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ missing │
│ 4 │ 3 │
julia> transform(df, :x => (x -> cumsum(coalesce.(x, 0)) + x .* 0) => :y)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 3 │
│ 3 │ missing │ missing │
│ 4 │ 3 │ 6 │
Then I agree - I would not handle this in my proposal. I assume that passmissing
is mostly for ByRow
as noted above.
The point is that the use cases like cumsum
I think are quite rare (maybe I am wrong here).
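For reference, the cumsum behavior shown in the table can also be written as a small helper instead of the coalesce trick. This is a sketch with an invented name, cumsum_passmissing, assuming a 1-based vector that contains at least one non-missing value.

```julia
# Hypothetical helper: cumulative sum over the non-missing entries, leaving
# `missing` in place wherever the input was missing.
function cumsum_passmissing(x::AbstractVector)
    out = Vector{Union{Missing, nonmissingtype(eltype(x))}}(missing, length(x))
    keep = findall(!ismissing, x)
    out[keep] = cumsum(identity.(x[keep]))
    return out
end

x = [1, 2, missing, 3]
y = cumsum_passmissing(x)

# Same values as the expression used in the transform call above
y_ref = cumsum(coalesce.(x, 0)) + x .* 0
```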
The general idea is to try using function composition and higher order functions rather than keyword arguments as this seems more flexible for the future.
We can add a skipmissingsfun wrapper similar to passmissing that wraps all <:AbstractVector inputs in skipmissings.
I was thinking about it, and initially judged that this is not possible as fun in source => fun => destination can get three things:
- NamedTuple of vectors
- AbstractDataFrame
But now as I think about it, it is actually not a problem, as this higher order function could just handle these three cases separately in if-else blocks. It is better than my original proposal as it is more composable (select/transform/combine do not even need to know what it does, and this is good because orthogonality in design makes it much easier to maintain in the long term). The question is what name could be used here. The issue is that skipmissing is defined in Julia Base and adding skipmissing(::Base.Callable) would be type piracy (though it is something that is invalid in Base currently; not sure what other name would be good though, maybe dropmissing?).
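A rough sketch of such a higher-order function, under several assumptions: the names `dropmissingfun` and `complete_rows` are hypothetical, method dispatch stands in for the if-else blocks, only positional vectors and a NamedTuple of vectors are handled (the AbstractDataFrame case would need DataFrames.jl and is left out), and all columns are assumed to be 1-based vectors of equal length.

```julia
# Hypothetical helper: flags the rows in which none of the columns is missing.
complete_rows(cols) = [!any(c -> ismissing(c[i]), cols) for i in eachindex(first(cols))]

# Hypothetical wrapper; the name `dropmissingfun` is invented for this sketch.
function dropmissingfun(f)
    # Keep only the complete rows and narrow the element type.
    nonmissing(col, keep) = collect(skipmissing(view(col, keep)))
    # Case 1: columns arrive as positional vectors.
    function wrapped(cols::AbstractVector...)
        keep = complete_rows(cols)
        return f(map(c -> nonmissing(c, keep), cols)...)
    end
    # Case 2: columns arrive as a NamedTuple of vectors.
    function wrapped(nt::NamedTuple)
        keep = complete_rows(values(nt))
        return f(map(c -> nonmissing(c, keep), nt))
    end
    return wrapped
end

# Only the first row is complete in both columns.
weighted_sum = dropmissingfun((x, w) -> sum(x .* w))
positional_result = weighted_sum([1, missing, 3], [2, 5, missing])

# Rows 1 and 3 are complete in both fields.
table_sum = dropmissingfun(nt -> sum(nt.a))
table_result = table_sum((a = [1, missing, 3], b = [1, 2, 3]))
```

Since the wrapper returns an ordinary function, it could be placed in the middle of source => fun => destination without the caller knowing what it does.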
@bkamins pointed me here, but unfortunately this thread is a bit longer than I've currently got time for. The reason I was pointed here was that I asked for comments on a proposal that touches on the topic of this thread: I am actually mostly in favour of a more cumbersome, force-the-user-to-think-carefully-and-be-explicit approach for the most part; my main gripe with missing is when I filter a DataFrame (i.e. use ==, >, <). In those cases, the additional verbosity required to get things to work is quite large, and to me at least the possibility of accidentally doing something unintended is low - I've never encountered a use case where in doing df.col .> 5 I wanted rows for which col is missing to be included in the result.
With these two considerations in mind, in my own code I started defining:
⋘(x, y) = (ismissing(x) | ismissing(y)) ? false : (x < y)
⋙(x, y) = (ismissing(x) | ismissing(y)) ? false : (x > y)
⩸(x, y) = (ismissing(x) | ismissing(y)) ? false : (x == y)
I clearly didn't spend much time choosing the operators, I just wanted infix operators that look similar to but different from the regular comparison operators (one issue with my current selection is that their names are completely different: \verymuchless, \ggg and \equivDD aren't exactly easy to memorize as a group of operators...)
In any case this relatively simple case for me solves the main usability issue I encounter in day-to-day work.
@bkamins you mentioned there might be a way to have ifelse, filter work with missing values in base. You really think so? Maybe that’s one way to go.
I meant that this issue could be opened in Julia Base and discussed there. I feel that if it got support it would land in the 2.0 release at the earliest. Therefore I think we should have a solution for now in DataFrames.jl that does not rely on Julia Base.
(still - if you feel it is worth discussing please open an issue on Julia Base; I just recommend to be patient - core devs tend to be conservative and require quite a lot of justification for changes)
EDIT: Actually, given our discussions, apart from ifelse and filter the third big thing is getindex with a Union{Missing, Bool} eltype selector.
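For illustration, this is how the workaround looks today with plain vectors in Base; the variable names and sample data are invented. Indexing directly with a mask that contains `missing` raises an error, so the mask is coalesced first.

```julia
data = [10, 20, 30]
mask = [true, missing, false]   # a Vector{Union{Missing, Bool}} selector

# Treat `missing` in the selector as `false` before using it.
clean_mask = coalesce.(mask, false)

selected = data[clean_mask]                 # getindex with a Bool mask
labels = ifelse.(clean_mask, "yes", "no")   # ifelse with a Bool condition
```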
I agree that it would also be OK to keep DataFrames as a relatively low-level package which doesn't offer the most convenient syntax, as long as it provides the building blocks to do anything. Anyway it's clear that the "mini-language" will never be as intuitive as DataFramesMeta's macros. So maybe we should try to see at the same time how we want to make working with missing values easy in DataFramesMeta, and see what we need in DataFrames to allow implementing it.
Adding skipmissing(f::Callable) to Base sounds doable. I'm less sure about filter and getindex skipping missing values, as this goes against what happens everywhere else (missing values are never dropped silently). But we don't need that for DataFrames/DataFramesMeta AFAICT.
The difficulty with finding a good solution in DataFramesMeta is that the equivalent implementations like dplyr rely on the fact that functions are vectorized and/or propagate missing values in a totally ad-hoc way. For example, R/dplyr allows mutate(df, y = isna(x) ? "0" : as.character(x)), where as.character propagates missing values, but isna does not. R/dplyr also allows mutate(df, y = x - mean(x)) since - is vectorized, but we require .- in Julia, which is OK here but can get quite unwieldy with complex commands. The fact that Julia is much more systematic is a strength for advanced users, but it can also be a barrier for more basic use, as R usually does "the right thing" (but when it doesn't it's a pain).
Maybe we could handle this difficulty by having all macros in DataFramesMeta operate row-wise by default, and require a special syntax to indicate that a column should be accessed as a vector. For example, @transform(df, y = x - mean($x)). Then it seems to me that the most intuitive behavior is this:
- all columns accessed as vectors (with $) are wrapped in a view of non-missing values by default (i.e. skipmissing=true)
- if some columns are accessed row-wise (no $), set passmissing=true (since we are in a case equivalent to ByRow)
- if a function applied to a column vector (accessed using $) returns a vector (with its length equal to the number of non-missing rows) rather than a scalar, set passmissing=true
AFAICT this would do "the right thing" in the following situations:
- @transform(df, y = mean($x)): assign to all rows (including those with missing values) the mean of non-missing values
- @transform(df, y = x - mean($x)): subtract from each non-missing row the mean of non-missing values
- @transform(df, y = cumsum($x)): assign to rows with non-missing values the cumulative sum of values for these rows
- @transform(df, y = s - cumsum($x)): subtract from each non-missing row the cumulative sum of values for these rows
Cases that wouldn't do "the right thing" by default are uses of ismissing (as missing values would always be skipped by default). We could decide to be smart and automatically set passmissing=false or skipmissing=false when an expression contains ismissing, or just throw an error with an informative message. A trickier case is logical operators, which implement three-valued logic: with passmissing=true, true | missing would be assigned missing since | would not actually be called. Maybe that's not a big deal.
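To make the intended results concrete, here is a hand-written emulation in plain Julia of what the y = x - mean($x) and y = cumsum($x) cases would produce under these rules. This is only a sketch of the semantics, not how DataFramesMeta would implement it; the variable names and data are invented.

```julia
x = [1.0, missing, 3.0]

# Vector-style access: the reduction sees only the non-missing values.
x_mean = sum(skipmissing(x)) / count(!ismissing, x)

# Row-wise access with passmissing-like behaviour: missing rows stay missing
# and the subtraction is never called for them.
centered = [ismissing(xi) ? missing : xi - x_mean for xi in x]

# A vector result computed on the non-missing rows is written back to those rows.
keep = findall(!ismissing, x)
running = Vector{Union{Missing, Float64}}(missing, length(x))
running[keep] = cumsum(collect(skipmissing(x)))
```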
I was hoping that DataFramesMeta would just be DataFrames + a macro to create Pairs expressions (e.g. DataFramesMacros), but maybe that's the wrong way to look at it. My concern is that having a barebone DataFrames effectively creates 3 different syntaxes: the simple df.a syntax, the transform syntax, and the @transform syntax.
I need to think more about these proposals. In the meantime, I just wanted to point out that Stata has two different commands for row-wise vs. column-wise transforms (gen vs. egen). That is something to think about if it would simplify things, e.g. there could be a row version of transform called, say, alter.
@nalimilan - thank you for commenting on this. I guess it is best to move the discussion to DataFramesMeta.jl. My questions would be twofold:
For s - cumsum($x), I was not clear what would happen in the scenario s=[missing, missing, 1] and x=[1,2,3].
@matthieugomez - I agree that we could have DataFramesMacros.jl (when it matures) be just an addition to DataFrames.jl and this should work. But for DataFramesMeta.jl I accept it adds new things (just as e.g. queryverse adds) - the user simply has to choose what one wants to use.
However, I feel that with proper composition/higher order functions we can get quite far. E.g. if we find short forms of what @nilshg proposes we can already solve most of the problems. I was thinking about it and even defining normal functions with names lt, leq, gt, geq, eq defined like:
lt(x,y) = coalesce(x<y, false)
would be super convenient. I would propose to have them in Missings.jl. We would take hold of 5 very short names (bad thing) but would get flexibility in a place where it is much needed; also I think writing lt(x,y) is not that much worse than x < y. If we had these five functions I think that mostly we do not need coalescefalse, as in other cases the user can just write x -> coalesce(x, false) since they will be rare anyway.
EDIT: if we went for this also neq
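Spelled out, the family could look like the sketch below. Only lt is given verbatim in the comment; the other definitions follow the same pattern and are an assumption, as is including neq from the EDIT. None of these functions exist in Missings.jl as far as this thread states.

```julia
# Comparison helpers that map a `missing` result to `false` (proposed names).
lt(x, y)  = coalesce(x < y, false)
leq(x, y) = coalesce(x <= y, false)
gt(x, y)  = coalesce(x > y, false)
geq(x, y) = coalesce(x >= y, false)
eq(x, y)  = coalesce(x == y, false)
neq(x, y) = coalesce(x != y, false)

col = [1, missing, 7, 5]

# Broadcasting gives a plain Bool mask that can be used for indexing.
mask = gt.(col, 4)
selected = col[mask]
```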
I agree with @matthieugomez; especially with upcoming multithreading in transform, I think it would be very hard to maintain feature parity if the two packages diverged too much.
I think @nalimilan's proposal is good, but is not functionally different from a DropMissing([:x1, :x2]) => fun syntax that could go in the data frames mini-language.
I do agree with Milan that ByRow being the default would solve a lot of these problems.
I think a very smart macro could fix this easier.
@missingfalse x > y
this is clearly doable as it is the same as:
coalesce(x>y, false)
but just a bit longer to write :smile:
My point is (and this is how I understood what @nilshg wanted) to have something that is easy to type. And I feel that writing e.g. lt.(df.col1, df.col2) is not much worse than df.col1 .< df.col2 and much better than @missingfalse df.col1 .< df.col2 or coalesce.(df.col1 .< df.col2, false).
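For comparison, a possible definition of the macro discussed here. Only the name @missingfalse comes from the thread; the body is an assumption. It broadcasts coalesce so that the same macro covers both the scalar form and the elementwise form.

```julia
# Hypothetical macro: wraps an expression so that `missing` results become `false`.
macro missingfalse(ex)
    # `coalesce.` works elementwise on arrays and leaves scalars as scalars.
    return :(coalesce.($(esc(ex)), false))
end

a = missing
scalar_result = @missingfalse a > 1

xs = [1, missing, 3]
vector_result = @missingfalse xs .> 2
```

The call site stays longer than lt.(xs, 2), which is the trade-off this comment points at.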
It seems that dealing with missing values is one of the most painful issues we have, which goes against the very powerful and convenient DataFrames API. Having to write things like filter(:col => x -> coalesce(x > 1, false), df) or combine(gd, :col => (x -> sum(skipmissing(x)))) isn't ideal. One proposal to alleviate this is https://github.com/JuliaData/DataFrames.jl/issues/2258: add a skipmissing argument to functions like filter, select, transform and combine to unify the way one can skip missing values, instead of having to use different syntaxes which are hard to grasp for newcomers and make the code more complex to read.
That would be one step towards being more user-friendly, but one would still have to repeat skipmissing=true all the time when dealing with missing values. I figured two solutions could be considered to improve this:
- Have DataFramesMeta simplify the handling of missing values. This could be via a dedicated e.g. @linqskipmissing macro or a statement like skipmissing within a @linq block that would automatically pass skipmissing=true to all subsequent operations in a chain or block. This wouldn't really help with operations outside such blocks though.
- Have a field in DataFrame objects that would store the default value to use for the skipmissing argument. By default it would be false, so that you get the current behavior, which ensures safety via propagation of missing values. But when you know you are working with a data set with missing values, you would be able to call skipmissing!(df, true) once and then avoid repeating it.
Somewhat similar discussions have happened a long time ago (but at the array rather than the data frame level) at https://github.com/JuliaStats/DataArrays.jl/issues/39. I think it's fair to say that we now have enough experience to make a decision. One argument against implementing this at the DataFrame level is that it will have no effect on operations applied directly to column vectors, like sum(df.col). But that's better than nothing.
Cc: @bkamins, @matthieugomez, @pdeffebach, @mkborregaard