Open nalimilan opened 4 years ago
I just wanted to point out that Stata has two different commands for row-wise vs, column-wise transforms (gen v.s. egen). That is something to think about if that would simplify thing, e.g. there could be a row version of
transform
, called, say,alter
.
@matthieugomez Indeed. My proposal is basically a way to combine Stata's gen
and egen
in a single expression, using $
to distinguish them. Anyway I think I would prefer specifying something like ByRow
in @transform
/@select
(whatever the exact syntax) than a separate function, as it makes it easier to understand the API than adding functions with totally different names.
Not allowing mixing column-wise and row-wise operation like in my proposal would certainly be simpler. The main drawback would be that one would have to use broadcasted operations, e.g. x .- mean(x)
rather than x - mean($x)
. Maybe that's OK but that increases the complexity.
* how would you turn-off this behaviour in DataFramesMeta.jl if it is not wanted
@bkamins I think you could just pass passmissing
and skipmissing
arguments to the macro, a bit like we discussed for functions.
* in `s - cumsum($x)` I was not clear what would happen in the scenario `s=[missing, missing, 1]` and `x=[1,2,3]`.
cumsum($x)
would become cumsum([1]) == [1]
, and then we would call cs = cumsum([1]); x -> s - cs
on each non-missing row (i.e. the last one with x == 3
), filling the first two rows with missing
.
However, I feel that with proper composition/higher order functions we can get quite far. E.g. if we find short forms of what @nilshg proposes we can already solve most of the problems. I was thinking about it and even defining normal functions with names
lt
,leq
,gt
,geq
,eq
I must say I'm not a fan of ad-hoc solutions like that. This would solve some of the issues, but not all of them and people will have to learn both the custom functions and the general mechanism using passmissing
. Also we would lose the nice x < y
syntax, which is really lousy for a high-level language. A macro like @missingfalse
would be better, though it's still a bit verbose for DataFramesMeta I would say.
I agree with @matthieugomez, especially with upcoming multithreading in
transform
I think it would be very hard to maintain feature parity if the two packages diverged too much.
@pdeffebach Yes that's a concern. Everything DataFramesMeta does should be translatable relatively directly to DataFrames commands so that we don't implement things twice.
I think @nalimilan's proposal is good, but is not functionally different from a
DropMissing([:x1, :x2]) => fun
syntax that could go in the data frames mini-language.I do agree with Milan that
ByRow
being the default would solve a lot of these problems.
Actually I think my proposal is a bit different, as you need to run first column-wise operations and then insert them in the row-wise function. I'm not sure how it could be translated into the select
/transform
syntax...
OK - given these comments can you explain why you think:
coalesce(x>y, false)
is bad?
I don't think it's bad as a low-level way of doing things, as it always works and is clear about what it does. I just think for DataFramesMeta we need a general solution which allows you to write the code as if there weren't any missing values (possibly adding an argument/macro to enable that behavior), so that you don't need to constantly adjust the code using various strategies which are hard to remember.
OK - for DataFramesMeta.jl this is fine. But then what for DataFrames.jl?
We started a discussion with adding skipmissing
and passmissing
keyword arguments. For me - from the discussion - it turns out that other options like coalesce(x>y, false)
are not that much longer and they are still clear what is going on as you noted. So is then your suggestion to leave things in DataFrames.jl as they are and just improve the documentation to teach users properly use:
skipmissing
pasmissing
coalesce
?
(maybe just adding the skipmissingfun
wrapper - probably with a shorter name - as @pdeffebach proposed as it cannot be handled using the options above when you pass multiple columns in source => fun => destination
)
proposed as it cannot be handled using the options above when you pass multiple columns in
source => fun => destination
)
It can! I wrote a PR for that in Missings that does this.
I know this PR, but it does not do what we need in DataFrames.jl:
get three things:
- a list of vectors as positional arguments
- a NamedTuple of vectors
- an AbstractDataFrame
and that solution does not handle them all unfortunately (otherwise this is exactly what we need :+1: )
What we could do is:
NamedTuple
of vectors support in Missings.jl - @pdeffebach => would you agree to add it thereAbstractDataFrame
support in DataFrames.jl - as this would not be a type piracy => @pdeffebach => also you are then welcome to add this method (it should be just a view
of this data frame removing rows with missings)OK, so given also a short discussion with @pdeffebach on Slack:
skipmissings
and make a release of Missings.jl - I understand that @pdeffebach would start working on this. It would be good to do it relatively quickly, so that we can have skipmissings
added also to DataFrames.jl in 0.22 release.SkipMissing
wrapper of the form SkipMissing(source) => fun => destination
can be added that will do the following:
source
columnsskipmissings
to themskipmissings
source
leaving missing
values in rows that were skipped by skipmissings
The SkipMissing
wrapper would be supported only in select
and transform
and if GroupedDataFrame
was passed to it it would not employ the fast path (but I guess this is most likely not a use case for fast path anyway - right?).
Sounds like a good idea for DataFrames and probably for Tables.jl in general.
For DataFramesMeta, here's a simpler proposal than the one I made above, and which should be easy to implement using DataFrames functions. @transform
/@select
/@combine
would be either row-wise or column-wise, with some syntax like ByRow
/ByCol
, @byrow
/@bycol
or byrow=true
/bycol=true
to switch between these (not sure which one should be the default, that's a separate discussion). In some cases this is a bit annoying as you need to use broadcasting explicitly when operating column-wise, e.g. in @filter(df, x .> mean(x))
or @transform(df, y = x .- mean(x))
but maybe it's not the end of the world as these are not the most common operations.
Missing values would be propagated or skipped by default. For row-wise operation, (the equivalent of) passmissing=true
would be the default (and we could throw an error if ismissing
is used to help users). For column-wise operation, skipmissing=true
would be the default; additionally for @select
/@transform
, if the function returns a vector (of the same length as its input), the passmissing=true
behavior would be used. This will allow "doing the right thing" for both y = mean(x)
and y = x .- mean(x)
. So in practice ByRow(f)
and x -> f.(x)
would be equivalent, which is an important property. One would be able to pass passmissing
and skipmissing
manually as arguments to override the default. This would be needed notably when recoding missing values, using ismissing
or things like replace(x, missing => 0)
(which is actually the main drawback of propagating missing values).
Does that sound reasonable?
Independent of this a
SkipMissing
wrapper of the formSkipMissing(source) => fun => destination
can be added that will do the following:1. take `source` columns 2. apply `skipmissings` to them 3. perform computation and pseudo broadcasting into length of the input after performing `skipmissings` 4. finally map the result of step 3 into original length of `source` leaving `missing` values in rows that were skipped by `skipmissings`
@bkamins What you describe sounds useful in general, and in particular for my last DataFramesMeta proposal (though note I suggest filling all rows and not just non-missing rows when the function returns a scalar). But I think we should find another name for this, as I would expect SkipMissing(f)
to be equivalent to f ∘ skipmissing
. So this is unlikely to be accepted in Base. SkipAndFillMissing
would be accurate but a bit verbose. SkipFillMissing
? Or just FillMissing
?
The
SkipMissing
wrapper would be supported only inselect
andtransform
and ifGroupedDataFrame
was passed to it it would not employ the fast path (but I guess this is most likely not a use case for fast path anyway - right?).
The fast path only supports single-row reductions anyway, right? So that doesn't apply here.
@nalimilan - thank you for the comments.
I would expect
SkipMissing(f)
to be equivalent tof ∘ skipmissing
The point is that @pdeffebach proposed to have SkipMissing
not on fun
but on source
, so you would write SkipMissing([:x1, :x2]) => cor => :cor
for instance, and this allows us to define it more flexibly (but I agree we could also have a better name if we can think of any)
I suggest filling all rows and not just non-missing rows when the function returns a scalar
I was afraid that it would be confusing that returning a vector and returning a scalar behave differently. But maybe this is OK (on purpose I excluded combine
from the list as in that case we would hit many more problems, with select
and transform
things are easier).
I think we need to decide here which operation is more useful: your proposal or my proposal. I feel that my proposal gives a more consistent design, but maybe the usefulness of your proposal outweighs this shortcoming (but let us deeply discuss this, not to go back to it later - like we have recurring discussions if select
should be by row or by whole column by default - as when we make a decision it will be fixed for a long time - 😄).
In particular - given your semantics how would you achieve my result (i.e. fill only non-missing with the aggregate)?
The fast path only supports single-row reductions anyway, right? So that doesn't apply here.
If we accept your proposal (i.e. fill scalars everywhere) then there is no difference. If we accept my proposal (i.e. broadcast scalar only to non-missing slots) then there is a difference between:
select(df, :x , :y => sum ∘ skipmissing)
and
select(df, :x , SkipMissing(:y) => sum)
The point is that @pdeffebach proposed to have
SkipMissing
not onfun
but onsource
, so you would writeSkipMissing([:x1, :x2]) => cor => :cor
for instance, and this allows us to define it more flexibly (but I agree we could also have a better name if we can think of any)
OK. The problem is, SkipMissing(cols) => fun
already has a meaning, i.e. apply fun
to columns in vector cols
, excluding missing values if it contains some. Granted, it's a quite unlikely usage, but still...
In particular - given your semantics how would you achieve my result (i.e. fill only non-missing with the aggregate)?
In DataFramesMeta, you would just pass passmissing=true
explicitly. In DataFrames, you would use skipmissing
or skipmissings
manually instead of using SkipMissing(cols) => fun
/SkipFillMissing(cols) => fun
/whatever. In both, you could also return fill(value, length(x))
instead of just value
.
I agree this is an important point to settle. I suspect that in practice when doing reductions rows with missing values should get the same value as others. This is the case for example if you want to add a variable containing group means. That's what Stata's egen
does AFAICT (right @matthieugomez?). Can we find examples where we would like the opposite behavior (i.e. rows with missings are filled with missing
instead of with the returned scalar)?
SkipMissing(cols)
Ah - I have forgotten that it is in Base, but it is just unexported. Then yes - we should not change the meaning of this name.
In both, you could also return
fill(value, length(x))
instead of just value.
Fair point, except that in ByRow
you do not have an access to length(x)
, especially in GroupDataFrame
context. (but you can get it using other means anyway so it would be OK)
Can we find examples where we would like the opposite behavior
Thinking of it - you are probably right. As later if you use this reduction with missing
in the original column it will become missing
anyway.
The only reduction I can think of is:
transform(df, SkipMissing(:x) => x -> true)
where we get true
indicator for non-missing, but of course it does not make much sense to do such a transformation.
In Stata, mean of a variables creates a variable equal to the mean on non-missing rows, and missing values otherwise. Edit: no sorry it creases the mean for every rows.
I have also realized that, one issue with passmissing for functions that return vectors (as proposed by @nalimilan), is that lag will not work correctly.
is that lag will not work correctly.
can you please give an example what you exactly mean?
@matthieugomez I was trying to write a post comparing
gen y = mean(x)
gey y = mean(x) if !missing(x)
But I don't have a Stata installation at the moment. Can you confirm when the mean
is different?
I think the proposed behavior is fairly close to idiomatic stata. Enough that you can translate one-for-one, but I want to make sure .
@nalimilan Yes in the second case the mean is only given on rows that are not missing.
@bkamins sorry: I expect lag([1, missing]) to return [missing, 1]. I think with @nalimilan’s proposal and passmissing = true within a transform call, it would return [missing, missing]
sorry: I expect lag([1, missing]) to return [missing, 1]. I think with @nalimilan’s proposal and passmissing = true within a transform call, it would return [missing, missing]
That's fine, though. lag
doesn't error with missing
values so there is no need to do it (assuming we don't make DropMissing
the default, which I don't think we should yet).
lag
s are hard anyways, you have to remember grouping etc.
Yes, lag
/lead
/diff
are another case that would require opting-out explicitly of missing values propagation (if we make it the default). If we think this kind of situation (in addition to ismissing
, etc.) is too common or too confusing, we could avoid it by default and require people to use skipmissing=true
/ passmissing=true
explicitly. Anyway we could first try this system and later see whether it should be the default.
But for simplicity I think it would be better to have a single argument called skipmissing
that would do both, since we concluded that there are no major use cases in which passmissing
and skipmissing
would need to be separate. For row-wise operation, skipmissing
somewhat makes sense as a way to say "only apply the operation to non-missing rows (and fill others with missing
)". It's a slight abuse but probably OK. For column-wise operation, skipmissing
would be exactly what happens to input columns, and the passmissing
behavior would just be a consequence of that: if the function returns a single scalar, we can fill all rows, including missing ones (passmissing=false
in our terminology), but if it returns a vector of the same length as its input, we can only fill non-missing rows (passmissing=true
in our terminology).
(if we make it the default)
I understand you are talking about DataFramesMeta.jl here - right. In DataFrames.jl I do not think we will make it a default as it would be breaking.
But for simplicity I think it would be better to have a single argument called
Again - I understand you mean it for DataFramesMeta.jl - right? As for data frames, we want to have a wrapper around column selector in source => fun => destination
(we need a name as we cannot use SkipMissing
).
Regarding the discussion #2484.
I think that where
should have the following specification (I leave out some sanity checks but give a simplified idea of the impementation):
where(df::DataFrame, args...) = df[isequal.(.&(eachcol(select(df, args..., copycols=false))), true), :]
where(gdf::GroupedDataFrame, args...) = parent(df)[isequal.(.&(eachcol(select(gdf, args..., copycols=false, keepkeys=false))), true), :]
An element of args
can only be of the form cols => fun
and we transform it to cols => fun => x_i
(to make sure we have unique output column names).
Regarding the comment by @nalimilan:
One potential issue is that having
where
skipmissing
values but other functions likecombine
,select
andtransform
not skip them
I do not propose to skip missing values. What I propose is to use isequal
not ==
for testing (and this is a logically valid rule). Actually in my codes I normally use isequal
because of this for logical conditions as a rule.
We could add a errror_on_missing
(or something similar) kwarg to where
to choose if we use isequal
or ==
to do the test (but I would still make it default to false
, i.e. use isequal
by default)
You need gdf
above, as transformations are on the group level when given a grouped data frame above.
I think we can use isequal
without addressing transform
and select
. We aren't skipping missing during the transformation, just when we choose which rows to keep.
I don't think we are ready to address transform
and select
yet. There hasn't been consensus on behavior. better to implement something in Missings.jl here and give users a chance to see if they like it.
You need
gdf
above
fixed, it was a typo
+1. It would also be nice to have a where!
version, as well as a view
kwarg for where
.
Last thing is the name. I like where
. The only issue is that it is used
for special syntax in Julia. Is that a problem? In particular, I think it
creates some issues with using Lazy’s macro @>
, but maybe it can be
fixed.
On Mon, Oct 19, 2020 at 6:27 AM Bogumił Kamiński notifications@github.com wrote:
You need gdf above
fixed, it was a typo
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaData/DataFrames.jl/issues/2314#issuecomment-712156073, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPPPXPIIASGWALSXHZIEZ3SLQ5FXANCNFSM4OTTLDUA .
It would also be nice to have a
where!
version, as well as aview
kwarg forwhere
.
Sure - we would mirror that - thank you for raising this, as it is easy to forget.
I like
where
I think if we like it (and I guess we do not have a better name) then we should try to make sure that the "ecosystem" works with it. For sure this is not a problem for DataFramesMeta.jl. @pdeffebach - have you experimented with this name in DataFramesMeta.jl.
where
works with linq
, but the Lazy pipings all work with @where
so there hasn't been a conflict.
I do not propose to skip missing values. What I propose is to use
isequal
not==
for testing (and this is a logically valid rule). Actually in my codes I normally useisequal
because of this for logical conditions as a rule.We could add a
errror_on_missing
(or something similar) kwarg towhere
to choose if we useisequal
or==
to do the test (but I would still make it default tofalse
, i.e. useisequal
by default)
Well I'd say you're playing with words. :-) Selecting rows for with isequal(x, true)
is really skipping rows for which x
is missing
. And I would be inclined to call the argument skipmissing=true
instead of errror_on_missing=false
(that would be more standard given our terminology).
TBH I'm not really sure I'm opposed to doing that, but I feel it's a bit inconsistent to consider that where
shouldn't throw an error in the presence of missing values (when you don't handle them manually), but that combine
/select
/transform
should. Of course it's much simpler and obvious what to do to handle missing values with where
, so it's easier and less risky to implement. But in both cases 1) you lose some safety if you didn't expect missing values to be present and they happen to be there, and 2) it's still painful to work with combine
/select
/transform
in the presence of missing values.
Is not it what dplyr does already (ie skip missing in filter but not in mutate)?
On Mon, Oct 19, 2020 at 1:13 PM Milan Bouchet-Valat < notifications@github.com> wrote:
I do not propose to skip missing values. What I propose is to use isequal not == for testing (and this is a logically valid rule). Actually in my codes I normally use isequal because of this for logical conditions as a rule.
We could add a errror_on_missing (or something similar) kwarg to where to choose if we use isequal or == to do the test (but I would still make it default to false, i.e. use isequal by default)
Well I'd say you're playing with words. :-) Selecting rows for with isequal(x, true) is really skipping rows for which x is missing. And I would be inclined to call the argument skipmissing=true instead of errror_on_missing=false (that would be more standard given our terminology).
TBH I'm not really sure I'm opposed to doing that, but I feel it's a bit inconsistent to consider that where shouldn't throw an error in the presence of missing values (when you don't handle them manually), but that combine/select/transform should. Of course it's much simpler and obvious what to do to handle missing values with where, so it's easier and less risky to implement. But in both cases 1) you lose some safety if you didn't expect missing values to be present and they happen to be there, and 2) it's still painful to work with combine/select/transform in the presence of missing values.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaData/DataFrames.jl/issues/2314#issuecomment-712415696, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPPPXNZOOAMIVCDP2VM3TLSLSMX3ANCNFSM4OTTLDUA .
Well I'd say you're playing with words. :-)
The point is that we are not skipping missing in the sense how e.g. select
would skip them. As for select
skipmissing
means REMOVE FROM THE COLLECTION or IGNORE THEM while in where
it is TREAT THEM AS FALSE. And this is the point of my reasoning.
This is a different operation that we perform (skipping missings in the sense of select
in where
would make no sense).
skipmissing=true
For the above reason I have not proposed skipmissing
kwarg but some other name (of course I am not attached to the proposed one as it is not nice obviously).
My point is that the decision what to do in select
and how to implement where
are logically orthogonal although they have very similar end effect.
Regarding where
for GroupedDataFrame
we have two options:
parent
(i.e. it would internally rely on select
; in this scenario subsetting groups in GroupedDataFrame
is not allowed)combine
; in this scenario subsetting groups in GroupedDataFrame
is allowed)Which one do we prefer?
select
then filter
.
It seems that dealing with missing values is one of the most painful issues we have, which goes against the very powerful and convenient DataFrames API. Having to write things like
filter(:col => x -> coalesce(x > 1, false), df)
orcombine(gd, :col => (x -> sum(skipmissing(x)))
isn't ideal. One proposal to alleviate this is https://github.com/JuliaData/DataFrames.jl/issues/2258: add askipmissing
argument to functions likefilter
,select
,transform
andcombine
to unify the way one can skip missing values, instead of having to use different syntaxes which are hard to grasp for newcomers and make the code more complex to read.That would be one step towards being more user-friendly, but one would still have to repeat
skipmissing=true
all the time when dealing with missing values. I figured two solutions could be considered to improve this:@linqskipmissing
macro or a statement likeskipmissing
within a@linq
block that would automatically passskipmissing=true
to all subsequent operations in a chain or block. This wouldn't really help with operations outside such blocks though.DataFrame
objects that would store the default value to use for theskipmissing
argument. By default it would befalse
, so that you get the current behavior, which ensures safety via propagation of missing values. But when you know you are working with a data set with missing values, you would be able to callskipmissing!(df, true)
once and then avoid repeating it.Somewhat similar discussions have happened a long time ago (but at the array rather than the data frame level) at https://github.com/JuliaStats/DataArrays.jl/issues/39. I think it's fair to say that we know have enough experience now to make a decision. One argument against implementing this at the
DataFrame
level is that it will have no effect on operations applied directly to column vectors, likesum(df.col)
. But that's better than nothing.Cc: @bkamins, @matthieugomez, @pdeffebach, @mkborregaard