Closed pdeffebach closed 4 years ago
One issue is that All()
is used to select all columns, I think Union()
would be very confusing in that sense (one would expect the empty set with Union()
).
We could deprecate that in favor of Union(:)
. This is just an idea, I'm not wedded to it but think it might be helpful.
I agree that Union
is a better name. Actually we had a bug in DataFrames.jl because All()
is actually All(:)
that was recently fixed. However, Union
is not the best name as it is used by Base, so probably All
should stay.
Thoughts on Cols
? It can be thought of as a container for Columns
.
Cols(Not(:id:), r"^t", [:income, :consumption])
Makes it pretty clear that we are building a container of columns.
I think that Cols
is the best choice. The docstring for All
is actually incorrect:
All(cols...) Select the union of the selections in cols.
All
creates a tuple of sets of columns, not the union of the sets. For example,
All(:x, :y) != All(:y, :x)
All(:y, :) != All(:)
In DataFrames.jl the ordering produced by All
is used to choose the order of the columns when indexing.
The actual behavior of All
is this:
A
, B
, and C
be sets of columns.All(A, B, C)
creates a tuple t = (A, B\A, C\(A ∪ B))
.df[:, All(A, B, C)]
becomes df[:, vcat(t...)]
.I think this behavior is much better captured by the Cols
name. Then Cols()
would select zero columns and Cols(:)
would select every column.
Tagging @bkamins, since this would be a breaking change.
EDIT:
Of course maybe the docstring could just be updated. The lack of context of the word All
shouldn't generally be a problem, since it pretty much always appears inside an indexing or select
expression. And if All
is changed to Cols
, then it might make sense to change Not
to NotCols
, because Not
can occur without an All
/Cols
and is subject to the same criticism of "not what?".
Here's another idea. Use Cols
instead of All
and write a Base.:!(::Cols)
method to replace Not
. Example:
df = DataFrame(r=1, x1=2, x2=3, y=4)
df[:, Cols(!Cols(r"x"), :)]
I've also thought that it might make sense to just use a literal tuple instead of All
. However, I think if I saw df[:, (:x, :x)]
, then I would expect column :x
to be repeated.
@piever - what do you think about it? I think that in DataFrames.jl it is not a problem to make this change now if it were preferred (but after 1.0 release this will be a problem), but this is a DataAPI.jl change that would have a larger impact.
CC @nalimilan
Cols
isn't bad, and it would make more sense with broadcasting, e.g. Cols(Not(:x)) .=> sum
(since Not(:x) .=> sum
cannot work). But should Cols()
still be equivalent to Cols(:)
? That's less explicit than All()
, though probably not the end of the world.
@CameronBieganek Not
exists outside of DataAPI and can be used with any array, so I think we should keep that name.
If we go for Cols
then I think Cols()
should not select any columns.
Now as I think of it in general - actually if JuliaDB.jl wants to keep All
it is OK. We can just allow Cols
as a new selector here and use it in DataFrames.jl. Then there is another question if we deprecate All
or keep it as a legacy.
I confess my main gripe with the status quo is that All()
meaning all columns and All(args...)
meaning more or less the union is a bit inconsistent (empty union should be no columns), but also, entirely my fault :) https://github.com/JuliaComputing/IndexedTables.jl/pull/123. Before, it used to be called Join
.
These references may be relevant: https://github.com/JuliaComputing/IndexedTables.jl/issues/112 and https://github.com/JuliaComputing/IndexedTables.jl/pull/119. There is a bit of a discussion comparing with special selectors from dplyr.
For me to understand, the behavior would remain unchanged other than for Cols()
, so say Cols([:x, :y], :z)
would be the same as Cols(:x, :y, :z)
? From the point of view of JuliaDB that's an important use: the main use of All
selector is "unnesting", because (Not(:x), :y)
would get a nested thing otherwise (with two columns, the first one in turn being a StructArray
that corresponds to Not(:x)
).
As an aside, as there is no nesting issue, I thought maybe DataFrames could just use a list, say select(df, [Not(:x), :y])
, but maybe one wants to wrap things in a special type to overload broadcasting or things like that.
Yes - I assume that Cols
would do unnesting.
Now we need Cols
because, as you note [Not(:x), :y] .=> fun
should have a different behavior than Cols(Not(:x), :y) .=> fun
.
So in summary - do we introduce Cols
? (it does not require removing All
as this can be a separate step, though in DataFrames.jl I will then deprecate All
and allow Cols
if we go this way).
In DataFrames.jl the functionality of Cols
will be exactly like All
except that Cols()
will select no columns.
If we go for it I will make a PR here.
Not
exists outside of DataAPI and can be used with any array, so I think we should keep that name.
Yeah, I agree that the name Not
is fine. NotCols
is kind of unwieldy and df[:, Cols(!Cols(r"x"), :)]
looks goofy.
Cols
isn't bad, and it would make more sense with broadcasting, e.g.Cols(Not(:x)) .=> sum
(sinceNot(:x) .=> sum
cannot work).
Is the reason that Not(:x) .=> sum
cannot work because you would have to pirate ::Not
in order to add broadcasting? It seems unfortunate to have to rely on InvertedIndices.jl
for the definition of Not
, since then you can't make it broadcastable. Would it make sense to add Not
to DataAPI.jl and just let InvertedIndices.Not
be its own thing?
On the other hand, given that Cols
is needed to make r"x"
and :
broadcastable (along with the future curried versions of startswith
and endswith
), maybe it makes sense for Cols
to be broadcastable and for all the other column selectors (including Not
and Between
) to not be broadcastable.
Regarding adding broadcasting to there is open https://github.com/mbauman/InvertedIndices.jl/issues/15, so maybe comment also there? (I guess if it gets more support it might be added)
Related comment here:
https://github.com/JuliaData/DataFrames.jl/issues/2171#issuecomment-618089821
Just as an additional comment, although : .=> fun
does not work, this names(df) .=> fun
works (similarly for all other selectors). It is a bit more cumbersome but works without any magic.
It ocurred to me that a statement like
All(Not(:x), :y)
is a bit of an oxymoron. Also, if one were to construct a statement outside atransform
-like function call you would seeWhich raises the question,
All
of what?Perhaps it could be
Union(Not(:x), :y)
. This reads better in my opinion.