JuliaData / DataAPI.jl

A data-focused namespace for packages to share functions
Other
33 stars 13 forks source link

Rename `All` to `Union`, or to `Cols`? #16

Closed pdeffebach closed 4 years ago

pdeffebach commented 4 years ago

It ocurred to me that a statement like All(Not(:x), :y) is a bit of an oxymoron. Also, if one were to construct a statement outside a transform-like function call you would see

All(Not(:x), :y)

Which raises the question, All of what?

Perhaps it could be Union(Not(:x), :y). This reads better in my opinion.

piever commented 4 years ago

One issue is that All() is used to select all columns, I think Union() would be very confusing in that sense (one would expect the empty set with Union()).

pdeffebach commented 4 years ago

We could deprecate that in favor of Union(:). This is just an idea, I'm not wedded to it but think it might be helpful.

bkamins commented 4 years ago

I agree that Union is a better name. Actually we had a bug in DataFrames.jl because All() is actually All(:) that was recently fixed. However, Union is not the best name as it is used by Base, so probably All should stay.

pdeffebach commented 4 years ago

Thoughts on Cols? It can be thought of as a container for Columns.

Cols(Not(:id:), r"^t", [:income, :consumption])

Makes it pretty clear that we are building a container of columns.

CameronBieganek commented 4 years ago

I think that Cols is the best choice. The docstring for All is actually incorrect:

All(cols...) Select the union of the selections in cols.

All creates a tuple of sets of columns, not the union of the sets. For example,

All(:x, :y) != All(:y, :x)
All(:y, :) != All(:)

In DataFrames.jl the ordering produced by All is used to choose the order of the columns when indexing.

The actual behavior of All is this:

I think this behavior is much better captured by the Cols name. Then Cols() would select zero columns and Cols(:) would select every column.

Tagging @bkamins, since this would be a breaking change.

EDIT: Of course maybe the docstring could just be updated. The lack of context of the word All shouldn't generally be a problem, since it pretty much always appears inside an indexing or select expression. And if All is changed to Cols, then it might make sense to change Not to NotCols, because Not can occur without an All/Cols and is subject to the same criticism of "not what?".

Here's another idea. Use Cols instead of All and write a Base.:!(::Cols) method to replace Not. Example:

df = DataFrame(r=1, x1=2, x2=3, y=4)
df[:, Cols(!Cols(r"x"), :)]

I've also thought that it might make sense to just use a literal tuple instead of All. However, I think if I saw df[:, (:x, :x)], then I would expect column :x to be repeated.

bkamins commented 4 years ago

@piever - what do you think about it? I think that in DataFrames.jl it is not a problem to make this change now if it were preferred (but after 1.0 release this will be a problem), but this is a DataAPI.jl change that would have a larger impact.

CC @nalimilan

nalimilan commented 4 years ago

Cols isn't bad, and it would make more sense with broadcasting, e.g. Cols(Not(:x)) .=> sum (since Not(:x) .=> sum cannot work). But should Cols() still be equivalent to Cols(:)? That's less explicit than All(), though probably not the end of the world.

@CameronBieganek Not exists outside of DataAPI and can be used with any array, so I think we should keep that name.

bkamins commented 4 years ago

If we go for Cols then I think Cols() should not select any columns.

Now as I think of it in general - actually if JuliaDB.jl wants to keep All it is OK. We can just allow Cols as a new selector here and use it in DataFrames.jl. Then there is another question if we deprecate All or keep it as a legacy.

piever commented 4 years ago

I confess my main gripe with the status quo is that All() meaning all columns and All(args...) meaning more or less the union is a bit inconsistent (empty union should be no columns), but also, entirely my fault :) https://github.com/JuliaComputing/IndexedTables.jl/pull/123. Before, it used to be called Join.

These references may be relevant: https://github.com/JuliaComputing/IndexedTables.jl/issues/112 and https://github.com/JuliaComputing/IndexedTables.jl/pull/119. There is a bit of a discussion comparing with special selectors from dplyr.

For me to understand, the behavior would remain unchanged other than for Cols(), so say Cols([:x, :y], :z) would be the same as Cols(:x, :y, :z)? From the point of view of JuliaDB that's an important use: the main use of All selector is "unnesting", because (Not(:x), :y) would get a nested thing otherwise (with two columns, the first one in turn being a StructArray that corresponds to Not(:x)).

As an aside, as there is no nesting issue, I thought maybe DataFrames could just use a list, say select(df, [Not(:x), :y]), but maybe one wants to wrap things in a special type to overload broadcasting or things like that.

bkamins commented 4 years ago

Yes - I assume that Cols would do unnesting.

Now we need Cols because, as you note [Not(:x), :y] .=> fun should have a different behavior than Cols(Not(:x), :y) .=> fun.

bkamins commented 4 years ago

So in summary - do we introduce Cols? (it does not require removing All as this can be a separate step, though in DataFrames.jl I will then deprecate All and allow Cols if we go this way).

In DataFrames.jl the functionality of Cols will be exactly like All except that Cols() will select no columns.

If we go for it I will make a PR here.

CameronBieganek commented 4 years ago

Not exists outside of DataAPI and can be used with any array, so I think we should keep that name.

Yeah, I agree that the name Not is fine. NotCols is kind of unwieldy and df[:, Cols(!Cols(r"x"), :)] looks goofy.

Cols isn't bad, and it would make more sense with broadcasting, e.g. Cols(Not(:x)) .=> sum (since Not(:x) .=> sum cannot work).

Is the reason that Not(:x) .=> sum cannot work because you would have to pirate ::Not in order to add broadcasting? It seems unfortunate to have to rely on InvertedIndices.jl for the definition of Not, since then you can't make it broadcastable. Would it make sense to add Not to DataAPI.jl and just let InvertedIndices.Not be its own thing?

On the other hand, given that Cols is needed to make r"x" and : broadcastable (along with the future curried versions of startswith and endswith), maybe it makes sense for Cols to be broadcastable and for all the other column selectors (including Not and Between) to not be broadcastable.

bkamins commented 4 years ago

Regarding adding broadcasting to there is open https://github.com/mbauman/InvertedIndices.jl/issues/15, so maybe comment also there? (I guess if it gets more support it might be added)

CameronBieganek commented 4 years ago

Related comment here:

https://github.com/JuliaData/DataFrames.jl/issues/2171#issuecomment-618089821

bkamins commented 4 years ago

Just as an additional comment, although : .=> fun does not work, this names(df) .=> fun works (similarly for all other selectors). It is a bit more cumbersome but works without any magic.