JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

Convenience methods for `getproperty` and `setproperty` in DataFrames with new ownership rules #1753

Closed pdeffebach closed 5 years ago

pdeffebach commented 5 years ago

I have remarked about this on Slack and in #1695. On @nalimilan's suggestion I am posting an issue to discuss how we can make df.col easier with non-literals.

Motivation: With #1695 (mostly) concluded, we will likely have the following two ways to get a single column from a DataFrame

I really like df.col syntax. I find it intuitive, and it requires little typing. However, it only works with the actual literal col. If you have a variable x representing the symbol :col, you cannot do df.x to get the column :col. You also can't just do df[:, x], because that has different behavior: it returns a copy rather than the exact same vector.

I want to find a syntax for

df.col = f.(df.col)

where you use the variable x to represent :col.
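
For context, df.col is ordinary Julia syntax that lowers to getproperty(df, :col), and df.col = v lowers to setproperty!(df, :col, v). Calling those functions directly therefore already accepts a variable, at the cost of losing the dot syntax. A minimal sketch, assuming a DataFrames.jl release in which df.col returns the stored vector without copying; f is a placeholder function invented for illustration:

```julia
using DataFrames

df = DataFrame(col = [1, 2, 3])
x = :col                 # the column name held in a variable
f(v) = 2v                # placeholder transformation, not from the thread

# Function-call form of `df.col`: returns the stored vector, no copy.
v = getproperty(df, x)
same_vector = v === df.col

# Function-call form of `df.col = f.(df.col)`.
setproperty!(df, x, f.(getproperty(df, x)))
```

This keeps the get/set symmetry but is noticeably more verbose than the dot syntax, which is the gap the issue is about.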

Alternative solutions for getproperty:

Alternative solutions for `setproperty!`:

df.col = map(v) do e 
    @match e begin
    ... pattern matching
    end
end

Is complicated enough as is, without putting it in setcol!(df, x, ...).

Finally, I like the symmetry of getproperty and setproperty! looking the same. df.col = f.(df.col) is elegant.

Solutions:

Ideally:

Pragmatically:

I'm at a bit of a dead end in terms of ideas for this. But I do think it's important. The average scientist codes in global scope, and we aren't going to get them to put things in functions if they have to re-write all their code in a less intuitive way in order to do that.

Thanks for reading my rant.

bkamins commented 5 years ago

Thanks for reading my rant.

I am done reading it 😄. Thank you for the input. This is indeed a hard nut to crack.

For an alternative to getproperty I thought of getcol(df, x). Note that x can also be a number.

For an alternative to setproperty! I was considering:

An alternative to setcol! could be mutate! with the semantics (or we could have both):

I am giving here my loose thoughts as I am still not sure what would be best.

Also, I would rather explore macros in the DataFramesMeta.jl package.

pdeffebach commented 5 years ago

I have found a hack!

julia> df = DataFrame(col = [1, 2, 3, 4, 5]);

julia> x = :col;

julia> df.:($x)
5-element Array{Int64,1}:
 1
 2
 3
 4
 5

This might be too ugly to show to users though.

bkamins commented 5 years ago

This is nice indeed. The only limitation of this approach (which means that we probably still need the other methods) is that x must be a Symbol: it cannot be an integer, and you cannot pass an expression that should be evaluated.
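
A likely reason the hack works: :($x) is a quoted expression whose entire body is an interpolation, so it evaluates to the value of x, and the dot syntax then hands that value to getproperty. The sketch below uses a plain NamedTuple rather than a data frame so that it runs without any package; nt is a stand-in invented for illustration.

```julia
nt = (col = [1, 2, 3],)   # stand-in for a data frame
x = :col

# Interpolating a variable into a quote yields the variable's value.
quoted = :($x)

# Both calls therefore look up the same property.
via_quote = getproperty(nt, :($x))
via_variable = getproperty(nt, x)
```

Seen this way, the hack is a spelling of getproperty(df, x), which is why it inherits the Symbol-only restriction mentioned above.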

pdeffebach commented 5 years ago

We can define the following:

import Base.getproperty
getproperty(df::AbstractDataFrame, i::Int) = DataFrames.columns(df)[i]

But the evaluated symbol is a good point.

Just to be clear, there is no way to have a function like

col(df, :x1) = ... 

That has the same symmetry as df.x1 in terms of getproperty or setproperty? As in, this is a limitation in the way Julia parses expressions and should be done through a macro?

Another solution, which I like a lot, is having an object where you can overload getindex and setindex! however you want.

cols(df)[:x] = ... # should work, right?

bkamins commented 5 years ago

cols(df)[:x] is a possible idea. Let us wait and see what other people think about it. (There is the question of what the best name for it is, but the idea that cols(df) returns a column-oriented object is feasible IMO.)
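
A rough sketch of how such a wrapper could look. ColumnsView and cols are hypothetical names invented here and are not part of DataFrames.jl; the sketch also assumes a release that supports df[!, col] for direct column access.

```julia
using DataFrames

# Hypothetical column-oriented wrapper around a data frame.
struct ColumnsView
    df::DataFrame
end

# Hypothetical constructor function matching the `cols(df)` idea.
cols(df::DataFrame) = ColumnsView(df)

# Reading returns the stored vector itself (no copy).
Base.getindex(c::ColumnsView, name) = c.df[!, name]

# Writing replaces the column with the given vector.
function Base.setindex!(c::ColumnsView, v::AbstractVector, name)
    c.df[!, name] = v
    return c
end

df = DataFrame(a = [1, 2, 3])
x = :a
cols(df)[x] = [10, 20, 30]        # assignment syntax with a variable name
same_vector = cols(df)[x] === df.a
```

The wrapper gives assignment syntax on both sides, at the price of one more exported name and an extra object type to explain to users.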

bkamins commented 5 years ago

Also removal of df[:col] is orthogonal to "data frame ownership" - so if some important arguments in favor of retaining it are raised this probably could be reconsidered as not a single line of code was written yet to remove it 😄.

EDIT I am writing it as I keep answering to different questions in several places and people keep writing df[:col] although already for a long time we support df.col.

oxinabox commented 5 years ago

I do not understand what is wrong with @view df[:, :col] or, with a variable, @view df[:, x].

It matches well with how views work in the rest of the language. And doesn't it already exist with the behavior you want?
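
To make the suggestion concrete, here is a small sketch of what a column view gives you, assuming a DataFrames.jl release where view(df, :, col) returns a view into the stored column:

```julia
using DataFrames

df = DataFrame(col = [1, 2, 3])
x = :col

v = @view df[:, x]     # a view into the stored column, no copy
v[1] = 100             # writes through to the data frame
v .= v .+ 1            # in-place broadcast also writes through

copied = df[:, x]      # by contrast, plain indexing with `:` copies
copied[1] = -1         # does not touch the data frame
```

Writes through the view reach the data frame, but the view can only modify the existing vector; it cannot bind a different vector to the column name.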

bkamins commented 5 years ago

I agree with what @oxinabox says, and I was voting for removing df[:col] because it is simply inconsistent with the rest of the design, but I just acknowledge that for some reason people keep writing df[:col] (maybe it is only because in the past it was the only option).

pdeffebach commented 5 years ago

@bkamins you are right, this may be premature considering we haven't even started deprecating df[:col] yet.

You correctly note that this issue is really about the planned deprecation of df[:col], but I think it does have to do with ownership, in the sense that with the planned changes in #1695 df.x is going to behave differently from df[:, x] (one will copy and one will not). So the motivation for this discussion is to ensure we have the right methods that work in both global and local scope.

Here is my reasoning regarding @view df[:, x]

bkamins commented 5 years ago

To be clear, none of df.col, view(df, :, :col), or df[:, :col] is going to change its meaning.

We are considering removing df[:col] and view(df, :col). I guess that no one complains about removing view(df, :col), so I will concentrate on df[:col]. What problems will removing df[:col] cause for users?

@view df[:, :col] and df[:, :col] are valid alternatives, but neither of them solves the LHS (assignment) problem: on the LHS they update the old vector in place with the values of val rather than rebinding the name :col to a new vector. I think the problem is best understood when you try to rewrite the following code assuming df[:col] is deprecated:

df = DataFrame()
col = "my fancy name"
df[Symbol(col)] = 1:10

(of course adding setcol! would allow to handle this without a problem but it would not use assignment syntax)

oxinabox commented 5 years ago

Riiiight, ok, I hadn't considered the LHS problem. I can see the argument for not deprecating setindex!(df, data, column::Symbol). A crazy idea might be to be asymmetric about that, so getindex(df, column::Symbol) might not work (and would probably give a helpful warning). The reason it might be sensible not to have this getindex is that it would be awkward, since there are two conflicting expectations:

  1. it should be direct (viewish) column access since setindex!(df, data, column::Symbol) is.
  2. it should be a copy since all other getindex are copies, unless marked with @view

OTOH, my life would be easier if we left that getindex and setindex exactly as they are

bkamins commented 5 years ago

OTOH, my life would be easier if we left that getindex and setindex exactly as they are

That is my point - we have consistency against convenience issue here.

nalimilan commented 5 years ago

An available syntax is df[SOMETHING, col], with SOMETHING being a replacement for : that would indicate no copy should be made. I haven't been able to find a good idea for that special object, though. It could be a name like inplace/view, or any symbol (available or used elsewhere with a different meaning) like +, *, ~, !, ^, | or even (). ! might kind of make sense given the similarity with f!.
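
The reason a symbol such as ! can stand in the row position is that ! is an ordinary function in Julia, so getindex and setindex! can dispatch on typeof(!). A toy sketch of that mechanism follows; ToyTable is a hypothetical type invented here and says nothing about how DataFrames.jl stores its columns.

```julia
# Hypothetical minimal table: column name => column vector.
struct ToyTable
    columns::Dict{Symbol,Vector{Int}}
end

# `t[!, col]`: hand out the stored vector itself (no copy).
Base.getindex(t::ToyTable, ::typeof(!), col::Symbol) = t.columns[col]

# `t[:, col]`: hand out a copy, like other getindex calls.
Base.getindex(t::ToyTable, ::Colon, col::Symbol) = copy(t.columns[col])

# `t[!, col] = v`: bind a new vector to the column name.
function Base.setindex!(t::ToyTable, v::Vector{Int}, ::typeof(!), col::Symbol)
    t.columns[col] = v
    return t
end

t = ToyTable(Dict(:a => [1, 2, 3]))
x = :a
direct = t[!, x]          # same object as the stored column
copied = t[:, x]          # equal contents, different object
new_column = [7, 8, 9]
t[!, x] = new_column      # rebinds :a to new_column
```

Because the marker sits in the row position, the column position stays free for a variable, an integer, or any other selector.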

bkamins commented 5 years ago

! makes the most sense to me, as it is consistent with other notation.

oxinabox commented 5 years ago

Woah, that is so crazy, it might just work.

bkamins commented 5 years ago

So might go this way:

oxinabox commented 5 years ago

df[!, :col] gets a column directly (this is simple);

@view df[!, :col] gets a view of this column (this is still simple);

Actually it's not. If you think of DataFrames as their own thing, the raw column vector df[!, :col] is already a viewish thing. @view df[!, :col] is equivalent to x::Vector = df[!, :col]; @view x[:], which is a full-length view of a Vector. From the DataFrames perspective that is a view of a view, which is (1) weird and (2) kind of pointless. (In general, for a vector, @view x[:] is pointless.)

As such it might be that @view df[!, :col] should be exactly the same as df[!, :col].

df[!, cols] is problematic - should it be a SubDataFrame or a DataFrame (I opt for a DataFrame but @nalimilan pointed out that it might be better to use a SubDataFrame)

One factor worth considering is: should df[!, cols1][!, cols2][!, col] = xs mutate the original DataFrame, as if df[!, col] = xs were called?

nalimilan commented 5 years ago

As such it might be that @view df[!, :col] should be exactly the same as df[!, :col].

I don't think we should really be concerned about this since it's a corner case. I'd tend to go with the most consistent solution. We could even throw an error for now.

One factor worth considering is: should df[!, cols1][!, cols2][!, col] = xs mutate the original DataFrame, as if df[!, col] = xs were called?

If we return a SubDataFrame, it will throw an error. If we return a DataFrame, it won't mutate the original one.

My argument to return a SubDataFrame is that it makes it obvious that column vectors are shared (which we generally want to avoid for DataFrame in the ownership approach). The downside could be that it's more convenient to work with a DataFrame, but I'm not sure in what situations that would be the case (the example above is a bit convoluted).
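
For illustration, this is what the "return a DataFrame" option looks like in practice. The sketch assumes a DataFrames.jl release in which df[!, cols] returns a new DataFrame whose columns are not copied:

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [3, 4], c = [5, 6])

shared = df[!, [:a, :b]]            # new DataFrame, column vectors not copied
columns_shared = shared.a === df.a  # both frames hold the same vector

shared[!, :a] = [10, 20]            # rebinds :a in `shared` only
```

Under that option the chained assignment from the question does not reach the original data frame, because each df[!, cols] step produces a separate DataFrame object, while element-wise writes into a shared vector would still be visible from both.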

bkamins commented 5 years ago

So here is my point. We deprecate df[:col] in favor of df[!, :col] and df[cols] in favor of df[!, cols]. The benefits of this approach are:

Note that @view x[:] is a valid syntax and we should support it (as it is also supported in Base). Also note that df can be a DataFrame but it also can be a SubDataFrame (that is why I have said above to inherit all the functionality from current getindex and setindex! taking a single dimensional argument and they consistently handle all these cases).

So for example in the future:

df[!, cols1][!, cols2][!, col] = xs

should do exactly what:

df[cols1][cols2][col] = xs

does currently.

EDIT By "we deprecate" in the first sentence I mean a direct 1-to-1 deprecation without any change in the functionality.

bkamins commented 5 years ago

The only drawback is that df[!, cols] will create a DataFrame that shares the columns with the source, and I understand this is a concern of @nalimilan. Therefore we could as well allow df[!, col] but disallow df[!, cols] (and require users to call select that could be soon introduced).

EDIT The only problem is that current calls like df[cols] = x will have no convenient equivalent (but I think this is very rarely, if ever, used).

oxinabox commented 5 years ago

We could always throw an error for df[!, cols]

Perhaps though for now, we don't touch df[:col], leaving it as is. Make a release with the other new features and return to it again later.

bkamins commented 5 years ago

This is what we do, with the only change being that in https://github.com/JuliaData/DataFrames.jl/pull/1742 calling df[cols] currently copies the columns.

pdeffebach commented 5 years ago

It would be cool to see how confused people are about df[:col] returning a vector and df[1] returning a row.

We could emphasize that df[:col] is only a replacement for df.col and just avoid defining df[cols::Vector{Symbol}] at all. You can't do that with getproperty and you can't use integer inputs with getproperty either.

There is still some semblance of logic because our goal is just to have a df.col <--------> df[:col] symmetry.

Would it make sense to have a feature request to Julia to add some sort of getproperty with evaluation?

nalimilan commented 5 years ago

It would be cool to see how confused people are about df[:col] returning a vector and df[1] returning a row.

There's no way you'll convince me of doing this madness. We're not Pandas! :-p

Also that would mean there's no short syntax for df[1], which would be weird.

Would it make sense to have a feature request to Julia to add some sort of getproperty with evaluation?

What do you mean?

nickeubank commented 5 years ago

FWIW:

bkamins commented 5 years ago

Just to expand on setcol! we possibly could have both setcol!(df, col, v) and setcol!(fun, df, col) where the latter can be used with do syntax.

Also df[:, col] = v will be available to change the value of the column in-place (setcol! is only needed for replacing the column with a new vector).
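
A sketch of how the two setcol! forms could look next to the in-place assignment. setcol! and ColName are hypothetical here (no such function is confirmed to exist in DataFrames.jl), and the sketch assumes a release that supports df[!, col] for replacing a column:

```julia
using DataFrames

# Hypothetical alias for the column selectors accepted by this sketch.
const ColName = Union{Symbol, Integer}

# Hypothetical helper: replace column `col` with the vector `v`.
setcol!(df::DataFrame, col::ColName, v::AbstractVector) = (df[!, col] = v; df)

# Hypothetical helper: replace column `col` with `fun` applied to the old column.
setcol!(fun, df::DataFrame, col::ColName) = setcol!(df, col, fun(df[!, col]))

df = DataFrame(col = [1, 2, 3])
x = :col
old = df.col

df[:, x] = [10, 20, 30]     # in place: values are written into the old vector
in_place = df.col === old

setcol!(df, x) do v         # do-block form: the column becomes a new vector
    v ./ 10
end
replaced = df.col !== old
```

The in-place form keeps the element type of the existing vector, while the replacing form can change it, here from Int to Float64.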

bkamins commented 5 years ago

While we are unclear about what to do with df[col] and df[cols], I think it is good to outline what other general methods should be added (as they are needed anyway and largely cover the use cases of df[col] and df[cols], except df[col] to get a raw vector). My current list is the following (ColSelector is anything that deletecols! would currently accept):

E.g. here mutate! would give us a replacement of df[col] = val (no matter if we keep it or not).

bkamins commented 5 years ago

@pdeffebach can this be closed given the new rules in https://github.com/JuliaData/DataFrames.jl/blob/master/docs/src/lib/indexing.md?

bkamins commented 5 years ago

Closing this - we have df[!, col] to handle programmatic direct column access.
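
For readers landing here later, a short sketch of the motivating df.col = f.(df.col) pattern written with a variable, assuming a DataFrames.jl release that implements df[!, col]; f is a placeholder function invented for illustration:

```julia
using DataFrames

df = DataFrame(col = [1, 2, 3, 4, 5])
x = :col                  # column name held in a variable
f(v) = v^2                # placeholder transformation, not from the thread

# Reading: the stored vector itself, same object as df.col.
same_vector = df[!, x] === df.col

# Writing: replace the column, mirroring `df.col = f.(df.col)`.
df[!, x] = f.(df[!, x])
```

The same form works inside functions and in global scope, which was the original concern.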