Closed pdeffebach closed 5 years ago
I am done reading it 😄. Thank you for the input. This is indeed a hard nut to crack.
For an alternative to `getproperty` I thought of `getcol(df, x)`. Note that `x` can also be a number.
For an alternative to `setproperty!` I was considering:
- `setcol!(df, x, v)` or `setcol!(df, x=>v)`
- `setcol!(fun, df, x)` (this syntax can be used to handle a `do` block; `fun` takes no parameters)
An alternative to `setcol!` could be `mutate!` with the following semantics (or we could have both):
- `mutate!(df, x=>v...)` (multiple pairs allowed)
- `mutate!(df; x=v...)` (multiple kwargs allowed)
- `mutate!(fun, df)` (this syntax can be used to handle a `do` block; `fun` takes no parameters and returns something that would normally be accepted by the `DataFrame` constructor: a `Pair` or `Pairs`, a `NamedTuple` of vectors, etc.; here the list would have to be defined)
I am giving my loose thoughts here, as I am still not sure what would be best.
Also, I would rather explore macros in the DataFramesMeta.jl package.
I have found a hack!
df = DataFrame(col = [1, 2, 3, 4, 5])
x = :col
df.:($x)
julia> df.:($x)
5-element Array{Int64,1}:
1
2
3
4
5
This might be too ugly to show to users though.
This is nice indeed.
The only limitation of this approach (which means that we probably still need the other methods) is that `x` must be a `Symbol` (so it cannot be an integer, nor can you pass an expression that should be evaluated).
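For comparison with the hack above: the dot syntax is lowered to a `getproperty` call, so calling the function directly already accepts a `Symbol` held in a variable. A minimal sketch, assuming a recent DataFrames.jl release is installed (the variable names are illustrative only):

```julia
using DataFrames

df = DataFrame(col = [1, 2, 3, 4, 5])
x = :col  # column name stored in a variable

# `df.col` is lowered to `getproperty(df, :col)`, so the function form
# works with a variable where the dot syntax only takes a literal name.
v = getproperty(df, x)

# Both forms return the stored column vector itself, not a copy.
same_vector = v === df.col
```

This covers reading a column through a variable; it does not address the assignment side discussed later in the thread.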
We can define the following:
import Base.getproperty
getproperty(df::AbstractDataFrame, i::Int) = DataFrames.columns(df)[i]
But the evaluated symbol is a good point.
Just to be clear, is there no way to have a function like
col(df, :x1) = ...
that has the same symmetry as `df.x1` in terms of `getproperty` or `setproperty!`? As in, is this a limitation in the way Julia parses expressions, and should it be done through a macro?
Another solution, which I like a lot, is having an object where you can overload `getindex` and `setindex!` however you want.
cols(df)[:x] = ... # should work, right?
`cols(df)[:x]` is a possible idea. Let us wait and see what other people think about it.
(There is a question of what the best name for it is, but the idea that `cols(df)` returns a column-oriented object is possible IMO.)
Also, removal of `df[:col]` is orthogonal to "data frame ownership", so if some important arguments in favor of retaining it are raised, this could probably be reconsidered, as not a single line of code has been written yet to remove it 😄.
EDIT: I am writing this because I keep answering different questions in several places, and people keep writing `df[:col]` although we have supported `df.col` for a long time already.
I do not understand what is wrong with `@view df[:, :col]` or, for a variable, `@view df[:, x]`.
It matches well with how views work in the rest of the language. And doesn't it already exist with the behavior you want?
I agree with what @oxinabox says, and I was voting for removing `df[:col]` because it is simply inconsistent with the rest of the design, but I acknowledge that for some reason people keep writing `df[:col]` (maybe only because in the past it was the only option).
@bkamins you are right, this may be premature considering we haven't even started deprecating `df[:col]` yet.
You correctly note that this issue is really about the planned deprecation of `df[:col]`, but I think it does have to do with ownership, in the sense that with the planned changes in #1695 `df.x` is going to have different behavior than `df[:, x]` (one will copy and one will not). So motivating this discussion is a way to ensure we have the right methods that work in global and local scope.
Here is my reasoning regarding `@view df[:, x]`:
- `@view df[:, :col]` returns a view of the entirety of `df.col`, and doesn't return the vector itself.
- `@view df[:, x]` we need to explain to them what a view is. Additionally, these students are going to program in global scope, but we want them to make the switch to functions. This disincentivizes that switch.
- `view`s will be annoyed as they switch from prototyping in global scope to local scope.
- `getproperty` method in the first place. It seems inconsistent to have two different philosophies for indexing data frames in local and global scope (or with literals and variables).
To be clear, neither `df.col`, nor `view(df, :, :col)`, nor `df[:, :col]` is going to change its meaning.
We are considering removing `df[:col]` and `view(df, :col)`. I guess that no one complains about removing `view(df, :col)`, so I will concentrate on `df[:col]`. What problems will the removal of `df[:col]` cause for users:
df
- `df.col` does not support getting columns by their number and does not support invalid identifiers
- `getproperty(df, :col)` which solves the invalid identifier problem; we can consider adding `getproperty(df, number)` support; the problem with this is that `getproperty` is probably not an intuitive name
- `getcol` to have a better name; this should generally solve all the needs of users on the RHS
df
- `df.col = val` does not support setting columns by their number and does not support invalid identifiers
- `setproperty!(df, :col, val)` which solves the invalid identifier problem; we can consider adding `setproperty!(df, number, val)` support; the problem with this is that `setproperty!` is probably not an intuitive name
- `setcol!` to have a better name; however, this is a bit problematic as we will not be able to write `some_function(df) = val`, as this is not valid syntax
`@view df[:, :col]` and `df[:, :col]` are valid alternatives, but neither of them solves the LHS (assignment) problem, because if they are on the LHS they will update the old vector to hold the `val` values, not rebind a new vector to the `:col` name. The problem is best understood, I think, when you try to rewrite the following code assuming `df[:col]` is deprecated:
df = DataFrame()
col = "my fancy name"
df[Symbol(col)] = 1:10
(Of course, adding `setcol!` would handle this without a problem, but it would not use assignment syntax.)
Riiiight, ok, I hadn't considered the LHS problem.
I can see the argument for not deprecating `setindex!(df, data, column::Symbol)`.
A crazy idea might be to be asymmetric about that, so `getindex(df, column::Symbol)` might not work (and would probably give a helpful warning).
The reason it might be sensible not to have this `getindex` is that it would be awkward.
- `setindex!(df, data, column::Symbol)` is.
- `getindex` are copies, unless marked with `@view`
OTOH, my life would be easier if we left that `getindex` and `setindex!` exactly as they are.
> OTOH, my life would be easier if we left that getindex and setindex exactly as they are
That is my point: we have a consistency against convenience issue here.
An available syntax is `df[SOMETHING, col]`, with `SOMETHING` being a replacement for `:` which would indicate that no copy should be done. I haven't been able to find a good idea for that special object, though. It could be a name like `inplace`/`view`, or any symbol (available, or used elsewhere with a different meaning) like `+`, `*`, `~`, `!`, `^`, `|` or even `()`. `!` might kind of make sense given the similarity with `f!`.
`!` makes the most sense to me, as it is consistent with other notations.
Woah, that is so crazy, it might just work.
So we might go this way:
- `df[!, :col]` gets a column directly (this is simple);
- `@view df[!, :col]` gets a view of this column (this is still simple);
- `df[!, cols]` is problematic: should it be a `SubDataFrame` or a `DataFrame`? (I opt for a `DataFrame`, but @nalimilan pointed out that it might be better to use a `SubDataFrame`.)
- `@view df[!, cols]` would mean - this is particularly tricky when someone writes `@view df[!, :]` (`:` as a column selector dynamically resizes the view to reflect the changes in columns in the parent)
> df[!, :col] gets a column directly (this is simple);
> @view df[!, :col] gets a view of this column (this is still simple);
Actually it's not. From the perspective of DataFrames as their own thing, the raw column vector `df[!, :col]` is a view-ish thing.
`@view df[!, :col]` is equivalent to `x::Vector = df[!, :col]; @view x[:]`,
which is a full-length view of a `Vector`.
From the perspective of DataFrames as their own thing, that is a view of a view, which is
1.) Weird
2.) Kind of pointless. (In general, for a vector, `@view x[:]` is pointless.)
As such, it might be that `@view df[!, :col]` should be exactly the same as `df[!, :col]`.
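The point about a full-length view of a vector can be checked with Base Julia alone. A small sketch (no DataFrames.jl involved; the variable names are illustrative only):

```julia
x = [1, 2, 3]
v = @view x[:]   # full-length view of `x`

v[1] = 10        # writing through the view also changes `x`

is_view = v isa SubArray          # a wrapper type, not a `Vector`
shares_memory = parent(v) === x   # the view wraps `x` itself
```

The view adds a wrapper around the same memory, which is why taking it over the whole vector buys little compared to using `x` directly.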
> df[!, cols] is problematic - should it be a SubDataFrame or a DataFrame (I opt for a DataFrame but @nalimilan pointed out that it might be better to use a SubDataFrame)
One factor worth considering is: should `df[!, cols1][!, cols2][!, col] = xs` mutate the original `DataFrame`, as if `df[!, col] = xs` were called?
> As such it might be that @view df[!, :col] should be exactly the same as df[!, :col].
I don't think we should really be concerned about this since it's a corner case. I'd tend to go with the most consistent solution. We could even throw an error for now.
> One factor worth considering is: should df[!, cols1][!, cols2][!, col] = xs should mutate the original DataFrame? as if df[!,col]=xs were called
If we return a `SubDataFrame`, it will throw an error. If we return a `DataFrame`, it won't mutate the original one.
My argument for returning a `SubDataFrame` is that it makes it obvious that column vectors are shared (which we generally want to avoid for `DataFrame` in the ownership approach). The downside could be that it's more convenient to work with a `DataFrame`, but I'm not sure in what situations that would be the case (the example above is a bit convoluted).
So here is my point. We deprecate `df[:col]` in favor of `df[!, :col]` and `df[cols]` in favor of `df[!, cols]`. The benefits of this approach are:
- `df[:col]` and `df[cols]` and we inherit this behavior)
- `!` in a few places and all will work without any other change)
Note that `@view x[:]` is valid syntax and we should support it (as it is also supported in Base). Also note that `df` can be a `DataFrame`, but it can also be a `SubDataFrame` (that is why I said above to inherit all the functionality from the current `getindex` and `setindex!` taking a single-dimensional argument; they consistently handle all these cases).
So for example in the future:
df[!, cols1][!, cols2][!, col] = xs
should do exactly the same as what:
df[cols1][cols2][col] = xs
does currently.
EDIT By "we deprecate" in the first sentence I mean a direct 1-to-1 deprecation without any change in the functionality.
The only drawback is that `df[!, cols]` will create a `DataFrame` that shares its columns with the source, and I understand this is a concern of @nalimilan. Therefore we could just as well allow `df[!, col]` but disallow `df[!, cols]` (and require users to call `select`, which could be introduced soon).
EDIT: The only problem will be that current calls like `df[cols] = x` will not have a convenient way to be expressed (but I think this is something that is very rarely, if ever, used).
We could always throw an error for `df[!, cols]`.
Perhaps, though, for now we don't touch `df[:col]`, leaving it as is.
Make a release with the other new features and return to it again later.
This is what we do, with the only change being that in https://github.com/JuliaData/DataFrames.jl/pull/1742 calling `df[cols]` currently copies the columns.
It would be cool to see how confused people are about `df[:col]` returning a vector and `df[1]` returning a row.
We could emphasize that `df[:col]` is only a replacement for `df.col` and just avoid defining `df[cols::Vector{Symbol}]` at all. You can't do that with `getproperty`, and you can't use integer inputs with `getproperty` either.
There is still some semblance of logic because our goal is just to have a `df.col <--------> df[:col]` symmetry.
Would it make sense to have a feature request to Julia to add some sort of `getproperty` with evaluation?
> It would be cool to see how confused people are about df[:col] returning a vector and df[1] returning a row.
There's no way you'll convince me of doing this madness. We're not Pandas! :-p
Also that would mean there's no short syntax for `df[1]`, which would be weird.
> Would it make sense to have a feature request to Julia to add some sort of getproperty with evaluation?
What do you mean?
FWIW:
- `df[:col]`, mostly use `df.col`, and offer `setcol!(df, x)` and `getcol(df, x)` for when people want to use variables for column names. I think `setcol` and `getcol` are exceedingly readable and clear, and we avoid the "ascii salad" problem we get with the `!` operator, which I can live with, but just seems... weird? Not intuitive? That's not a notation that allows a casual reader to read your code and clearly know what's going on...
- `getproperty` and `setproperty` -- property is just too vague a term (so it isn't obvious what code is doing if you read it), and it requires lots of typing, both of which I think add to confusion.
Just to expand on `setcol!`: we could possibly have both `setcol!(df, col, v)` and `setcol!(fun, df, col)`, where the latter can be used with `do` syntax.
Also, `df[:, col] = v` will be available to change the value of the column in place (`setcol!` is only needed for replacing the column with a new vector).
While we are unclear about what to do with `df[col]` and `df[cols]`, I think it is good to outline what other general methods should be added (as they are needed anyway and largely cover the use cases of `df[col]` and `df[cols]`, except `df[col]` to get a raw vector). My current list is the following (`ColSelector` is anything that `deletecols!` currently accepts):
- `select(df, col::ColSelector; copycolumns::Bool=true)`: create a new `DataFrame` with the selected columns
- `select!`: the same as `select` but without the `copycolumns` kwarg, and it changes the `DataFrame` in place
- `mutate(df, ::Pair{Symbol,Any}...; copycolumns::Bool=true)`: create a new `DataFrame` with added columns specified by `Pair`s
- `mutate!`: the same as `mutate` but without the `copycolumns` kwarg, and it changes the `DataFrame` in place
- `deletecols`: the same as `deletecols!` but with a `copycolumns::Bool=true` kwarg, and it creates a new `DataFrame`
E.g. here `mutate!` would give us a replacement for `df[col] = val` (no matter if we keep it or not).
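To make the copy-versus-share distinction in the proposed `select` concrete, here is a hedged sketch against a recent DataFrames.jl release. It assumes that release is installed and that the keyword is spelled `copycols` there, rather than the `copycolumns` name used in the proposal above; the data is made up for illustration.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4.0, 5.0, 6.0])

# By default the selected columns are copied into the new data frame.
copied = select(df, :a)

# With `copycols = false` (assumed keyword name, see above) the new
# data frame reuses the source column vector.
shared = select(df, :a; copycols = false)

is_copy = copied.a !== df.a
is_shared = shared.a === df.a
```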
@pdeffebach can this be closed given the new rules in https://github.com/JuliaData/DataFrames.jl/blob/master/docs/src/lib/indexing.md?
Closing this: we have `df[!, col]` to handle programmatic direct column access.
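A short sketch of how `df[!, col]` covers both the reading and the assignment cases raised in this thread, assuming a recent DataFrames.jl release is installed (the column names and values are illustrative only):

```julia
using DataFrames

df = DataFrame(col = [1, 2, 3])
x = :col  # column name stored in a variable

direct = df[!, x]   # the stored vector itself, like `df.col`
copied = df[:, x]   # a fresh copy of the column

is_direct = direct === df.col
is_copy = copied !== df.col && copied == df.col

# On the left-hand side, `!` rebinds the column to the new vector.
newcol = [10, 20, 30]
df[!, x] = newcol
rebound = df.col === newcol

# With `:` the values are written into the existing column vector.
df[:, x] = [7, 8, 9]
in_place = df.col === newcol && newcol == [7, 8, 9]

# Names that are not valid identifiers work the same way.
df[!, Symbol("my fancy name")] = [1.0, 2.0, 3.0]
```

The last line is the `!` counterpart of the `df[Symbol(col)] = ...` example earlier in the thread.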
I have remarked about this on Slack and in #1695. On @nalimilan's suggestion I am posting an issue to discuss how we can make `df.col` easier with non-literals.
Motivation: With #1695 (mostly) concluded, we will likely have the following two ways to get a single column from a `DataFrame`:
- `df[:, :col]`: a copy.
- `df.col`: a non-copy.
- `df[:col]`: deprecated.
I really like the `df.col` syntax. I find it intuitive, and it requires little typing. However, it only works with the actual literal `col`. If you have a variable `x` representing the symbol `:col`, you cannot do `df.x` to get the column `:col`. You also can't just do `df[:, x]` because that has different behavior -- a copy rather than the exact same vector.
I want to find a syntax for
df.col = f.(df.col)
where you use the variable `x` to represent `:col`.
Alternative solutions for `getproperty`:
- `getproperty(df, x)` to have the same behavior as `df.col`.
- `select!(df, x)` to have the same behavior.
Alternative solutions for `setproperty!`:
- `setcol!(df, x, v)` will work. However, I don't like it because now you have to worry about an extra set of parentheses. Is complicated enough as is, without putting it in `setcol!(df, x, ...)`.
- Finally, I like the symmetry of `getproperty` and `setproperty!` looking the same. `df.col = f.(df.col)` is elegant.
Solutions:
Ideally:
- `df.$x = v` would be valid Julia syntax, where the `$` escapes whatever follows it.
- `struct` and overload `getproperty(df, s::ColumnSelector)`. This is not possible, however, because Julia doesn't evaluate the expression after the dot at all. This doesn't work.
Pragmatically:
I'm at a bit of a dead end in terms of ideas for this. But I do think it's important. The average scientist codes in global scope, and we aren't going to get them to put things in functions if they have to re-write all their code in a less intuitive way in order to do that.
Thanks for reading my rant.