Closed omus closed 5 years ago
cc: @ararslan @andyferris
Yeah this is really weird. We probably shouldn't define `length` at all.
There was debate about this when it was first added, mostly between those coming from an R background (to whom, I think, the current definition made sense) and those coming from a Pandas background (where length is the number of rows). So what makes the most sense probably depends on what you've used before.
Having it be inconsistent between languages is another reason not to define it here, IMO. Then it confuses no one. 🙂
If we want to think of a dataframe in the relational algebra sense (as a collection of named tuples, i.e. rows), then iterating over rows and having `length` be the number of rows makes sense to me.
There has been a lot of discussion about this surrounding Jeff's `NamedTuple` pull request (partly because it is infrastructure for making such iteration fast).
Given that more descriptive methods such as `size`, `nrow` and `ncol` exist (could be better documented though), I don't really see a reason to keep `length` if there's a debate about what it should return.
It goes with iteration, so if you can't iterate a `DataFrame` then you shouldn't have a `length`.
I'm not sure `length` even needs to go with iteration. For example, we can iterate over a `Channel`, which doesn't provide a `length` method either.
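The same separation exists in other languages: in Python, for instance, generators are perfectly iterable yet have no length. A minimal illustration (a generic analogy, not DataFrames-specific):

```python
def rows():
    # A generator is iterable, but nothing in the iteration
    # protocol requires it to know its length up front.
    yield {"a": 1}
    yield {"a": 2}

print(list(rows()))  # iteration works: [{'a': 1}, {'a': 2}]

try:
    len(rows())      # len() is not part of the iteration protocol
    has_len = True
except TypeError:
    has_len = False
print(has_len)       # False
```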
`length` is an optional part of the iteration protocol, per the documentation. I guess we could have `length` defined on the `EachRow` or whatever iterator types we define for rows/columns, though it doesn't really seem useful there.
`length(df)` is consistent with the fact that `df[1]` returns the first column. We could remove both and require writing `df[:, 1]`.
`nrow` and `ncol` should probably be deprecated too, cf. https://github.com/JuliaStats/DataFrames.jl/issues/406.
> `length(df)` is consistent with the fact that `df[1]` returns the first column. We could remove both and require writing `df[:, 1]`.
Yeah, I recall that confusing me the first time I used dataframes because I figured `df[1]` would give me the first row.
Okay, so the plan as I understand it:

- Deprecate `length` in favor of nothing
- Deprecate linear indexing into a `DataFrame` in favor of two indices
- Deprecate `nrow`/`ncol` in favor of `size`
Actually I'm afraid removing the `df[:a]` syntax would be too annoying. We have even considered supporting `df.a` once/if `getfield` can be overloaded. Don't both R and Pandas support it?
> Don't both R and Pandas support it?
Yes, but pandas determines whether that is a row or a column based on what you give it.
```python
>>> df = pandas.DataFrame({'A': 1., 'B': pandas.Series(1, index=list(range(4)), dtype='float32')})
>>> df
     A    B
0  1.0  1.0
1  1.0  1.0
2  1.0  1.0
3  1.0  1.0
>>> df[:1]
     A    B
0  1.0  1.0
>>> df["A"]
0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64
```
If we restricted column names to `Symbol`s (or automatically converted) then we could always return a column for a `Symbol` and a row for an `Int`?
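That kind of dispatch on the key type can be sketched with a toy class in Python (strings standing in for `Symbol`s; the `Frame` class here is hypothetical, not any real API):

```python
class Frame:
    """Toy columnar table: a dict of equal-length column lists.

    Sketch of the proposal: string keys always mean columns,
    integer keys always mean rows (0-based here).
    """
    def __init__(self, **cols):
        self.cols = cols

    def __getitem__(self, key):
        if isinstance(key, str):   # column access
            return self.cols[key]
        if isinstance(key, int):   # row access
            return {name: col[key] for name, col in self.cols.items()}
        raise TypeError(f"unsupported index type: {type(key).__name__}")

f = Frame(a=[1, 2, 3], b=["x", "y", "z"])
print(f["a"])  # [1, 2, 3]
print(f[0])    # {'a': 1, 'b': 'x'}
```

Unlike pandas' slicing-based row access, the two cases never overlap here, since the key type alone decides.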
> Actually I'm afraid removing the `df[:a]` syntax would be too annoying.

By "linear indexing," I meant specifically with a number. It's not immediately obvious what `df[1]` is, but `df[:a]` is perfectly clear.
Interesting. Honestly, I find Pandas' behavior really confusing: returning either a row or a column depending on the argument type is too clever for my taste. We could stop supporting `df[1]`, since it's indeed less explicit than `df[:a]`, but I'm not sure it would really improve things. At least for now it's consistent with how `NamedArray` works and how `NamedTuple` will work, and it reflects the fact that columns are ordered.
OTOH we can deprecate `nrow`/`ncol` independently of this issue.
Ok, PR up at https://github.com/JuliaStats/DataFrames.jl/pull/1224/. Deprecates `length`, `nrow`, and `ncol` in favor of `size`. Bit of a pain, but hopefully will be cleaner and simpler going forward.
@nalimilan I love pandas for being that clever :smile: There is some stuff in DataFrames.jl which just seems weird to me. Actually there are a lot more people working with pandas than with DataFrames. Maybe it's not the worst choice to be compatible for people who have experience with pandas, which I suppose is a lot of people.
> Actually there are a lot more people working with pandas than with DataFrames. Maybe it's not the worst choice to be compatible for people who have experience with pandas, which I suppose is a lot of people.
The policy generally followed by Julia packages is to try to find a consistent design which makes sense for users once they are familiar with the package. We don't generally support features just because they sound "natural" to people used to other software (but of course we prefer being consistent when that doesn't hurt). Also there are lots of people coming from other software (e.g. R/dplyr/data.table), and what they find "natural" is often mutually exclusive.
I think the way forward here is that once field overloading is available in Base (https://github.com/JuliaLang/julia/pull/24960), we deprecate `df[:col]` in favor of `df.col`, so that `length(df)` can be deprecated in favor of `size(df, 1)` or `nrow(df)`. Then we can discuss whether `df[1]` should be an error or whether it should return the first row, in which case iterating over a `DataFrame` should also return rows (as `NamedTuple` objects).
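Row iteration over columnar storage can be sketched in Python terms, with `collections.namedtuple` standing in for Julia's `NamedTuple` (a hypothetical analogy, not the proposed implementation):

```python
from collections import namedtuple

# Columnar storage: one homogeneous list per column.
cols = {"a": [1, 2, 3], "b": ["x", "y", "z"]}

# One named-tuple type whose fields are the column names.
Row = namedtuple("Row", cols.keys())

# eachrow-style iteration: assemble one named tuple per row
# by zipping the columns together.
rows = [Row(*vals) for vals in zip(*cols.values())]
print(rows[0])    # Row(a=1, b='x')
print(rows[0].a)  # 1
```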
First of all, I agree that overloading will make it easier and that the general policy is reasonable. I'm wondering whether it is necessary to stop supporting `df[:col]`; I think it doesn't harm anyone if it works. `nrow` and `ncol` seem nice in my opinion; also `length` and `width` would work ;)
`df[1]` is probably a bit more challenging than `df[:col]` as there might be two different outcomes.
> I think the way forward here is that once field overloading is available in Base (JuliaLang/julia#24960), we deprecate `df[:col]` in favor of `df.col`, so that `length(df)` can be deprecated in favor of `size(df, 1)` or `nrow(df)`.
Does assigning to a field work in Julia? E.g. can one still do:

```julia
df = DataFrame(a = [1, 2, 3], b = ["a", "b", "c"])
df.c = 1:3
```

the way one can now do

```julia
df = DataFrame(a = [1, 2, 3], b = ["a", "b", "c"])
df[:c] = 1:3
```

? I know in pandas that created a real gotcha -- you can pull a column with the dot notation, but you can't set one using it. If you try, it creates a new property, but not a column, and then you can't find it again...
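The gotcha arises because attribute assignment bypasses the column store while attribute reads fall back to it. A minimal sketch of the mechanism (a hypothetical toy class, not pandas itself):

```python
class Frame:
    # Toy frame: columns live in a dict; attribute *reads* fall back
    # to columns via __getattr__, mimicking dot notation.
    def __init__(self, **cols):
        self.__dict__["cols"] = cols  # avoid recursing into __getattr__

    def __getitem__(self, name):
        return self.cols[name]

    def __setitem__(self, name, values):
        self.cols[name] = values

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self.cols[name]
        except KeyError:
            raise AttributeError(name)

f = Frame(a=[1, 2, 3])
f["b"] = [4, 5, 6]    # bracket assignment creates a real column
f.c = [7, 8, 9]       # attribute assignment creates a plain attribute...
print("c" in f.cols)  # ...not a column: False
print(f.b, f.c)       # yet both *read* fine, which hides the bug
```

Because reads succeed for both, the mistake only surfaces later, e.g. when iterating over columns and `c` is silently missing.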
Also, in a similar vein, note that the dot-field notation causes problems with spaces in column names that are easier to address with the current `df[Symbol("First Name")]` type notation.
Deprecating `df[:a]` wouldn't be great because then you would have to replace `df[x]` with `getfield(df, x)` if `x = :a`.
On Julia 0.7 you can use `df.c = 1:3` on current DataFrames master. But indeed that doesn't completely replace `df[col]`/`df[:, col]` for situations where `col` isn't a literal symbol without spaces. The question is then: is it OK to deprecate `df[col]` in favor of `df[:, col]`, or is it too annoying for these cases?
I feel like I rarely work with the symbols themselves. All of my cleaning is in `for` loops or functions. So the easier it is to refer to columns with a variable, the better.
IMHO I'm of a similar view as @pdeffebach.
My view is that (a) pulling out one column is common enough we need a compact way to do it, and (b) I don't think the dot-field notation is a good substitute for the square-bracket-column-symbol notation.
The problem, in my view, is that dot-field notation is fine for objects with stable field names (like `graph.vertices` in a graph object), but given that column names are inherently unstable in DataFrames, I don't like the dot-field notation because it encourages non-generalizable code: you have to hard-code the field names into your code. That seems contrary to Julia style guidelines. And I don't like the idea of having one syntax for one's own scripts and another for generalizable code.
So I think we should keep support for `df[x]` / `df[:colname]` / `df[[:colname]]` etc. I'm fine with the view above that we should stop supporting numeric indexing into the columns this way (e.g. `df[2]`) and boolean indexing (`df[[true, false, false]]`), but I think just keeping "pass symbols, get columns" for square brackets is unlikely to cause confusion. I think if we also had column names (as in `pandas`) I agree it might be confusing, but as symbols only refer to columns in DataFrames, I think it's pretty clear.
If we support `df[:colname]`, we may as well support `df[1]`. It would be weird to reject integers for this syntax but not for `df[:, 1]`, just because Pandas happens to do something completely weird. Also, the similarity with `NamedTuple` is appealing.
OK -- I'm totally OK with using square brackets as "indexing into columns". I just meant I have stronger feelings about losing the ability to use symbols than losing the ability to do numeric indexing into columns. @nalimilan You've sold me on not doing something pandas-like with sometimes-row-indexing. :)
(EDITS: lots of sloppy typos)
I've been playing around with a `rowwise` command that applies a function to each row of a dataframe, returning a vector of length `nrow(df)`, like Stata's `egen x = rowmean(v1 v2...)`.
With the way dataframes is set up, it's difficult to make this performant, since we have to collect (maybe not with `collect`) each row, and rows may have heterogeneous types. `mapslices`, which acts on matrices, is very fast, on the other hand.
This is fine, because row-wise operations, while I think important enough to live in DataFrames, are relatively uncommon, and DataFrame's structure is well-optimized for column-oriented operations, which is the dominant use-case.
I guess my point is that if people expect something that acts on rows to be as easy and fast as `mapslices`, they are going to be frustrated. So it's better to have an API that differentiates itself more from generic matrix-like functions. In the end, this is just a vote for `nrow` and `ncol` instead of `size`, but the principle can apply more broadly.
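The zip-the-columns idea behind such a `rowwise` helper can be sketched in plain Python (hypothetical names, not the DataFrames implementation):

```python
# Columnar storage: each column is its own homogeneous list.
cols = {
    "v1": [1.0, 2.0, 3.0],
    "v2": [4.0, 5.0, 6.0],
}

# Row-wise mean in the spirit of Stata's `egen x = rowmean(v1 v2)`:
# zip the columns so each row is assembled on the fly, without ever
# materializing a (possibly heterogeneously typed) row container.
rowmean = [sum(row) / len(row) for row in cols and zip(*cols.values())]
print(rowmean)  # [2.5, 3.5, 4.5]
```

The point stands regardless of language: assembling rows from columns is extra work that a row-oriented matrix function like `mapslices` never has to do.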
> Currently calling `length` on a `DataFrame` returns the number of columns. This is strange as `length` usually returns the number of elements.