JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

length(::DataFrame) returns number of columns #1200

Closed omus closed 5 years ago

omus commented 7 years ago

Currently calling length on a DataFrame returns the number of columns. This is strange as length usually returns the number of elements.
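For illustration, the behavior in question (DataFrames.jl as of this issue; shown as comments, since later versions removed `length(::DataFrame)`):

```julia
using DataFrames

df = DataFrame(a = 1:4, b = 5:8, c = 9:12)  # 4 rows, 3 columns

length(df)  # 3 -- the number of columns, not rows or elements
size(df)    # (4, 3)
```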

omus commented 7 years ago

cc: @ararslan @andyferris

ararslan commented 7 years ago

Yeah this is really weird. We probably shouldn't define length at all.

kmsquire commented 7 years ago

There was debate about this when it was first added, mostly between those coming from an R background (to whom, I think, the current definition made sense) and those coming from a Pandas background (where length is the number of rows). So what makes the most sense probably depends on what you've used before.

ararslan commented 7 years ago

Having it be inconsistent between languages is another reason not to define it here, IMO. Then it confuses no one. 🙂

andyferris commented 7 years ago

If we want to think of a dataframe in the relational algebra sense (as a collection of named tuples, i.e. rows), then iterating over rows and having length for the number of rows makes sense to me.
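As a plain-Julia illustration of that view (no DataFrames API assumed), a table seen as a vector of named tuples iterates over rows, and `length` then naturally counts rows:

```julia
# A table viewed relationally: a collection of rows (named tuples).
rows = [(name = "a", x = 1), (name = "b", x = 2), (name = "c", x = 3)]

# Iteration visits one row at a time.
for row in rows
    println(row.name, " => ", row.x)
end

length(rows)  # 3 -- the number of rows, consistent with iteration
```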

There has been a lot of discussion about this surrounding Jeff's NamedTuple pull request (partly because it is infrastructure for making such iteration fast).

rofinn commented 7 years ago

Given that more descriptive methods such as size, nrow and ncol exist (could be better documented though) I don't really see a reason to keep length if there's a debate about what it should return.
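For reference, the more descriptive alternatives mentioned here:

```julia
using DataFrames

df = DataFrame(x = 1:5, y = 6:10)

nrow(df)     # 5
ncol(df)     # 2
size(df)     # (5, 2)
size(df, 1)  # 5 -- rows
size(df, 2)  # 2 -- columns
```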

andyferris commented 7 years ago

It goes with iteration, so if you can't iterate a DataFrame then you shouldn't have a length.

rofinn commented 7 years ago

I'm not sure length even needs to go with iteration. For example, we can iterate over a Channel which doesn't provide a length method either.

ararslan commented 7 years ago

length is an optional part of the iteration protocol, per the documentation. I guess we could have length defined on the EachRow or whatever iterator types we define for rows/columns, though it doesn't really seem useful there.

nalimilan commented 7 years ago

length(df) is consistent with the fact that df[1] returns the first column. We could remove both and require writing df[:, 1].

nrow and ncol should probably be deprecated too, cf. https://github.com/JuliaStats/DataFrames.jl/issues/406.
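The indexing behavior under discussion, side by side (as it worked at the time; single-index column access was later deprecated):

```julia
using DataFrames

df = DataFrame(a = 1:3, b = 4:6)

df[1]     # first *column* (the vector [1, 2, 3]) -- the surprising case
df[:, 1]  # first column again, but with explicit row/column indices
df[1, :]  # first *row*
```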

rofinn commented 7 years ago

length(df) is consistent with the fact that df[1] returns the first column. We could remove both and require writing df[:, 1].

Yeah, I recall that confusing me the first time I used DataFrames, because I figured df[1] would give me the first row.

ararslan commented 7 years ago

Okay, so the plan as I understand it:

  1. Deprecate length in favor of nothing
  2. Deprecate linear indexing into a DataFrame in favor of two indices
  3. Deprecate nrow/ncol in favor of size
nalimilan commented 7 years ago

Actually I'm afraid removing the df[:a] syntax would be too annoying. We have even considered supporting df.a once/if getfield can be overloaded. Don't both R and Pandas support it?

rofinn commented 7 years ago

Don't both R and Pandas support it?

Yes, but pandas determines whether that is a row or col based on what you give it.

>>> df = pandas.DataFrame({ 'A' : 1., 'B' : pandas.Series(1,index=list(range(4)),dtype='float32'),})
>>> df
     A    B
0  1.0  1.0
1  1.0  1.0
2  1.0  1.0
3  1.0  1.0
>>> df[:1]
     A    B
0  1.0  1.0
>>> df["A"]
0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64

If we restricted column names to Symbols (or automatically converted) then we could always return columns for Symbol and row for Int?

ararslan commented 7 years ago

Actually I'm afraid removing the df[:a] syntax would be too annoying.

By "linear indexing," I meant specifically with a number. It's not immediately obvious what df[1] is, but df[:a] is perfectly clear.

nalimilan commented 7 years ago

Interesting. Honestly, I find Pandas' behavior really confusing: returning either a row or a column depending on the argument type is too clever for my taste. We could stop supporting df[1], since it's indeed less explicit than df[:a], but I'm not sure it would really improve things. At least for now it's consistent with how NamedArray works and how NamedTuple will work, and it reflects the fact that columns are ordered.

OTOH we can deprecate nrow/ncol independently of this issue.

quinnj commented 6 years ago

Ok, PR up at https://github.com/JuliaStats/DataFrames.jl/pull/1224/. Deprecates length, nrow, and ncol in favor of size. Bit of a pain, but hopefully will be cleaner and simpler going forward.

Wikunia commented 6 years ago

@nalimilan I love pandas for being that clever :smile: There is some stuff that just seems weird to me in DataFrames.jl. Actually, there are a lot more people working with pandas than with DataFrames. Maybe it's not the worst choice to be compatible for people who have experience with pandas, which I suppose is a lot of people.

nalimilan commented 6 years ago

Actually there are a lot more people working with pandas than with DataFrames. Maybe it's not the worst choice to be compatible for people who have experience with pandas which I suppose are a lot of people.

The general policy followed by Julia packages is to try to find a consistent design which makes sense for users once they are familiar with the package. We don't generally support features just because they sound "natural" to people used to other software (though of course we prefer being consistent when that doesn't hurt). Also, there are lots of people coming from other software (e.g. R/dplyr/data.table), and what they find "natural" is often mutually exclusive.

I think the way forward here is that once field overloading is available in Base (https://github.com/JuliaLang/julia/pull/24960), we deprecate df[:col] in favor of df.col, so that length(df) can be deprecated in favor of size(df, 1) or nrow(df). Then we can discuss whether df[1] should be an error or whether it should return the first row, in which case iterating over a DataFrame should also return rows (as NamedTuple objects).
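Sketching that migration path (hypothetical at the time of writing; df.col assumes the getfield-overloading feature from JuliaLang/julia#24960):

```julia
# Before:
df[:col]      # column access by symbol indexing
length(df)    # number of columns

# After the proposed deprecations:
df.col        # column access via field syntax
nrow(df)      # number of rows (or size(df, 1))
size(df, 2)   # number of columns
```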

Wikunia commented 6 years ago

First of all, I agree that overloading will make it easier and that the general policy is reasonable. I'm wondering whether it is necessary to stop supporting df[:col]; I think it doesn't harm anyone if it works. nrow and ncol seem nice in my opinion; length and width would work too ;) df[1] is probably a bit more challenging than df[:col], as there are two plausible outcomes.

nickeubank commented 6 years ago

I think the way forward here is that once field overloading is available in Base (JuliaLang/julia#24960), we deprecate df[:col] in favor of df.col, so that length(df) can be deprecated in favor of size(df, 1) or nrow(df).

Does assigning to a field work in julia? e.g. can one still do:

df = DataFrame(a = [1, 2, 3], b = ["a", "b", "c"]) 
df.c = 1:3

the way one can now do

df = DataFrame(a = [1, 2, 3], b = ["a", "b", "c"]) 
df[:c] = 1:3

? I know in pandas this creates a real gotcha -- you can pull a column with dot-notation, but you can't set one with it. If you try, it creates a new attribute rather than a column, and then you can't find the "column" again...

Also, in a similar vein, note that the dot-field notation causes problems with spaces in column names that are easier to address with the current df[Symbol("First Name")] type notation.

pdeffebach commented 6 years ago

Deprecating df[:a] wouldn't be great because then you would have to replace df[x] with getfield(df, x) if x = :a.

nalimilan commented 6 years ago

On Julia 0.7 you can use df.c = 1:3 on current DataFrames master. But indeed that doesn't completely replace df[col]/df[:, col] for situations where col isn't a literal symbol without spaces. The question is then: is it OK to deprecate df[col] in favor of df[:, col], or is it too annoying for these cases?
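Concretely, the two forms being compared (the field-style assignment assumes Julia 0.7 with DataFrames master, as stated above; the bracket form is the era's single-index syntax):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = ["a", "b", "c"])

df.c = 1:3                       # field-style assignment (Julia 0.7+)
df[Symbol("First Name")] = 4:6   # bracket form is still needed when the
                                 # column name isn't a valid identifier
```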

pdeffebach commented 6 years ago

I feel like I rarely work with the symbols themselves. All of my cleaning is in for loops or functions, so the easier it is to refer to columns with a variable, the better.

nickeubank commented 6 years ago

IMHO I'm of a similar view as @pdeffebach.

My view is that (a) pulling out one column is common enough we need a compact way to do it, and (b) I don't think the dot-field notation is a good substitute for the square-bracket-column-symbol notation.

The problem, in my view, is that dot-field notation is fine for objects with stable field names (like graph.vertices in a graph object). But column names are inherently unstable in DataFrames, so dot-field notation encourages non-generalizable code: you have to hard-code the field names into your code, which seems contrary to Julia style guidelines. And I don't like the idea of having one syntax for one's own scripts and another for generalizable code.

So I think we should keep support for df[x] / df[:colname] / df[[:colname]] etc. I'm fine with the view above we should stop supporting numeric indexing into the columns this way (e.g. df[2]) and boolean indexing (df[[true, false, false]]), but I think just keeping "pass symbols, get columns" for square brackets is unlikely to cause confusion. I think if we also had column names (as in pandas) I agree it might be confusing, but as symbols only refer to columns in DataFrames, I think it's pretty clear.

nalimilan commented 6 years ago

If we support df[:colname], we may as well support df[1]. It would be weird to reject integers for this syntax but not for df[:, 1], just because Pandas happens to do something completely weird. Also, the similarity with NamedTuple is appealing.

nickeubank commented 6 years ago

OK -- I'm totally ok with using square-brackets as "indexing into columns". I just meant I have stronger feelings about losing ability to use symbols than losing ability to do numeric indexing into columns. @nalimilan You've sold me on not doing something pandas-like with sometimes-row-indexing. :)

(EDITS: lots of sloppy typos)

pdeffebach commented 5 years ago

I've been playing around with a rowwise command that applies a function to each row of a dataframe, returning a vector of length nrow(df), like Stata's egen x = rowmean(v1 v2 ...).

With the way DataFrames is set up, it's difficult to make this performant, since we have to collect (maybe not literally with collect) each row, and rows may have heterogeneous types. mapslices, which acts on matrices, is very fast, on the other hand.

This is fine, because row-wise operations, while I think important enough to live in DataFrames, are relatively uncommon, and DataFrame's structure is well-optimized for column-oriented operations, which is the dominant use-case.

I guess my point is that if people expect something that acts on rows to be as easy and fast as mapslices, they are going to be frustrated. So it's better to have an API that differentiates itself more from generic matrix-like functions. In the end, this is just a vote for nrow and ncol instead of size, but the principle can apply more broadly.
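A minimal sketch of the kind of rowwise helper described above (rowwise is a hypothetical name, not a DataFrames.jl function; the naive row collection here is exactly the overhead being discussed):

```julia
using DataFrames, Statistics

# Hypothetical helper: apply f to the values of the given columns, row by row.
# Collecting each row into a Vector is the performance cost noted above.
function rowwise(f, df::AbstractDataFrame, cols)
    return [f([df[i, c] for c in cols]) for i in 1:nrow(df)]
end

df = DataFrame(v1 = 1:3, v2 = 4:6)
rowwise(mean, df, [:v1, :v2])  # like Stata's egen x = rowmean(v1 v2): [2.5, 3.5, 4.5]
```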