JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/
Other
1.74k stars 367 forks source link

Deprecate `DataFrame(::AbstractMatrix)` #2433

Closed rofinn closed 4 years ago

rofinn commented 4 years ago

I feel like it's rare that folks actually want to call this constructor without specifying column names. Also, it isn't that hard to do this with the tables interface via DataFrame(Tables.table(X)).

julia> DataFrame(Tables.table(x))
10×5 DataFrame
│ Row │ Column1  │ Column2  │ Column3   │ Column4  │ Column5   │
│     │ Float64  │ Float64  │ Float64   │ Float64  │ Float64   │
├─────┼──────────┼──────────┼───────────┼──────────┼───────────┤
│ 1   │ 0.829388 │ 0.100101 │ 0.733667  │ 0.591796 │ 0.538582  │
│ 2   │ 0.573906 │ 0.402706 │ 0.41509   │ 0.899583 │ 0.482383  │
│ 3   │ 0.703773 │ 0.729495 │ 0.375752  │ 0.782216 │ 0.425253  │
│ 4   │ 0.5717   │ 0.93305  │ 0.0922584 │ 0.211547 │ 0.123723  │
│ 5   │ 0.927349 │ 0.654163 │ 0.11306   │ 0.452749 │ 0.921814  │
│ 6   │ 0.514157 │ 0.891617 │ 0.481489  │ 0.368915 │ 0.718185  │
│ 7   │ 0.717906 │ 0.627349 │ 0.371058  │ 0.958308 │ 0.925878  │
│ 8   │ 0.104714 │ 0.995778 │ 0.671143  │ 0.410917 │ 0.547495  │
│ 9   │ 0.373802 │ 0.147139 │ 0.425674  │ 0.893018 │ 0.937846  │
│ 10  │ 0.26242  │ 0.30546  │ 0.750857  │ 0.196515 │ 0.0470533 │

julia> DataFrame(x)
10×5 DataFrame
│ Row │ x1       │ x2       │ x3        │ x4       │ x5        │
│     │ Float64  │ Float64  │ Float64   │ Float64  │ Float64   │
├─────┼──────────┼──────────┼───────────┼──────────┼───────────┤
│ 1   │ 0.829388 │ 0.100101 │ 0.733667  │ 0.591796 │ 0.538582  │
│ 2   │ 0.573906 │ 0.402706 │ 0.41509   │ 0.899583 │ 0.482383  │
│ 3   │ 0.703773 │ 0.729495 │ 0.375752  │ 0.782216 │ 0.425253  │
│ 4   │ 0.5717   │ 0.93305  │ 0.0922584 │ 0.211547 │ 0.123723  │
│ 5   │ 0.927349 │ 0.654163 │ 0.11306   │ 0.452749 │ 0.921814  │
│ 6   │ 0.514157 │ 0.891617 │ 0.481489  │ 0.368915 │ 0.718185  │
│ 7   │ 0.717906 │ 0.627349 │ 0.371058  │ 0.958308 │ 0.925878  │
│ 8   │ 0.104714 │ 0.995778 │ 0.671143  │ 0.410917 │ 0.547495  │
│ 9   │ 0.373802 │ 0.147139 │ 0.425674  │ 0.893018 │ 0.937846  │
│ 10  │ 0.26242  │ 0.30546  │ 0.750857  │ 0.196515 │ 0.0470533 │

The benefit with dropping this constructor is that it would allow more inputs to fall back to the tables constructor, including array types that implement the tables interface like AxisKeys.jl. Currently, you need to special case DataFrame(A) to DataFrame(Tables.columns(A)) only when A is 2 dimensional.

bkamins commented 4 years ago

This is breaking and also there are a lot of scenarios when this is used. However, I think we can do one of two things:

  1. deprecate DataFrame(::AbstractMatrix) and define only DataFrame(::Matrix) quite safely, which seems to resolve your issues (but I do not really like it).
  2. in the constructor do istable check, and if it returns true use another code path (this would be OK)
  3. define DataFrame(x::AbstractMatrix) = DataFrame(Tables.table(x)) and require that if matrices want to have a special handling then they override Tables.table method (I think this would be cleanest and best)

Now regarding AxisKeys.jl on master it fails to pass Tables.istable test if an object (not type) is passed (which is a bug that should be fixed there). Also maybe it should have a method for Tables.table for matrices? (so that the option 3. would work?).

What would be your preference here?

rofinn commented 4 years ago

Yeah, I think (1) is just trading one special case for another which defeats the point. From a user perspective I'm largely indifferent with (2) and (3). If you prefer (3) then I'm fine with adding a Tables.table method to AxisKeys.jl, as long as that's the intended use case. Currently, it only seems to be defined for AbstractMatrix and returns MatrixTable.

Now regarding AxisKeys.jl on master it fails to pass Tables.istable test if an object (not type) is passed (which is a bug that should be fixed there).

I don't think that's true. I think Tables.jl handles that case for you if you've already registered the type.

julia> typeof(C)
KeyedArray{Float64,3,NamedDimsArray{(:time, :loc, :id),Float64,3,Array{Float64,3}},Tuple{Array{Date,1},Array{Int64,1},Array{String,1}}}

julia> @which Tables.istable(C)
istable(x::T) where T in Tables at /Users/rory/.julia/packages/Tables/Eti9i/src/Tables.jl:255
bkamins commented 4 years ago

OK - I will then go for option 2, so you do not have to do anything 😄.

Except, the problem I reported:

julia> x = KeyedArray(rand(Int8, 2,10), ([:a, :b], 10:10:100));

julia> using Tables

julia> Tables.istable(typeof(x))
true

julia> Tables.istable(x)
false

The reason is the following definition in Tables.jl:

istable(::AbstractMatrix) = false

which is not overriden in AxisKeys.jl.

rofinn commented 4 years ago

Alright, thank you 👍 Again, I'm really not bothered if you'd prefer (3) from a maintainability standpoint. Oh, sorry, I didn't realize there was a special case fallback for specifically istable(::AbstractMatrix). Do you know why that definition is needed?

bkamins commented 4 years ago

I just think @quinnj added it. I am OK with going for option (2) so let us go this way. I will ping you in the PR

nalimilan commented 4 years ago

It sounds a bit weird to do something completely different for some AbstractMatrix inputs. The point of interfaces is that any AbstractMatrix is supposed to behave the same... What if a function called DataFrame(x) on a matrix provided by the user, and expected to get the equivalent of DataFrame(convert(Matrix, x))?

EDIT: to complement this, KeyedArrays' README says

As for NamedDims.jl, the guiding idea is that every operation which could be done on ordinary arrays should still produce the same data, but propagate the extra information (names/keys), and error if it conflicts.

Also see previous discussion on similar issue in NamedArrays: https://github.com/davidavdav/NamedArrays.jl/issues/76.

bkamins commented 4 years ago

@nalimilan - I think there are two issues:

  1. what to do in DataFrames.jl - I think it is OK, to do what I proposed as it allows to populate column names
  2. what KeyedArrays.jl should do (but this is a separate discussion for KeyedArrays.jl): it should not reshape the matrix, but pass it as a matrix and just set the column names properly
nalimilan commented 4 years ago

Yes, right. I think your PR is right on its own (as I guess we can consider that the fact that columns names are x1, etc. isn't part of the contract for AbstractMatrix inputs, it's just a fallback when no better names are available). Though note that retaining column names while discarding row names isn't the most likely use case: it could be useful to have an argument to choose to copy row names to a column, which isn't possible to implement with the existing Tables.jl API AFAIK.

Now the problem regarding 2. is that AFAICT it's perfectly legitimate for AxisKeys to implement the Tables.jl interface as they do (reshaping), since Tables.jl doesn't treat AbstractMatrix objects in general as tables. The problem is in DataFrames, with the interaction between our AbstractMatrix constructors and our Tables.jl constructors. I don't see a good solution to this apart from deprecating DataFrame(::AbstractMatrix) as @rofinn proposed. But that would be annoying.

bkamins commented 4 years ago

we have Matrix(::AbstractDataFrame) in DataFrames.jl. I think we should support AbstractMatrix by default symmetrically.

Given your comment I would think that what I propose in #2435 is OK (i.e. by default do what we did in the past, but if some AbstractMatrix type opts-in to be a Tables.jl table then we just do what this type specifies).

nalimilan commented 4 years ago

Sorry, but after thinking more about this I'm really not happy with the fact that DataFrame(x::KeyedMatrix) would give a completely different result from DataFrame(x::Matrix). Imagine you have a KeyedMatrix and you would expect to get the equivalent of DataFrame(Matrix(x)). It would be really weird to give you another result -- and given my experience in R I think that's a reasonable use case. Note that ideally when we'll have agreed on the standard type for "named array", we will use KeyedArray much more widely, like in R. And other types like sparse matrices might start being considered as tables (with columns i, j and value).

Consistency and predictability are among our strengths, so it's worth trying to preserve these. So I think we need a way to make our API unambiguous about the meaning of the input argument. I'm not sure yet what's the best solution though. It could be using Tables.table or an argument to the constructors.

As I said, I think it would also be useful to have a way to insert a column containing the row names, and use the KeyedArray column names without reshaping the data at all. This doesn't have to be supported right now but it's worth keeping this in mind.

bkamins commented 4 years ago

So in short you say: let us deprecate DataFrame(:;AbstractMatrix) and DataFrame(:;AbstractMatrix, colnames) methods and require users to use Tables.table wrapper (unless their AbstractMatrix passes istable when user can just create it). Is this correct? This is the approach Tables.jl uses. Do I get you correctly?

(note that also DataFrame(collect(eachcol(x))) works and actually this would be a deprecation message)

If there is a consensus we prefer this (CC @oxinabox @pdeffebach) then I will not object, but this will create a small inconvenience.

For me - in general - it is natural that a matrix is converted to a data frame the way we do it currently.

oxinabox commented 4 years ago

I don't think of a matrix as collection of columns (or as a collection of rows) but rather as its own thing, without a natural division into components except its elements via linear indexing.

DataFrame(::AbstractMatrix) is extra weird and i doubly any one really uses it beyond as a toy. Since it jsut makes up random column names out of nothing

DataFrame(::AbstractMatrix, colnames) is less weird but still weird because of the afformentioned reasons. I think DataFrame(eachcol(matrix), colnames) or DataFrame(colnames .=> eachcol(matrix))) is much more natural. Also naturally extends to eachrow which is good if you are storing 1 column per observation (which can be good for performance reasons).

bkamins commented 4 years ago

Fair enough. If there will be no more comments on this I will deprecate both methods (with and without column names passed).

In particular I agree that:

DataFrame(colnames .=> eachcol(matrix)))

and

DataFrame(collect(eachcol(matrix)))

are explicit.

nalimilan commented 4 years ago

Regarding the DataFrame(eachcol(matrix)) replacement, maybe DataFrame(Tables.table(matrix)) is easier to explain/understand than DataFrame(collect(eachcol(matrix)))?

I don't have a strong opinion regarding DataFrame(::AbstractMatrix, colnames). DataFrame(eachcol(matrix), colnames) would be a nice replacement, if we can widen the signature for the first argument.

bkamins commented 4 years ago

if we can widen the signature for the first argument

We cannot widen the signature as Generator does not have eltype information and eachcol unfortunately uses a generic generator so we cannot specialize on it (actually we might make a PR to Julia base to change this, as knowing that the object is actually generated by eachcol would be nice. We could even think of making it an immutable <:AbstractVector).

DataFrame(Tables.table(matrix)) is easier to explain/understand

It is, but it produces different column names, therefore the deprecation message should be with collect.

pdeffebach commented 4 years ago

I think DataFrame(matrix) is useful. I certainly use it when debugging a lot, which isn't a good reason to keep it. With regards to uses, I definitely use it in dplyr. Here is some code I'm working on now.

  P_S_df = as_tibble(P_S, .name_repair = "unique")
  names(P_S_df) = paste("P_S", 1:n_ind, sep = "_")

This comes from doing a bunch of calculations where everything is a matrix, for model fitting, and then turning the big matrices back to data-frames to report results. This seems like a common use-case, no?

nalimilan commented 4 years ago

We cannot widen the signature as Generator does not have eltype information and eachcol unfortunately uses a generic generator so we cannot specialize on it (actually we might make a PR to Julia base to change this, as knowing that the object is actually generated by eachcol would be nice. We could even think of making it an immutable <:AbstractVector).

Yes, I was only thinking of allowing ::Any when the second argument is a Vector{Symbol}. But maybe that's too confusing. (I think having a special EachCol or EachSlice object in Base has been discussed, but since the design isn't completely settled yet the conclusion was that it was safer to use a generator for now so that people don't rely on a specific implementation.)

@pdeffebach Nobody is proposing to remove the ability to build a data frame from a matrix -- only to use a slightly more verbose syntax. Does that sounds OK to you? The dplyr example you give sounds quite verbose too. :-p

bkamins commented 4 years ago

The only problem is that using Tables.table requires the header argument (which is kwarg) to be a vector of Vector{Symbol}, while we accept strings and and AbstractVectors.

The alternative route would be to say that DataFrame constructor behaves like Tables.table for matrices (this is essentially what we do), and say that if "by chance" an AbstractMatrix implements Tables.jl interface you have to use DataFrame(Tables.columns(A)) - this is not super long either.

We have such a situation in other cases already, e.g.:

julia> DataFrame([:a => [1,2,3], :b => [4,5,6]])
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 6     │

julia> DataFrame(Tables.columns([:a => [1,2,3], :b => [4,5,6]]))
2×2 DataFrame
│ Row │ first  │ second    │
│     │ Symbol │ Array…    │
├─────┼────────┼───────────┤
│ 1   │ a      │ [1, 2, 3] │
│ 2   │ b      │ [4, 5, 6] │

EDIT: the question is what happens more often: wanting matrix to be converted into DataFrame in a legacy way, or having a matrix that has some custom layout when converted to a DataFrame?

pdeffebach commented 4 years ago

If others are on board then I think this is okay. I'm wary of exposing a Tables.* function to the user, though. The only other time we do this is when we recommend Tables.namedtupleiterator, but that's only for "advanced" users looking for a faster eachrow.

So I have a preference for collect(eachcol(mat))

nalimilan commented 4 years ago

The only problem is that using Tables.table requires the header argument (which is kwarg) to be a vector of Vector{Symbol}, while we accept strings and and AbstractVectors.

We could change that in Tables: accepting strings sounds reasonable. But as long as we don't use it for the deprecation it's not the end of the world if it's not a perfect replacement.

Anyway I agree that we should weigh the pros and cons of each solution, depending on which case is the most commonly needed. However, apart from that criterion, I think the most consistent approach is to treat as many inputs as possible as Tables.jl objects. This isn't possible in all cases as @bkamins showed, because of our basic constructors, but we can reduce to a minimum the number of exceptions, so that you know that when you have a Tables.jl object is will be treated as such, whatever its exact type.

bkamins commented 4 years ago

so that you know that when you have a Tables.jl object is will be treated as such, whatever its exact type

Agreed, but still we will have "some" exceptions and the question is if AbstractMatrix is a justifiable exception (also keeping in mind that it would be breaking to remove this support - so if we think it is an edge case - then maybe it is better to be non-breaking). In general - if someone wants to be safe using Tables.columns is the way to ensure all is processed correctly.

oxinabox commented 4 years ago

DataFrame(colnames.=>eachcol(matrix)) and DataFrame(collect(eachcol(matrix))), and DataFrame(table) seem like enough. that seems like all the constructors we need are: Either it is:

This are more or less fully distinct and clear

I don't think we need to mention using Tables.jl on a Matrix in the first place, since that is just a special case of the ubiquitious table sink constructor and would be assumed to work by anyone who even knows what tables are.

pdeffebach commented 4 years ago

I don't think we need to mention using Tables.jl on a Matrix in the first place, since that is just a special case of the ubiquitious table sink constructor and would be assumed to work by anyone who even knows what tables are.

Thats a good point I hadn't thought of. We can just throw an error and people can be aware of what a table is and figure it out if they want a matrix. I'm on board.

a AbstractVector of columns

Constructing names automatically is annoying, so its nice to provide a way to construct a data frame where you dont care about the names.

bkamins commented 4 years ago

Let me give you a full menagerie that we support. I divided it into three groups:

Will stay

Can stay (rarely useful, but do not hurt)

Last time to decide to drop it (these are most disputable)

rofinn commented 4 years ago

FWIW, that seems reasonable to me... though I mostly only use the pairs and tables constructors. Shouldn't this constructor be addressed by the tables interface?

DataFrame(x::AbstractArray{NamedTuple{names,T},1}; copycols) where {names, T} # vector of named tuples

Isn't that just a rowtable?

quinnj commented 4 years ago

Yes; that is indeed handled by Tables.jl interface; it's just needed to avoid dispatching to some of the other ::AbstractVector constructors

nalimilan commented 4 years ago

Thanks for the list. So in the "disputable" list we have a series of constructors that accept tuples, similar to constructors that accept vectors, right? Not sure how useful they are but they probably don't do much harm either.

AFAICT, the only ones conflicting with the interpretation of the input as a Table are the AbstractMatrix ones that we're discussing here, right?

pdeffebach commented 4 years ago

I would deprecate all the ones in the "maybe" category except for Dict.

bkamins commented 4 years ago

So my thinking and understanding is:

  1. no one really sees big usefulness of AbstractMatrix constructors and we could drop them (it would be great to have an agreement to have https://github.com/JuliaData/Tables.jl/issues/199 then as current signature of Tables.table is too restrictive in comparison to what we allow in DataFrames.jl)
  2. I put Tuple constructors in "to drop" as no one ever asked me about them (so probably they are not useful and it is easy enough to convert a tuple to a vector if really needed); I would vote to remove them then, as it will simplify the API (list of things to learn and maintain)
  3. The eltype constructors are a different case. Actually here I regularly get questions how to create an uninitialized data frame with certain eltypes of columns (people populate these columns afterwards); one can of course write e.g. DataFrame([T[] for T in [Int, Float64, Char]]) but I am not sure if this is obvious for a typical user
  4. Other constructors stay

So I feel that only 3. is disputable (and that is why I put it in the second group). Can you give some more feedback here given these considerations please?

oxinabox commented 4 years ago

I would get rid of the eltype one also. The more constructors we get rid of means there are less to look for.

I think DataFrame(cost=Decimal[],. product=String[], weight=Float64[])

Is clearest. But also

df=DataFrames()
push!(df, (cost=dec"1.20", product="drink", weight=550.0))

works (right?) and often is what they really wanted.

bkamins commented 4 years ago

works (right?) and often is what they really wanted.

works, we call it pseudo-broadcasting, https://bkamins.github.io/julialang/2020/09/13/pseudobroadcasting.html.


OK - so I will wait for 1-2 weeks for more feedback. If nothing changes we also prune the eltype constructors (I agree they are not idiomatic to use).

bkamins commented 4 years ago

See https://github.com/JuliaData/DataFrames.jl/pull/2464 for a fix.