Closed rofinn closed 4 years ago
This is breaking and also there are a lot of scenarios when this is used. However, I think we can do one of two things:
DataFrame(::AbstractMatrix)
and define only DataFrame(::Matrix)
quite safely, which seems to resolve your issues (but I do not really like it).istable
check, and if it returns true
use another code path (this would be OK)DataFrame(x::AbstractMatrix) = DataFrame(Tables.table(x))
and require that if matrices want to have a special handling then they override Tables.table
method (I think this would be cleanest and best)Now regarding AxisKeys.jl on master it fails to pass Tables.istable
test if an object (not type) is passed (which is a bug that should be fixed there). Also maybe it should have a method for Tables.table
for matrices? (so that the option 3. would work?).
What would be your preference here?
Yeah, I think (1) is just trading one special case for another which defeats the point. From a user perspective I'm largely indifferent with (2) and (3). If you prefer (3) then I'm fine with adding a Tables.table
method to AxisKeys.jl, as long as that's the intended use case. Currently, it only seems to be defined for AbstractMatrix
and returns MatrixTable
.
Now regarding AxisKeys.jl on master it fails to pass Tables.istable test if an object (not type) is passed (which is a bug that should be fixed there).
I don't think that's true. I think Tables.jl handles that case for you if you've already registered the type.
julia> typeof(C)
KeyedArray{Float64,3,NamedDimsArray{(:time, :loc, :id),Float64,3,Array{Float64,3}},Tuple{Array{Date,1},Array{Int64,1},Array{String,1}}}
julia> @which Tables.istable(C)
istable(x::T) where T in Tables at /Users/rory/.julia/packages/Tables/Eti9i/src/Tables.jl:255
OK - I will then go for option 2, so you do not have to do anything 😄.
Except, the problem I reported:
julia> x = KeyedArray(rand(Int8, 2,10), ([:a, :b], 10:10:100));
julia> using Tables
julia> Tables.istable(typeof(x))
true
julia> Tables.istable(x)
false
The reason is the following definition in Tables.jl:
istable(::AbstractMatrix) = false
which is not overriden in AxisKeys.jl.
Alright, thank you 👍 Again, I'm really not bothered if you'd prefer (3) from a maintainability standpoint. Oh, sorry, I didn't realize there was a special case fallback for specifically istable(::AbstractMatrix)
. Do you know why that definition is needed?
I just think @quinnj added it. I am OK with going for option (2) so let us go this way. I will ping you in the PR
It sounds a bit weird to do something completely different for some AbstractMatrix
inputs. The point of interfaces is that any AbstractMatrix
is supposed to behave the same... What if a function called DataFrame(x)
on a matrix provided by the user, and expected to get the equivalent of DataFrame(convert(Matrix, x))
?
EDIT: to complement this, KeyedArrays' README says
As for NamedDims.jl, the guiding idea is that every operation which could be done on ordinary arrays should still produce the same data, but propagate the extra information (names/keys), and error if it conflicts.
Also see previous discussion on similar issue in NamedArrays: https://github.com/davidavdav/NamedArrays.jl/issues/76.
@nalimilan - I think there are two issues:
Yes, right. I think your PR is right on its own (as I guess we can consider that the fact that columns names are x1
, etc. isn't part of the contract for AbstractMatrix
inputs, it's just a fallback when no better names are available). Though note that retaining column names while discarding row names isn't the most likely use case: it could be useful to have an argument to choose to copy row names to a column, which isn't possible to implement with the existing Tables.jl API AFAIK.
Now the problem regarding 2. is that AFAICT it's perfectly legitimate for AxisKeys to implement the Tables.jl interface as they do (reshaping), since Tables.jl doesn't treat AbstractMatrix
objects in general as tables. The problem is in DataFrames, with the interaction between our AbstractMatrix
constructors and our Tables.jl constructors. I don't see a good solution to this apart from deprecating DataFrame(::AbstractMatrix)
as @rofinn proposed. But that would be annoying.
we have Matrix(::AbstractDataFrame)
in DataFrames.jl. I think we should support AbstractMatrix
by default symmetrically.
Given your comment I would think that what I propose in #2435 is OK (i.e. by default do what we did in the past, but if some AbstractMatrix
type opts-in to be a Tables.jl table then we just do what this type specifies).
Sorry, but after thinking more about this I'm really not happy with the fact that DataFrame(x::KeyedMatrix)
would give a completely different result from DataFrame(x::Matrix)
. Imagine you have a KeyedMatrix
and you would expect to get the equivalent of DataFrame(Matrix(x))
. It would be really weird to give you another result -- and given my experience in R I think that's a reasonable use case. Note that ideally when we'll have agreed on the standard type for "named array", we will use KeyedArray
much more widely, like in R. And other types like sparse matrices might start being considered as tables (with columns i, j and value).
Consistency and predictability are among our strengths, so it's worth trying to preserve these. So I think we need a way to make our API unambiguous about the meaning of the input argument. I'm not sure yet what's the best solution though. It could be using Tables.table
or an argument to the constructors.
As I said, I think it would also be useful to have a way to insert a column containing the row names, and use the KeyedArray
column names without reshaping the data at all. This doesn't have to be supported right now but it's worth keeping this in mind.
So in short you say: let us deprecate DataFrame(:;AbstractMatrix)
and DataFrame(:;AbstractMatrix, colnames)
methods and require users to use Tables.table
wrapper (unless their AbstractMatrix
passes istable
when user can just create it). Is this correct? This is the approach Tables.jl uses. Do I get you correctly?
(note that also DataFrame(collect(eachcol(x)))
works and actually this would be a deprecation message)
If there is a consensus we prefer this (CC @oxinabox @pdeffebach) then I will not object, but this will create a small inconvenience.
For me - in general - it is natural that a matrix is converted to a data frame the way we do it currently.
I don't think of a matrix as collection of columns (or as a collection of rows) but rather as its own thing, without a natural division into components except its elements via linear indexing.
DataFrame(::AbstractMatrix)
is extra weird and i doubly any one really uses it beyond as a toy.
Since it jsut makes up random column names out of nothing
DataFrame(::AbstractMatrix, colnames)
is less weird but still weird because of the afformentioned reasons.
I think DataFrame(eachcol(matrix), colnames)
or DataFrame(colnames .=> eachcol(matrix)))
is much more natural.
Also naturally extends to eachrow
which is good if you are storing 1 column per observation (which can be good for performance reasons).
Fair enough. If there will be no more comments on this I will deprecate both methods (with and without column names passed).
In particular I agree that:
DataFrame(colnames .=> eachcol(matrix)))
and
DataFrame(collect(eachcol(matrix)))
are explicit.
Regarding the DataFrame(eachcol(matrix))
replacement, maybe DataFrame(Tables.table(matrix))
is easier to explain/understand than DataFrame(collect(eachcol(matrix)))
?
I don't have a strong opinion regarding DataFrame(::AbstractMatrix, colnames)
. DataFrame(eachcol(matrix), colnames)
would be a nice replacement, if we can widen the signature for the first argument.
if we can widen the signature for the first argument
We cannot widen the signature as Generator
does not have eltype
information and eachcol
unfortunately uses a generic generator so we cannot specialize on it (actually we might make a PR to Julia base to change this, as knowing that the object is actually generated by eachcol
would be nice. We could even think of making it an immutable <:AbstractVector
).
DataFrame(Tables.table(matrix))
is easier to explain/understand
It is, but it produces different column names, therefore the deprecation message should be with collect
.
I think DataFrame(matrix)
is useful. I certainly use it when debugging a lot, which isn't a good reason to keep it. With regards to uses, I definitely use it in dplyr
. Here is some code I'm working on now.
P_S_df = as_tibble(P_S, .name_repair = "unique")
names(P_S_df) = paste("P_S", 1:n_ind, sep = "_")
This comes from doing a bunch of calculations where everything is a matrix, for model fitting, and then turning the big matrices back to data-frames to report results. This seems like a common use-case, no?
We cannot widen the signature as
Generator
does not haveeltype
information andeachcol
unfortunately uses a generic generator so we cannot specialize on it (actually we might make a PR to Julia base to change this, as knowing that the object is actually generated byeachcol
would be nice. We could even think of making it an immutable<:AbstractVector
).
Yes, I was only thinking of allowing ::Any
when the second argument is a Vector{Symbol}
. But maybe that's too confusing. (I think having a special EachCol
or EachSlice
object in Base has been discussed, but since the design isn't completely settled yet the conclusion was that it was safer to use a generator for now so that people don't rely on a specific implementation.)
@pdeffebach Nobody is proposing to remove the ability to build a data frame from a matrix -- only to use a slightly more verbose syntax. Does that sounds OK to you? The dplyr example you give sounds quite verbose too. :-p
The only problem is that using Tables.table
requires the header
argument (which is kwarg) to be a vector of Vector{Symbol}
, while we accept strings and and AbstractVector
s.
The alternative route would be to say that DataFrame
constructor behaves like Tables.table
for matrices (this is essentially what we do), and say that if "by chance" an AbstractMatrix
implements Tables.jl interface you have to use DataFrame(Tables.columns(A))
- this is not super long either.
We have such a situation in other cases already, e.g.:
julia> DataFrame([:a => [1,2,3], :b => [4,5,6]])
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> DataFrame(Tables.columns([:a => [1,2,3], :b => [4,5,6]]))
2×2 DataFrame
│ Row │ first │ second │
│ │ Symbol │ Array… │
├─────┼────────┼───────────┤
│ 1 │ a │ [1, 2, 3] │
│ 2 │ b │ [4, 5, 6] │
EDIT: the question is what happens more often: wanting matrix to be converted into DataFrame
in a legacy way, or having a matrix that has some custom layout when converted to a DataFrame
?
If others are on board then I think this is okay. I'm wary of exposing a Tables.*
function to the user, though. The only other time we do this is when we recommend Tables.namedtupleiterator
, but that's only for "advanced" users looking for a faster eachrow
.
So I have a preference for collect(eachcol(mat))
The only problem is that using
Tables.table
requires theheader
argument (which is kwarg) to be a vector ofVector{Symbol}
, while we accept strings and andAbstractVector
s.
We could change that in Tables: accepting strings sounds reasonable. But as long as we don't use it for the deprecation it's not the end of the world if it's not a perfect replacement.
Anyway I agree that we should weigh the pros and cons of each solution, depending on which case is the most commonly needed. However, apart from that criterion, I think the most consistent approach is to treat as many inputs as possible as Tables.jl objects. This isn't possible in all cases as @bkamins showed, because of our basic constructors, but we can reduce to a minimum the number of exceptions, so that you know that when you have a Tables.jl object is will be treated as such, whatever its exact type.
so that you know that when you have a Tables.jl object is will be treated as such, whatever its exact type
Agreed, but still we will have "some" exceptions and the question is if AbstractMatrix
is a justifiable exception (also keeping in mind that it would be breaking to remove this support - so if we think it is an edge case - then maybe it is better to be non-breaking). In general - if someone wants to be safe using Tables.columns
is the way to ensure all is processed correctly.
DataFrame(colnames.=>eachcol(matrix))
and DataFrame(collect(eachcol(matrix)))
, and DataFrame(table)
seem like enough.
that seems like all the constructors we need are:
Either it is:
This are more or less fully distinct and clear
I don't think we need to mention using Tables.jl on a Matrix in the first place, since that is just a special case of the ubiquitious table sink constructor and would be assumed to work by anyone who even knows what tables are.
I don't think we need to mention using Tables.jl on a Matrix in the first place, since that is just a special case of the ubiquitious table sink constructor and would be assumed to work by anyone who even knows what tables are.
Thats a good point I hadn't thought of. We can just throw an error and people can be aware of what a table is and figure it out if they want a matrix. I'm on board.
a AbstractVector of columns
Constructing names automatically is annoying, so its nice to provide a way to construct a data frame where you dont care about the names.
Let me give you a full menagerie that we support. I divided it into three groups:
FWIW, that seems reasonable to me... though I mostly only use the pairs and tables constructors. Shouldn't this constructor be addressed by the tables interface?
DataFrame(x::AbstractArray{NamedTuple{names,T},1}; copycols) where {names, T} # vector of named tuples
Isn't that just a rowtable?
Yes; that is indeed handled by Tables.jl interface; it's just needed to avoid dispatching to some of the other ::AbstractVector
constructors
Thanks for the list. So in the "disputable" list we have a series of constructors that accept tuples, similar to constructors that accept vectors, right? Not sure how useful they are but they probably don't do much harm either.
AFAICT, the only ones conflicting with the interpretation of the input as a Table are the AbstractMatrix
ones that we're discussing here, right?
I would deprecate all the ones in the "maybe" category except for Dict
.
So my thinking and understanding is:
AbstractMatrix
constructors and we could drop them (it would be great to have an agreement to have https://github.com/JuliaData/Tables.jl/issues/199 then as current signature of Tables.table
is too restrictive in comparison to what we allow in DataFrames.jl)Tuple
constructors in "to drop" as no one ever asked me about them (so probably they are not useful and it is easy enough to convert a tuple to a vector if really needed); I would vote to remove them then, as it will simplify the API (list of things to learn and maintain)eltype
constructors are a different case. Actually here I regularly get questions how to create an uninitialized data frame with certain eltypes of columns (people populate these columns afterwards); one can of course write e.g. DataFrame([T[] for T in [Int, Float64, Char]])
but I am not sure if this is obvious for a typical userSo I feel that only 3. is disputable (and that is why I put it in the second group). Can you give some more feedback here given these considerations please?
I would get rid of the eltype one also. The more constructors we get rid of means there are less to look for.
I think DataFrame(cost=Decimal[],. product=String[], weight=Float64[])
Is clearest. But also
df=DataFrames()
push!(df, (cost=dec"1.20", product="drink", weight=550.0))
works (right?) and often is what they really wanted.
works (right?) and often is what they really wanted.
works, we call it pseudo-broadcasting, https://bkamins.github.io/julialang/2020/09/13/pseudobroadcasting.html.
OK - so I will wait for 1-2 weeks for more feedback. If nothing changes we also prune the eltype
constructors (I agree they are not idiomatic to use).
See https://github.com/JuliaData/DataFrames.jl/pull/2464 for a fix.
I feel like it's rare that folks actually want to call this constructor without specifying column names. Also, it isn't that hard to do this with the tables interface via
DataFrame(Tables.table(X))
.The benefit with dropping this constructor is that it would allow more inputs to fall back to the tables constructor, including array types that implement the tables interface like AxisKeys.jl. Currently, you need to special case
DataFrame(A)
toDataFrame(Tables.columns(A))
only whenA
is 2 dimensional.