JuliaData / TypedTables.jl

Simple, fast, column-based storage for data analysis in Julia
Other
145 stars 25 forks source link

BUG - eachcol returns wrong columns #68

Closed juliohm closed 3 years ago

juliohm commented 3 years ago
using TypedTables

t = Table(a=[1,2,3], b=[4,5,6])

collect(eachcol(t))

1-element Vector{SubArray{NamedTuple{(:a, :b), Tuple{Int64, Int64}}, 1, Base.ReshapedArray{NamedTuple{(:a, :b), Tuple{Int64, Int64}}, 2, Table{NamedTuple{(:a, :b), Tuple{Int64, Int64}}, 1, NamedTuple{(:a, :b), Tuple{Vector{Int64}, Vector{Int64}}}}, Tuple{}}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}}:
 [(a = 1, b = 4), (a = 2, b = 5), (a = 3, b = 6)]

This is affecting downstream code where I used to use DataFrames.jl.

andyferris commented 3 years ago

Hi @juliohm.

As far as I was aware, columns from Tables.jl is the function you would want for iterating table columns. The eachcol funtion was added to Base in order to iterate columns of Matrix, similarly to eachrow. In typed tables we model a table as an AbstractArray so we must follow the AbstractArray interfaces exported by Base. Meanwhile Tables.jl exports rows and columns for these iterators. In fact, a typed table can be an AbstractMatrix and you may wish to iterate those columns/rows via eachcol and eachrow.

It seems that DataFrames.jl makes a pun and uses eachcol to get a DataFrameColumns object - yet columns(df) gives df. We should open an issue with them.

juliohm commented 3 years ago

Thank you @andyferris , makes a lot of sense.

andyferris commented 3 years ago

OK let's see what they say - https://github.com/JuliaData/DataFrames.jl/issues/2636

By the way - if you want to make a DataFrame from a TypedTables.Table, doesn't DataFrame(table) work?

juliohm commented 3 years ago

Yes, it works. I was just reporting a behavior I thought was strange. I was able to call Tables.matrix(t) to get back a simple matrix inside a calculation.

Em dom., 28 de fev. de 2021 às 18:11, Andy Ferris notifications@github.com escreveu:

Closed #68 https://github.com/JuliaData/TypedTables.jl/issues/68.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaData/TypedTables.jl/issues/68#event-4386542945, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZQW3KEGQTVMABPMG4QT6TTBKWRXANCNFSM4YLKLAAQ .

quinnj commented 3 years ago

columns from Tables.jl is the function you would want for iterating table columns.

This isn't completely accurate. As of the v1.0 Tables.jl release, Tables.columns(x) is required to return an object that implements Tables.columnnames(obj) and Tables.getcolumn(obj, nm::Symbol) and Tables.getcolumn(obj, i::Int), but not iteration. A convenience wrapper, Tables.Columns(x) can be used to turn any result of Tables.column into an object that supports iteration (over columns), getindex, and getproperty.

Tables.jl exports rows and columns

They technically aren't exports (not physically exported), though they are indeed critical functions of the public API.

yet columns(df) gives df

This is a valid implementation of the Tables.jl interface because, as mentioned above, it supports Tables.columnnames and Tables.getcolumn methods.

andyferris commented 3 years ago

Given Tables.columnnames and Tables.getcolumn isn't it trivial to define an iterator, though? What's the advantage of not requiring particular behavior from iterate?

Or more to the point - does there exist a function in Tables.jl exist the helps you iterate over columns of a table?

quinnj commented 3 years ago

does there exist a function in Tables.jl exist the helps you iterate over columns of a table

Yes, as mentioned, if you do cols = Tables.Columns(Tables.columns(input)), the Columns object you get back will iterate columns. It also supports operations like cols.col1, or cols[:col1] or cols[1], making it "dataframe-like" in a way. There have been a few discussions among JuliaData devs around simplifying so that Tables.Columns automatically calls Tables.columns for your underneath, since it seems quite redundant to users who aren't familiar with the distinction.

andyferris commented 3 years ago

Ok, I see, thanks Jacob. Have you considered having a generic function that returns an object with this stated behaviour, rather than a constructor for a particular type? Like the difference between view vs SubArray? It’s more interface based and generic that way.

E.g. by default return to users a Columns from columns, and have package authors implement either some simpler function (e.g. _columns) or the full columns as they prefer?

quinnj commented 3 years ago

Not really, but maybe it's worth considering. I kind of like keeping the distinction that Tables.columns returns only an object that satisfies the interface, while Tables.Columns would return a specific object that not only satisfied the interface, but was dispatchable and had additional convenient behavior.