JuliaData / TypedTables.jl

Simple, fast, column-based storage for data analysis in Julia

Issue with Tables.getcolumn by index #86

Open sefffal opened 2 years ago

sefffal commented 2 years ago

Accessing columns through Tables.getcolumn(table, name::Symbol) works as expected, but using Tables.getcolumn(table, ind::Int) does not.

Setup:

using Tables, TypedTables
table = Table(a=rand(300), b=rand(300))
table_nt = (;a=rand(300), b=rand(300))

Expected behaviour:

julia> Tables.getcolumn(table_nt, 2)
300-element Vector{Float64}:
 0.7419591651104771
 0.03643357962428917
 0.511973946658012
 0.7525280472737248
 0.5312671306022833
...

This works with simple named tuples of vectors, as well as DataFrames.

Observed behaviour:

julia> Tables.getcolumn(table, 2)
ERROR: BoundsError: attempt to access 300-element Table{NamedTuple{(:a, :b), Tuple{Float64, Float64}}, 1, NamedTuple{(:a, :b), Tuple{Vector{Float64}, Vector{Float64}}}} at index [2]
Stacktrace:
 [1] getcolumn(x::Table{NamedTuple{(:a, :b), Tuple{Float64, Float64}}, 1, NamedTuple{(:a, :b), Tuple{Vector{Float64}, Vector{Float64}}}}, i::Int64)
   @ Tables C:\Users\William\.julia\packages\Tables\i6z2B\src\Tables.jl:101
 [2] top-level scope
   @ REPL[65]:1

However, index 1 does not error: it returns the whole NamedTuple of columns instead of column a, which is not useful:

julia> Tables.getcolumn(table, 1)
(a = [0.7736170160574704, 0.32973335588180575, 0.17889965718253964, 0.7631323090473862, 0.7800224219389631, 0.08040930668634005, 0.9557133954558753, 0.9979396219551491, 0.15894660237894975, 0.5680381167378448  …  0.6559116874983786, 0.7328418210533515, 0.4856581423782824, 0.33251283450523117, 0.08142486970852292, 0.2259648695642409, 0.39396960265088865, 0.7031534405558856, 0.10224220322748001, 0.14191199646807617], b = [0.017236706415861724, 0.5265418832740683, 0.4268344997706731, 0.46470458360887146, 0.8360733105726028, 0.6032125887699785, 0.9385924928402325, 0.7405311692330161, 0.4201266483743147, 0.9833490878965103  …  0.14241236909936195, 0.29289242214548683, 0.8408873927907317, 0.7439831490645507, 0.6205302905751314, 0.9686022965164416, 0.8139530289474524, 0.823492626767103, 0.04273546220284152, 0.44406075204392326])

Accessing by column name :a or :b works as expected.

Thanks!

andyferris commented 2 years ago

@quinnj any advice on this one?

quinnj commented 2 years ago

In the official "usage" of the Tables.jl interface, you're only guaranteed to be able to call Tables.getcolumn on either: 1) the object returned from Tables.columns(x), or 2) each iterated element of the object returned by Tables.rows(x). For DataFrames.jl and a NamedTuple of vectors, the objects themselves happen to be returned from Tables.columns, but for Table that's not the case. So if you do tbl = Tables.columns(table) first, you can expect to call Tables.getcolumn on the result.
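
A minimal sketch of the two supported call patterns described above, reusing the table from the original report. The variable names cols, row, col_b and b_first are illustrative only, and the sketch assumes that Tables.columns(table) returns an object that supports integer indexing through Tables.getcolumn, as the interface guarantees.

```julia
using Tables, TypedTables

table = Table(a=rand(300), b=rand(300))

# Pattern 1: go through Tables.columns first, then index by position.
cols = Tables.columns(table)
col_b = Tables.getcolumn(cols, 2)    # second column, i.e. :b

# Pattern 2: iterate Tables.rows and call getcolumn on each row.
row = first(Tables.rows(table))
b_first = Tables.getcolumn(row, 2)   # value of :b in the first row
```

Under these assumptions, col_b should hold the same values as table.b, and b_first should equal table.b[1].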

andyferris commented 2 years ago

I see.

Is it good practice to extend some of these methods and opt into common behaviour? Or is it preferable to let users use the columns function?

quinnj commented 2 years ago

All up to you; users of the Tables.jl API just need to make sure they follow the guidelines, which admittedly aren't the absolute most convenient form, but are really meant for "sink" authors in the end.
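
For readers wondering what "opting in" could look like, here is a hedged sketch using a hypothetical ColumnStore type (not part of TypedTables.jl or Tables.jl). It returns itself from Tables.columns and defines Tables.getcolumn for both Symbol and Int, so direct calls on the object fall within the guarantee described earlier in the thread. Whether TypedTables.jl should do something similar for Table is the open question here; this is not its implementation.

```julia
using Tables

# Hypothetical column-store type, used only for illustration.
struct ColumnStore{C<:NamedTuple}
    data::C
end

Tables.istable(::Type{<:ColumnStore}) = true
Tables.columnaccess(::Type{<:ColumnStore}) = true

# Returning the object itself means Tables.getcolumn may be called on it directly.
Tables.columns(t::ColumnStore) = t

# Column accessors forwarded to the wrapped NamedTuple.
Tables.columnnames(t::ColumnStore) = keys(getfield(t, :data))
Tables.getcolumn(t::ColumnStore, nm::Symbol) = getfield(getfield(t, :data), nm)
Tables.getcolumn(t::ColumnStore, i::Int) = getfield(getfield(t, :data), i)

store = ColumnStore((a = [1.0, 2.0], b = [3.0, 4.0]))
second = Tables.getcolumn(store, 2)   # the :b column
```

The design choice is that the type acts as its own columns object, so generic consumers that call Tables.columns first and users who call Tables.getcolumn directly see the same behaviour.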