JuliaData / IndexedTables.jl

Flexible tables with ordered indices
https://juliadb.org
MIT License

IndexedTable from TextParse.csvread output #92

Open tcovert opened 6 years ago

tcovert commented 6 years ago

I couldn't find an example in the documentation for how one builds an IndexedTable, with column names, from the output of TextParse.csvread. This appears to work (took me a while...):

using TextParse, IndexedTables

tp = csvread("file.csv")  # returns a tuple: (columns, column names)
t = table(tp[1]..., names = map(Symbol, tp[2]))

I would have thought something like table(csvread("file.csv")) or even table(Columns(csvread("file.csv"))) would work but both give an error like:

julia> table(TextParse.csvread("file.csv", escapechar='"', nastrings = vcat(TextParse.NA_STRINGS, "Confidential")))
ERROR: MethodError: no method matching _impl(::Tuple{Array{Int64,1},Array{Int64,1},Array{String,1},Array{String,1},Array{String,1},Array{String,1},Array{String,1},Array{String,1},DataValues.DataValueArray{Date,1},DataValues.DataValueArray{Int64,1},Array{String,1},Array{String,1},Array{String,1},Array{Int64,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{String,1},Array{String,1},Array{String,1},Array{String,1},Array{String,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},DataValues.DataValueArray{Float64,1},DataValues.DataValueArray{Float64,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},DataValues.DataValueArray{Int64,1},DataValues.DataValueArray{Date,1}}, ::Array{String,1})

I guess this isn't a bug, but if the goal is for the first syntax above to be the recommended syntax, it might be helpful for there to be an example in the docstrings...
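For readers who land here with the same question, the workaround above can be wrapped in a small helper. This is only a sketch: `table_from_csv` is a hypothetical name, not part of TextParse or IndexedTables, and it assumes the API as used in this thread, where `csvread` returns a `(columns, names)` tuple and `table` accepts a `names` keyword.

```julia
using TextParse, IndexedTables

# Hypothetical helper, not provided by either package.
# Extra keyword arguments (e.g. escapechar, nastrings) are forwarded to csvread.
function table_from_csv(path; kwargs...)
    # csvread returns the parsed columns and the header strings as a tuple
    cols, colnames = csvread(path; kwargs...)
    # table expects Symbol column names, so convert the header strings
    return table(cols...; names = map(Symbol, colnames))
end
```

With this in place, `table_from_csv("file.csv", escapechar='"')` should be equivalent to the two-step version, assuming those package versions.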

davidanthoff commented 6 years ago

I'll fix the iterable tables integration story for the new style API here. Then this

load("file.csv") |> table
# or
table(load("file.csv"))

should work.

I've also been sitting on a small extension to TableTraits.jl that will remove the perf overhead of going through the iterable tables system, i.e. at that point it should give the raw performance of TextParse (which is used by the FileIO story under the hood).

davidanthoff commented 6 years ago

Ha, I just realized that this new NextTable is already an iterable table source, without any integration code, simply because it iterates named tuples! Very nice. So all of the following already works:

t = table(...)

# File IO
save("file.csv", t)
save("file.feather", t)
t |> save("file.csv")
t |> save("file.feather")

# Convert to other table structure
df = t |> DataFrames.DataFrame # This pipe syntax should work for all constructors of table types
df = DataFrames.DataFrame(t)
tt = TypedTables.TypedTable(t)
Pandas.DataFrame(t)
TimeSeries.TimeArray(t)
Temporal.TS(t)

# Run a regression
lm(@formula(Children ~ Age), t)

# Plot with Gadfly
plot(t, x=:Age, y=:Children, Geom.line)

# Plot with StatPlots
@df t plot(:Age, :Children)

# Plot with VegaLite
# (forgot the syntax ;)

And no, I haven't tried, but it really should just work. I hope those aren't famous last words.

shashi commented 6 years ago

There was some discussion about this on the data channel on Slack. If we ever move Columns out of this package, then I'll make it so that TextParse.csvread returns Columns -- then this should just work. For now, the IterableTables interface sounds fine.

davidanthoff commented 6 years ago

The table traits extension I’ve mentioned will actually end up just passing the arrays from TextParse to tables directly without any iteration, so at that point it should add hardly any overhead at all. It should all be completely transparent in terms of the user-facing API, i.e. the code from above will be the same, just much faster. It probably also won’t require any code changes in IndexedTables either.