Closed: quinnj closed this issue 9 years ago
I've got a ton of datasets that I can send over. I particularly like optimizing against the files Wes used for Pandas: http://wesmckinney.com/blog/a-new-high-performance-memory-efficient-file-parser-engine-for-pandas/
I'd appreciate any and all test files people have access to and are willing to share. Feel free to share a link or email me personally.
Great find @ihnorton, I'll definitely do some benchmarking vs. that site.
Sent you an e-mail, @quinnj, with some of the bigger datasets I've been profiling against.
Here's a discussion on readdlm that showed up on my LinkedIn feed: https://www.linkedin.com/grp/post/5144163-5836134323916398595
The default return type of CSV.read is Data.Table, which is a very thin wrapper type around any kind of data argument (by default a Vector{NullableVector}, but it could be a DataFrame, Vector{Vector}, Matrix, etc.). Things are also pretty composable through the DataStreams framework to support any number of "sink" types; e.g. the SQLite package supports parsing a CSV file directly to an SQLite table, or vice versa. What's nice is that CSV is currently composable enough that writing a new "sink" type usually doesn't require more than 10-15 lines of code. Indeed, the code for parsing a CSV.Source into a Vector{NullableVector} is:
    function getfield!{T}(io::IOBuffer, dest::NullableVector{T}, ::Type{T}, opts, row, col)
        @inbounds val, null = CSV.getfield(io, T, opts, row, col) # row + datarow
        @inbounds dest.values[row], dest.isnull[row] = val, null
        return
    end

    function Data.stream!(source::CSV.Source, sink::Data.Table)
        rows, cols = size(source)
        types = Data.types(source)
        for row = 1:rows, col = 1:cols
            @inbounds T = types[col]
            CSV.getfield!(source.data, Data.unsafe_column(sink, col, T), T, source.options, row, col) # row + datarow
        end
        return sink
    end
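For anyone who wants to see the shape of that pattern without the packages installed, here is a minimal, self-contained sketch in current Julia syntax. Everything in it (ToySource, ColumnSink, parsefield, stream!) is a hypothetical stand-in made up for illustration, not the CSV or DataStreams API, and it assumes numeric columns only, with an empty field treated as null.

```julia
# Hypothetical stand-ins for illustration; these are NOT CSV.Source or Data.Table.
struct ToySource
    cells::Matrix{String}      # raw text fields, rows x cols
    types::Vector{DataType}    # element type of each column (numeric only here)
end

Base.size(s::ToySource) = size(s.cells)

# Parse one field as T. Assumption: an empty field means "null",
# in which case a placeholder zero(T) is returned with the null flag set.
function parsefield(s::ToySource, ::Type{T}, row, col) where {T}
    raw = s.cells[row, col]
    isempty(raw) && return (zero(T), true)
    return (parse(T, raw), false)
end

# A sink holding, per column, a values vector and a null mask
# (the same values/isnull split that the snippet above writes into).
struct ColumnSink
    values::Vector{Vector}
    isnull::Vector{Vector{Bool}}
end

function ColumnSink(s::ToySource)
    rows, _ = size(s)
    vals = Vector[Vector{T}(undef, rows) for T in s.types]
    nulls = [fill(false, rows) for _ in s.types]
    return ColumnSink(vals, nulls)
end

# The whole "sink" implementation: walk the fields and store each one.
function stream!(source::ToySource, sink::ColumnSink)
    rows, cols = size(source)
    for row in 1:rows, col in 1:cols
        T = source.types[col]
        val, null = parsefield(source, T, row, col)
        sink.values[col][row] = val
        sink.isnull[col][row] = null
    end
    return sink
end

# Two rows, two columns; the first column of row 2 is empty (null).
src = ToySource(["1" "2.5"; "" "4.0"], [Int, Float64])
sink = stream!(src, ColumnSink(src))
```

The point of the sketch is that the sink only has to say where a parsed value and its null flag go; the field-by-field loop stays the same for any destination.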
I'd like to decide on the Julia structure that CSV.read() returns. Speak now or forever hold your peace (or write your own parser, I don't care). The current candidates are:
I'm leaning towards Dict{String,NullableArray{T}} as it's the most straightforward.
@johnmyleswhite @davidagold @StefanKarpinski @jiahao @RaviMohan
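To make the Dict-of-columns option concrete, here is a rough sketch of what working with such a return value could look like. It does not assume NullableArrays is installed: Vector{Union{T,Missing}} from Base is swapped in as a stand-in for NullableArray{T}, and the column names and values are made up for illustration.

```julia
# Stand-in for Dict{String,NullableArray{T}}: each column is a
# Vector{Union{T,Missing}} keyed by its (made-up) column name.
table = Dict{String,Vector}(
    "id"    => Union{Int,Missing}[1, 2, 3],
    "price" => Union{Float64,Missing}[9.5, missing, 7.25],
)

# Columns are looked up by name and keep their own element type.
col = table["price"]

# Count the missing entries, then sum while skipping them.
n_missing = count(ismissing, col)
total = sum(skipmissing(col))
```

One thing to weigh with this layout: a Dict does not keep insertion order, so the original column order of the file would have to be tracked separately if it matters.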