JuliaData / CSV.jl

Utility library for working with CSV and other delimited files in the Julia programming language
https://csv.juliadata.org/

CSV.read() return value #2

Closed quinnj closed 9 years ago

quinnj commented 9 years ago

I'd like to decide on the Julia structure that CSV.read() returns. Speak now or forever hold your peace (or write your own parser, I don't care). The current candidates are:

I'm leaning towards Dict{String,NullableArray{T}} as it's the most straightforward.

@johnmyleswhite @davidagold @StefanKarpinski @jiahao @RaviMohan

johnmyleswhite commented 9 years ago

I've got a ton of datasets that I can send over. I particularly like optimizing against the files Wes used for Pandas: http://wesmckinney.com/blog/a-new-high-performance-memory-efficient-file-parser-engine-for-pandas/

quinnj commented 9 years ago

I'd appreciate any and all test files people have access to and are willing to share. Feel free to share a link or email me personally.

Great find @ihnorton, I'll definitely do some benchmarking vs. that site.

johnmyleswhite commented 9 years ago

Sent you an e-mail, @quinnj, with some of the bigger datasets I've been profiling against.

ViralBShah commented 9 years ago

Here's a discussion on readdlm that showed up on my LinkedIn feed:

https://www.linkedin.com/grp/post/5144163-5836134323916398595

quinnj commented 9 years ago

The default return type of CSV.read is Data.Table, which is a very thin wrapper type around any kind of data argument (by default a Vector{NullableVector}, but it could be a DataFrame, Vector{Vector}, Matrix, etc.). Things are also pretty composable through the DataStreams framework to support any number of "sink" types; e.g. the SQLite package supports parsing a CSV file directly to an SQLite table, or vice versa. What's nice is that CSV is currently composable enough that writing a new "sink" type usually doesn't require more than 10-15 lines of code. Indeed, the code for parsing a CSV.Source into a Vector{NullableVector} is:

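# Parse the field at (row, col) as a T and store it in the destination column.
function getfield!{T}(io::IOBuffer, dest::NullableVector{T}, ::Type{T}, opts, row, col)
    # CSV.getfield returns the parsed value plus a flag saying whether the field was null
    @inbounds val, null = CSV.getfield(io, T, opts, row, col) # row + datarow
    # write the value and its null flag into the column's two parallel arrays
    @inbounds dest.values[row], dest.isnull[row] = val, null
    return
end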

# Stream every field of the CSV source into the columns of the Data.Table sink.
function Data.stream!(source::CSV.Source, sink::Data.Table)
    rows, cols = size(source)
    types = Data.types(source)
    # visit fields in file order: row by row, left to right
    for row = 1:rows, col = 1:cols
        @inbounds T = types[col]
        CSV.getfield!(source.data, Data.unsafe_column(sink, col, T), T, source.options, row, col) # row + datarow
    end
    return sink
end
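
For anyone reading this on a current Julia version, here is a rough, self-contained sketch of the same source-to-sink pattern. It is not the CSV.jl or DataStreams API: NullableColumn, ToySource, parsefield, storefield! and stream! are all hypothetical names made up for illustration, the "source" is just a matrix of already-split strings, and treating empty or unparseable fields as null is an assumption of the sketch.

```julia
# Hypothetical stand-ins for illustration only; none of these names are CSV.jl API.

# A nullable column stored as two parallel arrays (values plus null flags).
struct NullableColumn{T}
    values::Vector{T}
    isnull::Vector{Bool}
end

# Allocate a column of n slots, all marked null until something is stored.
NullableColumn{T}(n::Integer) where {T} =
    NullableColumn{T}(Vector{T}(undef, n), fill(true, n))

# A toy source: fields already split into strings, plus one element type per column.
struct ToySource
    cells::Matrix{String}
    types::Vector{DataType}
end

Base.size(s::ToySource) = size(s.cells)

# Parse one field; returns (value, isnull).
# Assumption: an empty or unparseable numeric field counts as null.
function parsefield(::Type{T}, text::AbstractString) where {T<:Number}
    v = tryparse(T, text)
    return v === nothing ? (zero(T), true) : (v, false)
end
parsefield(::Type{String}, text::AbstractString) = (String(text), isempty(text))

# Parse the field as a T and store the value and null flag at `row`.
function storefield!(dest::NullableColumn{T}, ::Type{T}, text, row) where {T}
    val, null = parsefield(T, text)
    dest.values[row], dest.isnull[row] = val, null
    return
end

# Walk the source row by row and fill one sink column per source column.
function stream!(source::ToySource, sink::AbstractVector)
    rows, cols = size(source)
    for row in 1:rows, col in 1:cols
        T = source.types[col]
        storefield!(sink[col], T, source.cells[row, col], row)
    end
    return sink
end

cells = ["1" "2.5" "a";
         ""  "3.0" "";
         "3" ""    "c"]
source = ToySource(cells, [Int, Float64, String])
rows, _ = size(source)
sink = [NullableColumn{T}(rows) for T in source.types]
stream!(source, sink)
```

The point of the sketch is that a new sink only needs a per-field store method and a loop over the source; everything type-specific lives in the per-field parse step.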