JuliaData / CSV.jl

Utility library for working with CSV and other delimited files in the Julia programming language
https://csv.juliadata.org/

CSV.read() return value #2

Closed quinnj closed 8 years ago

quinnj commented 9 years ago

I'd like to decide on the Julia structure that CSV.read() returns. Speak now or forever hold your peace (or write your own parser, I don't care). The current candidates are:

I'm leaning towards Dict{String,NullableArray{T}} as it's the most straightforward

@johnmyleswhite @davidagold @StefanKarpinski @jiahao @RaviMohan

jiahao commented 9 years ago

How is Tables.Table represented?

cc @jakebolewski

quinnj commented 9 years ago

A Tables.Table is a somewhat opaque Julia struct wrapping an underlying in-memory or on-disk SQLite database. It defines common "table"/"frame"-like operations that actually generate SQL statements which get executed underneath (e.g. indexing columns/rows, boolean indexing, group by, sorting, etc.). The idea is for it to "feel" like a dataframe-like object with the full SQL table engine underneath.

I'm still working out the interface for using Julia functions on a Table type, since that's entirely possible through SQLite's UDF interface; I just need to iron out exactly how a user does that.

ScottPJones commented 9 years ago

I thought I'd seen something about a NamedTuple package; I wonder if that might work well.

RaviMohan commented 9 years ago

In my data loading experiments, I used Union(Array{Nullable{Int64},1}, Array{Nullable{Int64},1}, ...), essentially a union of arrays of nullables. This is essentially the typed version of pandas's basic data structure and seems to work well enough.

That said, my focus was on solving the performance issues readdlm and DataFrames' readtable had (vs. R and pandas) when trying to load more than 10 million rows, which is a different problem from writing a generic CSV parser. And it works, in that we beat R's read.table handily and come within a constant factor of R's data.table's fread. In a production-grade system, inferring a column's type (vs. declaring it) gets tricky, but is still doable.

I am experimenting with approaches to this that don't lose performance, and also with implementing operators on the loaded dataset, which gets interesting in the presence of really large datasets and calls for query optimization, etc. In short, I'm focused on loading and working with really large data sets.

That is a different, if intersecting, problem from writing a generic CSV parser, which is much more complex, with many gnarly special cases.

This might be blasphemy, but I think the right approach is to just choose a reasonable data structure and go with it, then test for performance/memory problems and refactor as required.

I vote for Dict{String,NullableArray{T}}.

My only suggestion, FWIW, is that you might want to do timing/memory checks with large (10-100 million row) data sets post-implementation, to confirm we are getting better performance than @jiahao found with the existing readdlm implementation (see https://github.com/JuliaLang/julia/issues/10428 for details).

I'm fairly sure that Dict{String,NullableArray{T}} will work. And worst case, someone can write a converter to whatever other data structure is required.

My 2 cents.

@johnmyleswhite @davidagold @StefanKarpinski @jiahao @quinnj

ScottPJones commented 9 years ago

I really don't like either of the Dict{String,...} approaches. It means that you have to grow N different vectors (N == number of columns) as you read data, and what happens when there is no header?

ViralBShah commented 9 years ago

I really would love to have a DataFrame (not necessarily the current one, but whatever our new implementation will be) be what comes out, which is then easily composable across so many other packages.

StefanKarpinski commented 9 years ago

I think that the CSV reader should ideally be composable with different consumers that construct different data types. I.e. the parsing and type inference logic should go in one place, while different consumers provide the logic for constructing different data structures. Going even further, it should be possible to decouple the input format from this as well, so that you can have N data formats and M data types and not require N*M readers to handle all possible combinations. The tricky bit is getting the abstraction right to allow this flexibility without sacrificing performance.
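
Purely as an illustration of that decoupling (every name below, rawfield/parsefield/materialize/dictsink, is made up for this sketch, not an existing API): the format-specific part only hands out raw fields, the shared part turns text into typed values, and the consumer is just a function that receives named, typed columns.

# rawfield: stand-in for the format-specific reader (CSV, TSV, fixed-width, ...)
rawfield(rows, i, j) = rows[i][j]

# parsefield: the shared text => typed-value logic
parsefield(::Type{Int},         s) = parse(Int, s)
parsefield(::Type{Float64},     s) = parse(Float64, s)
parsefield(::Type{ASCIIString}, s) = s

# materialize: one driver serving any (format, output) combination
function materialize(rows, names, coltypes, sink)
    cols = Any[ [parsefield(coltypes[j], rawfield(rows, i, j)) for i in 1:length(rows)]
                for j in 1:length(names) ]
    return sink(names, cols)    # the sink alone decides the output structure
end

# two consumers, same parsing code; a DataFrame sink would look the same,
# just building a DataFrame from names/cols instead of a Dict
dictsink(names, cols) = Dict(zip(names, cols))

materialize(Any[["1", "2.5"], ["2", "3.5"]], ["id", "score"], [Int, Float64], dictsink)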

johnmyleswhite commented 9 years ago

Will comment more on this later, but I started making an abstract CSV parser once in the past: https://github.com/johnmyleswhite/CSVReaders.jl

FWIW, I'm much more excited about Jacob's work than about finishing the one I started.

ScottPJones commented 9 years ago

Just looked at that, looks like a nice abstraction, would it make sense to fit Jacob's code into that framework, adding support for some of the stuff that Jacob described above?

StefanKarpinski commented 9 years ago

@johnmyleswhite, how do you feel about the idea of that abstraction, having given it a try? Are you meh on the idea of that decoupling in principle or just want a fresh start on the implementation and/or design?

johnmyleswhite commented 9 years ago

I like the idea as a goal to aspire to, but I'd have to work on it again to say whether it could ever be made to work. It's a use case where you need to be obsessive about inlining, but I'm still not sure that's sufficient, because some type information might not be available at the right points in time.

quinnj commented 9 years ago

FWIW, my Pipelines.jl idea is the same kind of I/O abstraction for hooking up arbitrary readers/writers. I've come to realize that it's certainly more of a "phase 2" kind of idea where a lot of the actual readers/writers still need a lot of development (hence my push on CSV, SQLite, ODBC, etc.).

I certainly think it's the right long-term idea, but for now, I'm more focused on doing a few connections between types (CSV=>Julia structure, CSV=>SQLite table, ODBC=>CSV, ODBC=>SQLite, etc.).

For this issue, I'd just like to figure out a good Julia-side structure to return the contents of a CSV file. I think I'm leaning towards Dict{String,NullableVector{T}} or DataFrame with NullableVector{T} columns.

johnmyleswhite commented 9 years ago

I'd suggest OrderedDict{String,NullableVector{T}}. That retains more information about the input format, which might help in debugging.
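
A small illustration of the order-preservation point, assuming DataStructures.jl for OrderedDict:

using DataStructures

cols = OrderedDict("name" => ["a", "b"], "id" => [1, 2])
collect(keys(cols))    # ["name", "id"]: columns iterate in file order,
                       # whereas a plain Dict gives no particular order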

ScottPJones commented 9 years ago

@quinnj @johnmyleswhite Are you not concerned about 1) what to do with headerless CSV files and 2) the performance of having to expand each column separately when a new row is added? I've frequently had to deal with data with hundreds of columns, and it also seems that access to a row would be much slower, due to hitting a different cache line for each column. Of course, there are use cases for columnar databases, and the OrderedDict might be a good structure for that; maybe one should be able to create either an OrderedDict or a NamedTuple from a CSV input.

quinnj commented 9 years ago

Good point, John. Where did Jeff's PR for making Base.Dict ordered by default land? It'd be great to avoid another package dependency for this.

@ScottPJones, nope, not concerned. For headerless CSVs (which are the minority in my experience), we can auto-generate column names (i.e. "Column1", "Column2", etc.), as DataFrames, R's data.frame, data.table, pandas, etc. all do; that's pretty much standard in these cases. Note you can also provide your own column names if you want.
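
For illustration, the auto-naming amounts to something like:

ncols  = 4                                  # known after parsing the first row
header = ["Column$i" for i in 1:ncols]      # ["Column1","Column2","Column3","Column4"]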

I'm also not sure what your point about "expanding each column separately" means. You can't add rows to a Matrix anyway, so whether it's a Dict{String,Vector} or a DataFrame, it's the same operation of adding elements to the individual columns. I'm also not sure why we would necessarily care about the performance of adding rows to a Julia structure in a CSV parsing package; that seems like a concern for Base, DataStructures, or DataFrames.

ScottPJones commented 9 years ago

When you read the next row from a CSV, you need to push! the value for each column into a separate Vector, right? When you access it, you need to fetch from M Vectors if you want to get a particular row.

quinnj commented 9 years ago

No, all the fastest CSV readers pre-allocate vectors for parsing; no pushing necessary.

When you access it, you need to fetch from M Vectors if you want to get a particular row.

Yes, just like DataFrames and to a certain extent, Julia arrays in general (all column-oriented).
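
As a minimal sketch of what "pre-allocate vectors for parsing" means, assuming the row count and column types are known up front (plain Vector{Nullable{T}} is used here just for illustration, not as the chosen column type):

coltypes = [Int64, Float64]
nrows    = 1000000

# one typed, pre-sized vector per column; parsing fills cells in place, no push! needed
columns = [fill(Nullable{T}(), nrows) for T in coltypes]

columns[1][1] = Nullable(42)      # "store field (row 1, col 1)" during parsing
columns[2][1] = Nullable(3.14)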

ScottPJones commented 9 years ago

No, all the fastest CSV readers pre-allocate vectors for parsing; no pushing necessary.

That still doesn't address the issue of having to hit M (number of columns) cache lines for every set or access. Just because Julia or Fortran arrays of two or more dimensions are column-oriented doesn't necessarily mean that's the best data structure for a CSV file. At JuliaCon I heard some people say that they'd rather have row-oriented anyway (for at least certain calculations it was apparently more efficient). You could also use NamedTuples, which would nicely preserve the type information, IIUC.

quinnj commented 9 years ago

@ScottPJones, I'd love to see you take a crack at a CSV parser! I think it may open your eyes a little more to the real costs and issues; it feels a little like you're picking specific topics you think may be issues without really looking at the current code or having tried things out yourself.

Potential cache-line misses and most other implementation nuances like this are totally swamped by the actual text parsing => typed value operation. Do some profiling on CSV.read() and you'll see it all comes back to CSV.readfield for different types. Now, we do get some nice speedups by mmapping, and the biggest advantage in this package over Base is the alternative use of the current CString type which ends up being much more efficient than current Base string types. But overall, there's a reason that this package now pretty much matches data.table's fread and pandas csv reader: you eventually hit an optimization limit constrained by text-parsing.

The current issue with row orientation is the fact that you can't actually make tuples! Go ahead, try to write a row parser that returns a tuple. That's what Keno is currently working on with https://github.com/JuliaLang/julia/pull/12113. Even with that functionality, I'm still not sure about the tradeoff of having to continuously allocate row tuples while parsing vs. the current industry-wide approach of pre-allocating typed column vectors. Sure, there are interface arguments where someone may want to parse by iterating over row tuples, but from an overall performance perspective, I'm going to guess that you'll have a hard time making row-oriented parsing as fast as column-oriented parsing; not to mention that other languages' CSV readers would probably all be row-oriented if that were a magic solution for speed.

ScottPJones commented 9 years ago

@quinnj I have written a CSV parser before, and I'm just raising the issues that I remember from years ago (plus ones that are specific to Julia). The thing is, the whole point of parsing a CSV file is then being able to use the data afterwards. Parsing does need to be fast, but not at the expense of having efficient data structures produced.

I'm not sure about the "current industry-wide approach" you talk about; can you give examples? I was in industry for a long time and didn't see (or write) any parsers that acted like that. Just look at the Java and Python CSV parsers: they all return rows.

quinnj commented 9 years ago

@ScottPJones, your participation in this discussion is becoming increasingly frustrating. I would really recommend you take some time to look at the code in this repository, the DataFrames readtable code, and John's CSVReaders code, and then actually code something that does it better, to get a feel for the kind of issues we're discussing when parsing CSV files into appropriately typed Julia structures. You may have a lot of "industry experience", but so far I feel like I'm catching you up to what we're actually trying to do here vs. other, current best-in-class CSV readers (which I've already mentioned several times). If you just want a Python or Java CSV reader, Base has readline and split. We're clearly going for something more powerful here.

ScottPJones commented 9 years ago

@quinnj Why are you getting frustrated? I'd already looked at your code, John's code (which I had already said), and the DataFrames code, and years ago I did code something that arguably did it better. You talk about "current best-in-class CSV readers", but you've only mentioned pandas, some things in R, and SQLite (which is row-oriented also). I really don't see those as "industry-wide" either, unless your definition of "industry-wide" is limited to numerical/scientific computing. It would be good to have something more powerful, yes, and I think @StefanKarpinski's and @johnmyleswhite's comments point to ways of possibly doing that.

ViralBShah commented 9 years ago

@ScottPJones I do agree with @quinnj that your comments here are quite frustrating. It sounds like you have mounted a legal defense against every line Jacob has said. You have brought a number of unrelated issues into this comment, and how is "someone saying something at JuliaCon" even meaningful to support any argument? And then there is the time-tested "I have implemented everything before and I know what is right" argument. What do cache lines have to do with this discussion? Please see the original topic. I suggest you delete all your comments on this thread so that we can keep the discussion to the topic at hand.

ScottPJones commented 9 years ago

@ViralBShah I would say that @quinnj's comments have been rather frustrating. He started this off by asking for recommendations, and since I've implemented a CSV parser before, I responded. Would you rather that only people with no experience implementing this respond? He then started making comments saying that I should read the code, which I already had. He has also waved his hands about a "current industry-wide approach" and "current best-in-class CSV readers" without any supporting evidence of such a thing. I also don't know where you get the "legal defense"; if you read my comments, I've been supportive of the effort. I was just trying to answer his original question, which was about the output format. I stated exactly why, from a performance standpoint of actually using the data once it has been parsed, a row/record-oriented format would be more efficient for many use cases (which is where cache lines come in).

Your comment is actually the one that has absolutely nothing technical to contribute.

JeffBezanson commented 9 years ago

Ok, if you reject the examples of Pandas and R as good CSV readers, please point us to the CSV reader we should be looking at or comparing against instead.

carnaval commented 9 years ago

@ScottPJones Talking about the performance of a data structure without specifying access pattern is meaningless.

Column oriented:

Since the majority use case is linear processing, column oriented is definitely the way to go here. If you want to do random access and find out that memory access is limiting (i.e. even if you do random access you have a small very hot working set that barely fits in cache) then you'll have to build an appropriate data structure and ingest the csv in it (i.e. what databases do).

The goal is not to have the CSV parser spit out a data structure which works for every use case since such a thing does not exist.

ScottPJones commented 9 years ago

@JeffBezanson I never rejected them, I just don't see them as being "industry wide"; R in particular is very much specific to a particular segment.

@carnaval If you look at what I wrote earlier, I said very specifically that there are definitely use cases where a column-oriented structure would be better, and that it would be good if the parser could create either a column-oriented structure or something like a NamedTuple structure. Your experiences may have been different from mine as far as column vs. row orientation goes; I was usually dealing with pulling in and processing very large CSV files (frequently tab-separated instead of comma-separated) to ingest data written out by RDBMSs, frequently with hundreds of columns. If you try to process things column-oriented, you can simply run out of cache lines in an L1 cache, and it also depends on the associativity of the cache. Also, why do you say that the majority use case is linear processing? My use cases were more about loading CSV files into memory, processing chunks of rows, and then outputting them into our own database format. @quinnj had talked about using SQLite for storage, which is also a row-oriented format.

The goal is not to have the CSV parser spit out a data structure which works for every use case since such a thing does not exist.

Again, I said that it would be good if it output into different formats, and not just a single column oriented one.

jiahao commented 9 years ago

The extant CSV parsers I am aware of fall into two categories.

  1. Row-based parsers, which return an iterable over rows. The data in each row are usually not parsed further into other data types like numeric values or dates. Some examples:
  2. Column-based parsers, which return DataFrames or other tabular objects, where each entry of each column is parsed further into primitive data types. Examples:

Category 1 parsers can be implemented in Julia essentially with code like

rows = [split(line, ',') for line in readlines(open("data.csv"))]

Essentially all of @quinnj's options fall into Category 2.

P.S. It turns out that node-csv provides a nice test suite of CSV files called csv-spectrum. Worth bookmarking.

jiahao commented 9 years ago

Matrix{Any} also has the very unfortunate side effect of boxing every single data element, which causes Base.readdlm to suffer every time it reads in sufficiently complicated data sets. I've discussed with @RaviMohan the idea that readdlm should be changed to return some other data structure, like a dictionary of columns.

Regarding the Dict{String,Vector{T}} option with sentinel values for nulls: I believe this is essentially the current DataFrames implementation - an associative structure mapping column names to columnar data, with NA sentinels. The problem with this design is that many operations on the data are not type-stable - consider that even getindex on a single element would return Union(T, NAtype).
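
A small illustration of that type-stability point, assuming DataArrays.jl and NullableArrays.jl as they existed at the time:

using DataArrays, NullableArrays

da = DataArray([1.0, 2.0, 3.0]); da[2] = NA
x  = da[1]                 # inferred as Union(Float64, NAtype): every use of x must
                           # branch (or box), which hurts tight loops over the column

na = NullableArray([1.0, 2.0, 3.0], [false, true, false])
y  = na[1]                 # always a concrete Nullable{Float64}, so loops stay type-stable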

ScottPJones commented 9 years ago

@jiahao CSV parsers are a bit more complicated than what you've described for "Category 1", read RFC 4180 for even the simple "standard" CSV format, and just because it is "row oriented" doesn't mean that values aren't parsed into numeric values, dates, etc. CSV files are often used as a way of dumping RDBMS tables to be loaded into another DBMS. More recently I've seen more use of JSON for that sort of thing though.

jiahao commented 9 years ago

@ScottPJones what I wrote for the examples I gave in Category 1 is correct. The parsers I linked to literally do not do any other parsing, except for nodecsv and cassava, which optionally parse the strings into objects.

I did read RFC 4180, which states very clearly in the preamble that "It does not specify an Internet standard of any kind." and on page 4 that 'Implementors should "be conservative in what you do, be liberal in what you accept from others" (RFC 793 [8]) when processing CSV files'.

quinnj commented 9 years ago

Thanks for your comment @jiahao; that's a good summary of the landscape that I'm aware of.

Also nice find on the csv-spectrum testing suite; those should complement my set nicely.

At this point, I think I'm going to give Dict{String,NullableVector{T}} and DataFrame with NullableVector{T} columns both a try and report back with some results.

carnaval commented 9 years ago

Also, how come you say that the majority use case is linear processing?

because

processing chunks of rows

In general, yes: given enough columns that you fill up L1 by fetching a cache line's worth of column data (but not so many that you'll hit L2 anyway before being done with a single row, and provided you do little-to-no computation), you have a potential efficiency cliff here. We could argue about performance speculation all day, but I can already tell you what's going to happen anyway:

@quinnj will implement what he needs to make parsing fast in Julia (i.e. column-oriented is easier) and to make operations typical of numeric work (say, mean(column)) much more efficient. If someone (e.g. you?) feels that it's not efficient enough for some use case, they'll be free to contribute the additional flexibility.

ScottPJones commented 9 years ago

@jiahao I don't see how the code there with split is going to be able to handle even RFC 4180; that is the example I was referring to. (I put "standard" in quotes because I'd seen that in RFC 4180 as well.)

davidagold commented 9 years ago

What happens to CSV files that are too large to fit in memory? One of the nice aspects of the SQLite DB wrapping Table seemed to be that it unified treatments of such large datasets and memory-tractable datasets.

EDIT: Also, @quinnj would help developing the DataFrame w/ NullableVector{T} columns be useful to you?

simonbyrne commented 9 years ago

One idea I had been kicking around (but haven't written any code for it, so take this with a grain of salt) was to make the CSV type parametric with a tuple argument, similar to the ideas being discussed in https://github.com/JuliaStats/DataFrames.jl/issues/744.

To make things more concrete, you would have something like

csvfmt = CSVTyped{Tuple{ASCIIString,Float64,Float64}}

which would represent a 3-column csv format, with the corresponding column types. Then you would write something like

a,b,c = readline(file, csvfmt)

which would call parse on each column, returning single values (with known type), or

A,B,C = readall(file, csvfmt)

which would return corresponding vectors (or NullableVectors, or whatever).

Then the general readcsv function would:

  1. read the first row as strings
  2. read the next n (say 20 or so) lines and use those to infer the types of the columns (a rough sketch of this step follows the list),
  3. construct the appropriate CSVTyped instance, and call that on the remaining lines. The result could then be any of the above structures, or one proposed in the DataFrames.jl issue.
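
As a rough sketch of what step 2 could look like (guesstype/infertypes are illustrative names, not the package's API, written against 0.4-era tryparse, which returns a Nullable):

function guesstype(field::AbstractString)
    isnull(tryparse(Int, field))     || return Int
    isnull(tryparse(Float64, field)) || return Float64
    return ASCIIString                          # fall back to a string column
end

# a column's type is the widest guess seen across the sampled rows
function infertypes(samplerows)
    widen2(S, T) = S === T ? S : (S <: Real && T <: Real ? Float64 : ASCIIString)
    ncols = length(samplerows[1])
    return [reduce(widen2, [guesstype(row[j]) for row in samplerows]) for j in 1:ncols]
end

infertypes(Any[["1", "2.5", "x"], ["2", "3", "y"]])    # => [Int64, Float64, ASCIIString]
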
quinnj commented 9 years ago

@simonbyrne, I've actually gone down a similar road of thought. I think there's a lot of potential there, the major hiccup is what @JeffBezanson hilariously pointed out during a JuliaCon hackathon: there's unfortunately no way to "make" a tuple. I got to the point of actually writing the code when I realized this. There's no way to create an "empty" typed tuple and then set the values. I'm hopeful Keno's work will alleviate this, but currently I don't know of any efficient workarounds here.

simonbyrne commented 9 years ago

Can't you do it via a vector? e.g.

t = [ASCIIString, Float64, Float64]
tt = Tuple{t...}

simonbyrne commented 9 years ago

Oh, sorry, I misunderstood. What I had in mind (but forgot to mention) was that readline and readall would be generated functions (hence the reason for the tuple parameter), which could construct the output tuple directly (after all values have been read and parsed).

EDIT: Roughly what I had in mind:

@generated function readline{Ts}(io::IO, ::CSVTyped{Ts})
    b = Expr(:block)
    r = Expr(:tuple)
    for T in Ts.parameters     # Ts is a Tuple type, so iterate its element types
        g = gensym()
        push!(b.args, :($g = parse($T, readfield(io))))
        push!(r.args, g)
    end
    quote
        $b
        $r
    end
end

ScottPJones commented 9 years ago

I don't get this business about not being able to "make" a tuple.

julia> x(1,2.3)
(a => 1, b => 2.3)

julia> y = (1, 2.35)
(1,2.35)

julia> x(y...)
(a => 1, b => 2.35)

julia> typeof(ans)
NamedTuples._NT_ab{Int64,Float64}

Is that not making a tuple, which can be addressed by field name or index, which retains the types? NamedTuples seem very nice - would be nice to have this feature in the base language.

quinnj commented 9 years ago

Ah, thanks for the code snippet @simonbyrne. Indeed, I hadn't thought of the generated function + Expr(:tuple) option. Definitely worth looking into. Though I worry a little about @JeffBezanson or @vtjnash compiler-shaming us on a 1000 column dataset....

jiahao commented 9 years ago

Is there any risk of running afoul of MAX_TUPLETYPE_LEN in inference.jl?

JeffBezanson commented 9 years ago

Note, you can do things like this:

julia> collect(zip([1],['a'],["b"],[2.5]))
1-element Array{Tuple{Int64,Char,ASCIIString,Float64},1}:
 (1,'a',"b",2.5)

Zip uses ... to achieve this. But yes, when you get to 1000 columns we need a different approach.

ScottPJones commented 9 years ago

@simonbyrne's idea of generating the CSV reader might be the solution to a lot of issues, IMO, and could make the performance fly compared to anything else out there. The generator could produce different readers based on the passed-in types (so you could have your own generated CSV parser that handled all numeric fields as decimal floating point, for example, or allowed \ escaping in strings, or parsed the funky Excel ="0000123" work-around to preserve leading zeros), generate different parser code when the number of columns exceeds some threshold (probably lower than 1000), or store into an OrderedDict, a vector of NamedTuples, a DataFrame, or whatever, all without any run-time performance penalty. That would 1) really showcase Julia's strengths and (IMO) 2) really blow away all other CSV parser implementations.

jiahao commented 9 years ago

@simonbyrne doesn't DataFrames.readnrows! already do some form of generated method specialization for ingest? There is some sort of parametric dispatch on DataFrames.ParseType.

My benchmarking in JuliaLang/julia#10428 showed that DataFrames.readtable is still ~10-20x slower than R's fread (~5-6x slower with gc_disable()). @johnmyleswhite has already commented on inefficiencies relating to the inherent representation of NAs in DataArrays which is responsible for some of the gap; perhaps he and @garborg can comment on the remaining inefficiencies.

johnmyleswhite commented 9 years ago

Can we choose some specific files we want to optimize against? For small files, I think the staged function is going to be very slow, but it's not clear to me how much of a win there is for larger files.

jiahao commented 9 years ago

I have my test TSV files in 10428 that we can use as large test data.

simonbyrne commented 9 years ago

Is that file (or a simulated equivalent) available somewhere?

jiahao commented 9 years ago

@simonbyrne the data are proprietary, but I can get you access on the research machine.

ihnorton commented 9 years ago

There is a ~~Medium~~ Moderate Data test file linked from here.

(the latter link being possibly also of interest: benchmarks of various Java CSV parsers)