csvDataFrame rewrite - Githubissues

HarlanH commented 12 years ago

Currently csvDataFrame uses core csvread into a matrix of Anys then parses each column. Preferred features incude:

don't use csvread
possibly: two-pass read, the first to figure out vector types and sizes, and the second to load the data into the columns
allow UTF-8 column names
detect various dialects of CSV, ala Python's CSV-parsing library
(optionally, by default) use PooledDataVecs for non-numeric data if number of unique values is much less than the total rows
support dates

johnmyleswhite commented 11 years ago

I've set up a branch called csv with very-rough experimental support for reading CSV files according to the RFC4180 conventions. You can see the additions at https://github.com/HarlanH/JuliaData/compare/csv

This branch reflects my (perhaps misguided) hope that the CSV reader we ultimately build can also be used for streaming data processing. That flexibility brings with it some potential problems, which I describe briefly below:

Pros:

New approach is correct for CSV files with standard quoted fields
New approach provides a viable mechanism for constructing DataStream's, which need:
- Mechanism for parsing lines separate from parsing files
- Explicit control over types on a line-by-line basis

Cons:

New approach is about twice as slow as old method, which was already slow
New approach is not yet correct for quotes-inside-of-quotes
New approach is not yet correct for newlines-inside-of-quotes

I think the major performance problems come from the frequent use I've made of cbind and rbind to construct DataFrame's. Other ideas would be very welcome.

doobwa commented 11 years ago

Though I've tried, I haven't been able to do anything that's convincingly better than _jl_dlm_readrow. Because of that, I've been using the setup below for reading data. The state is often just a dictionary, and updater() is an arbitrary function (which often just pushes different elements of the row to different places).

# Assumes infile has a header row
function streamer(infile, updater, state, dlm, eol, max_rows)
    io = open(infile)
    cn = _jl_dlm_readrow(io, dlm, eol)
    counter = 0
    for i = 1:max_rows
        counter += 1
        if counter % 10000 == 0 println("$counter/$max_rows"); end
        vals = _jl_dlm_readrow(io, dlm, eol)
        updater(state, vals, cn, i)
    end
    return state
end

Like you, I want explicit control over types on a line-by-line basis, which the updater() function can do. Let me add another pro to your list: explicit control over which elements to keep. This can be handy when some of the columns have a lot of data you want to ignore (e.g. text fields).

If I want to construct a DataFrame, I use the elements of the dictionary I constructed. That way I avoid all the copying required when using rbind repeatedly. Also, there is some overhead involved with rbinding in the presence of PooledDataVecs, so make sure to watch out for that as well.

StefanKarpinski commented 11 years ago

Btw, guys, I'm going to revamp the I/O and string subsystems soon using a zero-copy approach and hopefully will be able to make this kind of thing really fly.

doobwa commented 11 years ago

Awesome! I can't wait. I've been banging my ahead against this a bit. Can you give a little preview for how you plan to do this?

StefanKarpinski commented 11 years ago

The basic strategy is to absolutely minimize copying and syscalls as much as possible.

When doing string input (the I half of I/O) we use the readv syscall to have the kernel write as much data as it has into heap-allocate byte arrays of some fixed size. The strings that are actually presented to the user are then substrings and rope strings built out of those buffers. Because neither substrings nor rope strings require copying, this is a zero-copy operation.

When doing things like CSV parsing, rather than copying all the string data, you'll just be making new substrings and rope strings referencing the same buffers. Any buffers that become entirely unreferenced will get garbage collected. That means that if you take small substrings of huge input strings, everything will work just fine and memory will get freed as one would like it to.

When printing a big complicated rope string that's constructed this way, the key, again, is to minimize data copying and syscalls. On the output side (the O half of I/O), for a given sequence of substrings and rope strings to output, we construct the appropriate vector of io_vec structs which is also zero-copy, and then just make a single call to writev, which will print the full sequence of appropriate bytes.

There are some issues. One major concern is that indexing into a rope string is going to be much slower than indexing into a byte string. Therefore, I may not want to actually use rope strings but rather do something like a "super string" that keeps a vector of constituent string objects instead together with a table of lengths, allowing faster indexing and even more importantly, fast iteration (the state object might become something other than just an byte index, however :-|).

Another issue is that PCRE regular expressions will not work on complicated structures like rope strings. The current approach to that is to transcode transparently and then apply the operation to the resulting byte string. A possible future approach would be to write a regex engine in pure Julia which could operate on arbitrary string types. Such an engine could get automatic JIT power because writing r"^a+b+$" could generate actual code for that regex (generic code at compile time but once it was applied to a specific string type, it would be jitted to type-specific machine code). We could also follow the lead of Google's RE2 library and limit our semantics to real regular expressions which can only use a fixed amount of stack space and are very fast.

HarlanH commented 11 years ago

Interesting. For most CSV use-cases with DFs, most (but not all) of the columns will be numeric, and will thus be converted to a numeric and the string gc'ed. And at least some of the string columns will presumably be pooled, so only one heap-allocated substring will be need to be stored per value. I wonder if the overhead and complexity of doing the optimization you describe, Stefan, is worth it in our case?

On Thu, Sep 27, 2012 at 5:56 PM, Stefan Karpinski notifications@github.comwrote:

The basic strategy is to absolutely minimize copying and syscalls as much as possible.

When doing string input (the I half of I/O) we use the readv syscall to have the kernel write as much data as it has into heap-allocate byte arrays of some fixed size. The strings that are actually presented to the user are then substrings and rope strings built out of those buffers. Because neither substrings nor rope strings require copying, this is a zero-copy operation.

When doing things like CSV parsing, rather than copying all the string data, you'll just be making new substrings and rope strings referencing the same buffers. Any buffers that become entirely unreferenced will get garbage collected. That means that if you take small substrings of huge input strings, everything will work just fine and memory will get freed as one would like it to.

When printing a big complicated rope string that's constructed this way, the key, again, is to minimize data copying and syscalls. On the output side (the O half of I/O), for a given sequence of substrings and rope strings to output, we construct the appropriate vector of io_vec structs which is also zero-copy, and then just make a single call to writev, which will print the full sequence of appropriate bytes.

There are some issues. One major concern is that indexing into a rope string is going to be much slower than indexing into a byte string. Therefore, I may not want to actually use rope strings but rather do something like a "super string" that keeps a vector of constituent string objects instead together with a table of lengths, allowing faster indexing and even more importantly, fast iteration (the state object might become something other than just an byte index, however :-|).

Another issue is that PCRE regular expressions will not work on complicated structures like rope strings. The current approach to that is to transcode transparently and then apply the operation to the resulting byte string. A possible future approach would be to write a regex engine in pure Julia which could operate on arbitrary string types. Such an engine could get automatic JIT power because writing r"^a+b+$" could generate actual code for that regex (generic code at compile time but once it was applied to a specific string type, it would be jitted to type-specific machine code). We could also follow the lead of Google's RE2 libraryhttp://code.google.com/p/re2/and limit our semantics to real regular expressions which can only use a fixed amount of stack space and are very fast.

— Reply to this email directly or view it on GitHubhttps://github.com/HarlanH/JuliaData/issues/13#issuecomment-8956399.

johnmyleswhite commented 11 years ago

It's probably worth pointing out that there are at least three things that slow my code down that won't benefit from Stefan's promising improvements on I/O:

Need to run state-machine when splitting lines into pieces. In principle, this is a one-pass algorithm and should be nearly as fast as splitting, but could plausibly not be at present. I'll time it soon to find out.
Creation of single row DataFrame from DataVec's. This needs to be done much more cleanly than I've done it so far. I'm sure this step is wasteful.
Repeated binding of rows into cumulative DataFrame. This step is probably the real disaster for the current system.

StefanKarpinski commented 11 years ago

@HarlanH: those are good points. Data frame's are not a poster child for these optimizations, so perhaps I shouldn't even have brought it up here. Would be interesting to see, however.

@johnmyleswhite: you're probably right about the performance issues. It would probably make sense to try to avoid creating one-row data frames, and it would make even more sense to construct full data vectors before creating the data frame object. These kinds of things are pretty tricky, largely because you don't know how many records you are going to have at the end when you begin unless you do a scan through the document in advance — which actually might make sense for better performance (using wc -l is pretty effective).

HarlanH commented 11 years ago

I was thinking a two-pass algorithm might make sense, as you could figure out data types and such on the first pass, then allocate memory (or disk files, if you're converting to a memory-mapped DF), then re-read and do conversions efficiently on the second pass.

On Sat, Sep 29, 2012 at 5:30 PM, Stefan Karpinski notifications@github.comwrote:

@HarlanH https://github.com/HarlanH: those are good points. Data frame's are not a poster child for these optimizations, so perhaps I shouldn't even have brought it up here. Would be interesting to see, however.

@johnmyleswhite https://github.com/johnmyleswhite: you're probably right about the performance issues. It would probably make sense to try to avoid creating one-row data frames, and it would make even more sense to construct full data vectors before creating the data frame object. These kinds of things are pretty tricky, largely because you don't know how many records you are going to have at the end when you begin unless you do a scan through the document in advance — which actually might make sense for better performance (using wc -l is pretty effective).

— Reply to this email directly or view it on GitHubhttps://github.com/HarlanH/JuliaData/issues/13#issuecomment-9008460.

johnmyleswhite commented 11 years ago

There's a tension here that may have no clean solution: I want to create row-level data frames by default to handle streaming data that needs to allow algorithms to process a row at a time, but should probably move those single rows one-by-one into a preformed empty DataFrame that can be defined using wc -l along with the metadata. We may need to abandon this for reading a whole DataFrame, but I'll keep working since it would be much nicer to have a coherent architecture for both online and batch processing.

johnmyleswhite commented 11 years ago

I definitely think the two-pass algorithm is right. My plan for tomorrow was to create guess_metadata and calculate_metadata functions that pass through the data to compute types, row number, column number and header-based column names. You need a guess_metadata function if you want to use the same architecture to process an infinite stream of data.

StefanKarpinski commented 11 years ago

Another thing you can do is to read some rows and use the file size and average row size to guess at the necessary amount of memory. I do that here in my 300-line TSV static data key-value server and it works pretty well and cuts down on array resizing but only taking one pass through the data. I still suspect that the two-pass method is better.

johnmyleswhite commented 11 years ago

I just pushed some new code that attempts to pre-allocate an empty data frame, then fill it in row-by-row. The three variants I have all perform almost equally slower than csvDataFrame, which makes me think the problems are calls to cbind, the type conversion steps or the state-machine splitter. Will keep working on this over the week.

johnmyleswhite commented 11 years ago

Just pushed a change that removes the excessive calls to cbind. This new parser is now faster than the old one, while being both closer to the CSV standard and usable for streaming data. I'll clean up the splitter function and then would like to discuss a new API for reading DataFrames from delimited data files that uses this new backend.

I'll also finally be able to code up more substantive examples of streaming data processing, including streaming variance, entropy and correlations, along with an SGD-based linear regression solver that Stefan and I worked on previously without a DataFrame backend. Before the end of the month I'm hoping to have a clean Formula-based API for doing simple regression and classification on 10-1000 GB data sets in Julia.

tshort commented 11 years ago

Here are some ideas for an API for reading delimited files. I like the idea of using dlmread and variants to fill in an AbstractDataFrame, something like:

dlmread(io::IO, df::AbstractDataFrame, options)
csvread(io::IO, df::AbstractDataFrame, options)

That way, you can preallocate and specify the column types and the container type. By using an AbstractDataFrame, the user could specify a DataFrame, a DistributedDataFrame, OnDiskDataFrame, or something else. Because df specifies many of the normal options to reading delimited files, options may only need to specify the delimiter, the number of rows to skip, quoting options, and NA specifications.

We will need to supply helper functions to fill in defaults. We can create a type that holds the arguments to dlmread given above:

type DataFrameFile
    io::IO
    df::AbstractDataFrame
    options
end

dlmread(x::DataFrameFile)

Now, we need some methods to fill in defaults for a DataFrame:

# create an unitialized DataFrame from the given file and options:
DataFrameFile(io::IO)
DataFrameFile(io::IO, options)
    # Take one pass through the file and calculate the number of rows
    # and figure out column types.

Here are some ways you might read delimited files:

d = csvread(DataFrameFile("test.csv"))
d = dlmread(DataFrameFile("test.tsv", @options(sep := '\t', header := true)))

N = Base.countlines("test.csv") - 1
# make a DataFrame with a mixture of column types, including Vectors and DataVecs
df = DataFrame({PooledDataVec(zeros(Int64, N)), PooledDataVec(fill("", N)),
                zeros(N), DataVec(zeros(N)), fill("", N), fill("", N)},
               ["n", "name1", "x", "y", "name2", "name3"])
d = csvread("test.csv", df, @options(skip := 1))

johnmyleswhite commented 11 years ago

This is great, Tom. I'm on board with basically everything and will chime in with some comments tomorrow morning.

tautologico commented 11 years ago

I've used the csv reading from the csv branch on a few occasions now (following the examples in test/parser.jl) and, aside from having to specify column names and types, it worked well for the files I needed (while csvDataFrame didn't work). There were quotes inside quotes but they were simply omitted in the resulting DataFrame. As I didn't need that these values were kept intact, the current behavior was adequate.

johnmyleswhite commented 11 years ago

Good to know. I've done about half of the work on the inference system for column names and types. I'm going to try to finish it tomorrow.

I need to think a lot more carefully about how to handle quotes-inside-quotes correctly. Less clear to me is how one ought to handle quotes inside an unquoted string, i.e. a line like 12,3131,ac"cd"a,ada. Currently the system just removes those quotes.

johnmyleswhite commented 11 years ago

This issue is substantially out of date. Closing.

JuliaData / DataFrames.jl

csvDataFrame rewrite #13