Documentation for "allowquotes" argument readtable() is wrong

nalimilan commented 11 years ago

The docs at http://juliastats.github.io/DataFrames.jl/io.html say the default value of allowquotes is false, but it is actually true. It's just that the wording is backwards, as it says "ignore".

The whole naming of this argument is weird, anyway. The question is not whether or not quotes are allowed, but rather whether or not they are stripped from the data on input.

Maybe the argument could be removed altogether: just set quotemark to an empty value or nothing. This raises another problem: quotemark should allow passing several values, or none at all, which can be useful if you have " and ' in the same file (always possible in the dangerous world of CSV).

johnmyleswhite commented 11 years ago

I'm open to renaming this. But the functionality isn't about stripping data from inputs, it's about interpreting quotes as indicative of being in a quoted region where the semantics of parsing change.

It's usually a code smell in Julia to allow nothing as an input. I'd happier allowing an array of quotemark values, which could be empty. But there comes a point at which I hesitate to keep chasing after compatibility with broken CSV files. At some point we have to tell people to just fix their files rather than keep trying to make sense of malformed data. Not sure if multiple quote marks is past that line, but it's pretty close for me.

nalimilan commented 11 years ago

Sure, you cannot adapt to any weird syntax that a CSV file may have. But I figured making quotemark a vector was more elegant since it allowed in a quite natural way to both specify no quote mark at all (thus removing allowquotes) or more than one. R allows this too. But it does not seem that common attempts to "standardize" (or at least summarize) the definition of the CSV format consider ' to be a possible quoting character. So really I've no opinion on that.

johnmyleswhite commented 11 years ago

Yeah, this one is pretty hard to know what's best. My concern is that we may not really need this functionality for 99.9% of files and that it will slow down the parsing of all the standard files. That concern may be premature, but it would be nice to see if we can add this functionality without making things slower.

nalimilan commented 11 years ago

Do you think the overhead would be significant? If you apply a loop with a call to over all characters provided in quotemark, the difference with the situation where you know there's only one character should be negligible.

I don't think this feature is essential, but it would be good to be able to add it one day if needed, so better have the syntax allow it. If Julia is to become one of the reference languages for statistics, a flexible table import function is a must.

johnmyleswhite commented 11 years ago

I'm not sure what the performance change will be. It really depends on what the compiler does. If, for example, it can hold 1 quotemark in a register, that could be a lot faster than having to loop through an array each time. But I have no idea if that would happen.

I'm totally on board with adding the syntax now, even if we don't actually support it.

nalimilan commented 11 years ago

So I've gone ahead and tried to measure the overhead. I've created a CSV file with 3 numeric columns and 3 character columns, and 10000 rows. (I've run the code once first to exclude compilation time for measurements.)

With the current code:

@elapsed for i in 1:100 readtable("test/test.csv", allowquotes=false) end 
64.288568822

julia> @elapsed for i in 1:100 readtable("test/test.csv") end
65.48854605

q1 = '"'
julia> @elapsed for i in 1:100 readtable("test/test.csv", quotemark=q1) end
59.222476463

With quotemark a vector and == replaced with in:

julia> q=Array(Char, 0)
0-element Array{Char,1}

julia> q1 = ['"']
1-element Array{Char,1}:
 '"'
julia> q2 = ['"', '\'']
2-element Array{Char,1}:
  '"'
 '\''

julia> @elapsed for i in 1:100 readtable("test/test.csv", quotemark=q) end
66.779568966

# Equivalent to quotemark=q1
julia> @elapsed for i in 1:100 readtable("test/test.csv") end
66.978523463

julia> @elapsed for i in 1:100 readtable("test/test.csv", quotemark=q1) end
67.355751388

julia> @elapsed for i in 1:100 readtable("test/test.csv", quotemark=q2) end
68.473008541

After removing the obsolete allowquotes:

julia> @elapsed for i in 1:100 readtable("test/test.csv", quotemark=q) end
67.6082216

# Equivalent to quotemark=q1
julia> @elapsed for i in 1:100 readtable("test/test.csv") end
67.342936702

julia> @elapsed for i in 1:100 readtable("test/test.csv", quotemark=q1) end
66.365887766

julia> @elapsed for i in 1:100 readtable("test/test.csv", quotemark=q2) end
67.150638076

So there a slight overhead in the new version, but nothing to worry about. More interesting is the difference between the default version, and the equivalent one specifying quotemark=q1, which exists in the current code. Not sure where it comes from...

Anyway, these results are with the CSV file alrady in the cache, but if you add disk access to the picture, the differences are pretty small.

This use case shows a potential improvement to add to the compiler: IIUC, one version of the function is compiled for each type signature; it would be interesting to also compile one version for cases where an argument is an empty array, and one for cases where it is not. Here, in could be optimized away when the array is empty, which would offer some gain while keeping the code very clean. This is probably a frequent pattern. But are keyword arguments taken into account when compiling different versions of a function?

Overall I'm not 100% convinced my proposal is a real enhancement, though I like the idea of passing and empty vector to quotemark. At least this was an interesting exercise.

(BTW, is there a shorter version of Array(Char, 0)? Ideally [] would be accepted, since the empty vector is the same for all types.)

nalimilan commented 11 years ago

FWIW, if you care about speed, Matthew Dowle told me some time ago that his R function fread(), in package data.table, was really fast compared with R's read.table(). You may want to have a look at the C source to see whether he found interesting optimizations. This function also detects the delimiter character automatically from the first rows.

http://cran.r-project.org/web/packages/data.table/ http://www.inside-r.org/packages/cran/data.table/docs/fread

EDIT: I just tried on the same CSV file, and the results are impressive:

system.time(for(i in 1:100) fread("~/test/test.csv"))
        user    system    elapsed
      6.207       0.076       6.327

10x faster than readtable(), even with detection of delimiter!

Though even R's read.table() is 2x faster as readtable():

system.time(for(i in 1:100) read.csv("~/test/test.csv"))
        user    system    elapsed
     37.366       0.200      37.739

johnmyleswhite commented 11 years ago

We can't even read the code for data.table since it's GPL.

StefanKarpinski commented 11 years ago

Someone could read it and then explain it to someone else who implements it. I'm pretty sure avoids any copyright violation danger.

nalimilan commented 11 years ago

Or you could add Matthew as a co-author provided he relicenses the Julia code under MIT. But seriously, the C code will not be translated to Julia directly, so the differences will so huge the license does not apply. ;-)

johnmyleswhite commented 11 years ago

I'm not sure about that. Certainly if I read a book, describe it you in a paraphrase and then you write a new book from that paraphrase, there are potential copyright violations that come on.

johnmyleswhite commented 11 years ago

Asking Matthew seems easier. Somebody who's done less to antagonize the R community should probably do that.

nalimilan commented 11 years ago

Read the code, and if you find anything that sounds useful and original enough to require licensing, I'll ask him for permission. If he does not agree, you'll just have to wait for a few months until you have forgotten enough of the details so that it's no longer a derivative work... ;-)

StefanKarpinski commented 11 years ago

Certainly if I read a book, describe it you in a paraphrase and then you write a new book from that paraphrase, there are potential copyright violations that come on.

I'm fairly certain that would not be a copyright violation. Sketchy and probably not very moral, but not a copyright violation.

JuliaData / DataFrames.jl

Documentation for "allowquotes" argument readtable() is wrong #400