Closed nalimilan closed 9 years ago
I'm open to renaming this. But the functionality isn't about stripping data from inputs, it's about interpreting quotes as indicative of being in a quoted region where the semantics of parsing change.
It's usually a code smell in Julia to allow nothing
as an input. I'd happier allowing an array of quotemark
values, which could be empty. But there comes a point at which I hesitate to keep chasing after compatibility with broken CSV files. At some point we have to tell people to just fix their files rather than keep trying to make sense of malformed data. Not sure if multiple quote marks is past that line, but it's pretty close for me.
Sure, you cannot adapt to any weird syntax that a CSV file may have. But I figured making quotemark
a vector was more elegant since it allowed in a quite natural way to both specify no quote mark at all (thus removing allowquotes
) or more than one. R allows this too. But it does not seem that common attempts to "standardize" (or at least summarize) the definition of the CSV format consider '
to be a possible quoting character. So really I've no opinion on that.
Yeah, this one is pretty hard to know what's best. My concern is that we may not really need this functionality for 99.9% of files and that it will slow down the parsing of all the standard files. That concern may be premature, but it would be nice to see if we can add this functionality without making things slower.
Do you think the overhead would be significant? If you apply a loop with a call to over all characters provided in quotemark
, the difference with the situation where you know there's only one character should be negligible.
I don't think this feature is essential, but it would be good to be able to add it one day if needed, so better have the syntax allow it. If Julia is to become one of the reference languages for statistics, a flexible table import function is a must.
I'm not sure what the performance change will be. It really depends on what the compiler does. If, for example, it can hold 1 quotemark in a register, that could be a lot faster than having to loop through an array each time. But I have no idea if that would happen.
I'm totally on board with adding the syntax now, even if we don't actually support it.
So I've gone ahead and tried to measure the overhead. I've created a CSV file with 3 numeric columns and 3 character columns, and 10000 rows. (I've run the code once first to exclude compilation time for measurements.)
With the current code:
@elapsed for i in 1:100 readtable("test/test.csv", allowquotes=false) end
64.288568822
julia> @elapsed for i in 1:100 readtable("test/test.csv") end
65.48854605
q1 = '"'
julia> @elapsed for i in 1:100 readtable("test/test.csv", quotemark=q1) end
59.222476463
With quotemark
a vector and ==
replaced with in
:
julia> q=Array(Char, 0)
0-element Array{Char,1}
julia> q1 = ['"']
1-element Array{Char,1}:
'"'
julia> q2 = ['"', '\'']
2-element Array{Char,1}:
'"'
'\''
julia> @elapsed for i in 1:100 readtable("test/test.csv", quotemark=q) end
66.779568966
# Equivalent to quotemark=q1
julia> @elapsed for i in 1:100 readtable("test/test.csv") end
66.978523463
julia> @elapsed for i in 1:100 readtable("test/test.csv", quotemark=q1) end
67.355751388
julia> @elapsed for i in 1:100 readtable("test/test.csv", quotemark=q2) end
68.473008541
After removing the obsolete allowquotes
:
julia> @elapsed for i in 1:100 readtable("test/test.csv", quotemark=q) end
67.6082216
# Equivalent to quotemark=q1
julia> @elapsed for i in 1:100 readtable("test/test.csv") end
67.342936702
julia> @elapsed for i in 1:100 readtable("test/test.csv", quotemark=q1) end
66.365887766
julia> @elapsed for i in 1:100 readtable("test/test.csv", quotemark=q2) end
67.150638076
So there a slight overhead in the new version, but nothing to worry about. More interesting is the difference between the default version, and the equivalent one specifying quotemark=q1
, which exists in the current code. Not sure where it comes from...
Anyway, these results are with the CSV file alrady in the cache, but if you add disk access to the picture, the differences are pretty small.
This use case shows a potential improvement to add to the compiler: IIUC, one version of the function is compiled for each type signature; it would be interesting to also compile one version for cases where an argument is an empty array, and one for cases where it is not. Here, in
could be optimized away when the array is empty, which would offer some gain while keeping the code very clean. This is probably a frequent pattern. But are keyword arguments taken into account when compiling different versions of a function?
Overall I'm not 100% convinced my proposal is a real enhancement, though I like the idea of passing and empty vector to quotemark
. At least this was an interesting exercise.
(BTW, is there a shorter version of Array(Char, 0)
? Ideally []
would be accepted, since the empty vector is the same for all types.)
FWIW, if you care about speed, Matthew Dowle told me some time ago that his R function fread()
, in package data.table, was really fast compared with R's read.table()
. You may want to have a look at the C source to see whether he found interesting optimizations. This function also detects the delimiter character automatically from the first rows.
http://cran.r-project.org/web/packages/data.table/ http://www.inside-r.org/packages/cran/data.table/docs/fread
EDIT: I just tried on the same CSV file, and the results are impressive:
system.time(for(i in 1:100) fread("~/test/test.csv"))
user system elapsed
6.207 0.076 6.327
10x faster than readtable()
, even with detection of delimiter!
Though even R's read.table()
is 2x faster as readtable()
:
system.time(for(i in 1:100) read.csv("~/test/test.csv"))
user system elapsed
37.366 0.200 37.739
We can't even read the code for data.table
since it's GPL.
Someone could read it and then explain it to someone else who implements it. I'm pretty sure avoids any copyright violation danger.
Or you could add Matthew as a co-author provided he relicenses the Julia code under MIT. But seriously, the C code will not be translated to Julia directly, so the differences will so huge the license does not apply. ;-)
I'm not sure about that. Certainly if I read a book, describe it you in a paraphrase and then you write a new book from that paraphrase, there are potential copyright violations that come on.
Asking Matthew seems easier. Somebody who's done less to antagonize the R community should probably do that.
Read the code, and if you find anything that sounds useful and original enough to require licensing, I'll ask him for permission. If he does not agree, you'll just have to wait for a few months until you have forgotten enough of the details so that it's no longer a derivative work... ;-)
Certainly if I read a book, describe it you in a paraphrase and then you write a new book from that paraphrase, there are potential copyright violations that come on.
I'm fairly certain that would not be a copyright violation. Sketchy and probably not very moral, but not a copyright violation.
The docs at http://juliastats.github.io/DataFrames.jl/io.html say the default value of
allowquotes
isfalse
, but it is actuallytrue
. It's just that the wording is backwards, as it says "ignore".The whole naming of this argument is weird, anyway. The question is not whether or not quotes are allowed, but rather whether or not they are stripped from the data on input.
Maybe the argument could be removed altogether: just set
quotemark
to an empty value ornothing
. This raises another problem:quotemark
should allow passing several values, or none at all, which can be useful if you have"
and'
in the same file (always possible in the dangerous world of CSV).