Looking at FastCSV, it states that it can only parse a certain dialect. If so, we probably can't use it, as csvlint.rb supports all sorts of different dialects (tab-separated, for example). Is that right?
Might there be a way of using it only in the cases where the dialect is appropriate?
Maybe, yes.
Another option is to figure out how to implement multiple dialects in FastCSV :) https://github.com/opennorth/fastcsv/issues/2
that too :)
I've added support for dialects using `:quote_char` and `:col_sep` (I've never seen a CSV using a multi-byte `:col_sep`). `:row_sep` is still restricted to implementing the `:auto` default option.
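For context, here's a minimal sketch of what that enables, assuming `FastCSV.parse_line` handles `:col_sep` and `:quote_char` the same way as the stdlib `CSV.parse_line` (the sample data is made up):

```ruby
require 'fastcsv'

# Parse a tab-separated line with a non-default quote character.
# Assumes FastCSV.parse_line mirrors CSV.parse_line's option handling.
row = FastCSV.parse_line("foo\t'bar\tbaz'\tqux", col_sep: "\t", quote_char: "'")
# => ["foo", "bar\tbaz", "qux"]
```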
Do any of the CSVs you've validated use options not yet supported by FastCSV?
Thanks for the work on this so far! :+1:
We're hoping to get some funding later in the year for more detailed work on csvlint.rb and the accompanying web app, so this will be a great complement to our planned improvements :smile:
Now that we're doing streaming validation, any mention of `row_sep` is probably moot, so this might be worth investigating again. How's progress on FastCSV going?
I've just tried a dumb drop-in replacement on our current master branch (adding FastCSV to the bundle, requiring it, then replacing `CSV.parse_line` with `FastCSV.parse_line` at https://github.com/theodi/csvlint.rb/blob/master/lib/csvlint/validate.rb#L135) and benchmarked like so:
```ruby
require 'csvlint'
require 'benchmark'

url = "https://dl.dropboxusercontent.com/u/135665/LGADataTest.csv"
@benchmarks = []

6.times do
  benchmark = Benchmark.measure {
    validator = Csvlint::Validator.new(url)
  }
  puts benchmark
  @benchmarks << benchmark.real
end

average = @benchmarks.inject { |sum, el| sum + el }.to_f / @benchmarks.size
puts "Average time taken - #{average}"
```
Without FastCSV I get the following (columns are user, system, total and real time in seconds):

```
13.040000 0.280000 13.320000 ( 17.225819)
13.940000 0.270000 14.210000 ( 17.271949)
13.240000 0.240000 13.480000 ( 16.326915)
12.320000 0.160000 12.480000 ( 14.888282)
12.760000 0.190000 12.950000 ( 19.387910)
13.180000 0.180000 13.360000 ( 16.071267)
Average time taken - 16.86202366666667
```
With FastCSV I actually get a reduction in performance, which I didn't expect:

```
13.370000 0.560000 13.930000 ( 17.802677)
13.230000 0.490000 13.720000 ( 16.509611)
13.170000 0.500000 13.670000 ( 17.079451)
14.520000 0.640000 15.160000 ( 24.219318)
13.290000 0.530000 13.820000 ( 16.274699)
13.610000 0.540000 14.150000 ( 16.667552)
Average time taken - 18.092218
```
I could, of course, be doing something dumb. Any ideas?
Using a URL means your tests will be thrown off by differences in download speeds. I instead used `path = 'LGADataTest.csv'` and `validator = Csvlint::Validator.new(File.new(path))`, but I get `invalid byte sequence in UTF-8`.
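For anyone reproducing this locally, a hypothetical workaround is to declare the file's actual encoding and transcode to UTF-8 on read; Windows-1252 below is only a guess at the source encoding:

```ruby
require 'csvlint'

# "invalid byte sequence in UTF-8" usually means the file isn't UTF-8.
# Opening with an explicit external encoding and transcoding to UTF-8
# (Windows-1252 is a guess) sidesteps the error for local benchmarking.
path = 'LGADataTest.csv'
io = File.open(path, 'r:windows-1252:utf-8')
validator = Csvlint::Validator.new(io)
```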
Also, looking at the code, you are doing some CSV parsing in Ruby (the `parse_line` method). This is the sort of thing that FastCSV makes fast.
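As a rough sketch of what pushing that work down into the C extension could look like, assuming FastCSV exposes a streaming `raw_parse` that yields each parsed row (the file name is a placeholder):

```ruby
require 'fastcsv'

# Stream-parse the whole file in C rather than line by line in Ruby.
# Assumes FastCSV.raw_parse yields one Array per row to the block.
File.open('LGADataTest.csv') do |f|
  FastCSV.raw_parse(f) do |row|
    # per-row validation (build_formats, length checks, etc.) would go here
  end
end
```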
Instead of substituting FastCSV at this point, I recommend using stackprof to find out where the real bottlenecks are.
As you can see in the call graph in #157, csvlint spends only 5% of its time in `CSV#shift`, which is the operation that FastCSV optimizes, so you won't see a dramatic improvement until the rest of csvlint is optimized.
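For reference, a minimal stackprof sketch for profiling a single local validation run (paths are placeholders):

```ruby
require 'stackprof'
require 'csvlint'

# Profile one validation in wall-clock mode and dump the result;
# inspect it afterwards with: stackprof tmp/csvlint.dump --text
StackProf.run(mode: :wall, out: 'tmp/csvlint.dump') do
  Csvlint::Validator.new(File.new('LGADataTest.csv'))
end
```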
Two cucumber failures:

- FastCSV doesn't support a custom `:row_sep`; it implements the `:auto` behavior.
- FastCSV only raises an error for the first error. CSVLint's tests don't catch this deficiency. There's an open issue for FastCSV: https://github.com/opennorth/fastcsv/issues/3
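To illustrate why stopping at the first error matters for a linter, here's a sketch of the per-line pattern (stdlib CSV, made-up data) that lets every malformed row be collected:

```ruby
require 'csv'

# A linter wants every bad row, not just the first. Parsing each
# physical line separately and rescuing per line collects them all;
# a parser that stops at the first error breaks this accounting.
errors = []
"a,b\n\"bad,1\nc,d\n".each_line.with_index(1) do |line, num|
  begin
    CSV.parse_line(line)
  rescue CSV::MalformedCSVError => e
    errors << [num, e.message]
  end
end
# errors includes an entry for line 2 ("Unclosed quoted field ...")
```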
(Old issue description)

This is an experimental replacement of CSV with FastCSV. This PR can be kept open until tests pass.

So far, tests indicate that CSV parsing is dwarfed by `#build_formats`, so #103 is the priority. After #103 is merged, this PR produces more substantial speed improvements. If all my issues are closed, csvlint will have gone from 246s to parse a 12MB file to 7s (35x faster).

FastCSV doesn't yet error on inconsistent line endings. There's an open issue for that: https://github.com/opennorth/fastcsv/issues/6 FIXED