Data-Liberation-Front / csvlint.rb

The gem behind http://csvlint.io
MIT License
287 stars 88 forks source link

Use FastCSV #104

Closed jpmckinney closed 5 years ago

jpmckinney commented 9 years ago

Two cucumber failures:

FastCSV only raises an error for the first error. CSVLint's tests don't catch this deficiency. There's an open issue for FastCSV: https://github.com/opennorth/fastcsv/issues/3


(Old issue description)

This is an experimental replacement of CSV with FastCSV. This PR can be kept open until tests pass.

So far, tests indicate that CSV parsing is dwarfed by #build_formats, so #103 is the priority. After #103 is merged, this PR produces more substantial speed improvements. If all my issues are closed, csvlint will have gone from 246s to parse a 12MB file to 7s (35x faster).

FastCSV doesn't yet error on inconsistent line endings. There's an open issue for that. https://github.com/opennorth/fastcsv/issues/6 FIXED

Floppy commented 9 years ago

Looking at FastCSV, it states it can only use a certain dialect. If so, we probably can't use it, as csvlint.rb supports all sorts of different dialects; tab-separated, for example. Is that right?

JeniT commented 9 years ago

Might there be a way of using it only in the case where the dialect is appropriate?

Floppy commented 9 years ago

Maybe, yes.

jpmckinney commented 9 years ago

Another option is to figure out how to implement multiple dialects in FastCSV :) https://github.com/opennorth/fastcsv/issues/2

Floppy commented 9 years ago

that too :)

jpmckinney commented 9 years ago

I've added support for dialects using:

:row_sep is still restricted to implementing the :auto default option.

Do any of the CSVs you've validated use options not yet supported by FastCSV?

pezholio commented 9 years ago

Thanks for the work on this so far! :+1:

We're hoping to get some funding later in the year for more detailed work on csvlint.rb and the accompanying web app, so this will be a great complement to our planned improvements :smile:

pezholio commented 9 years ago

Now we're doing streaming validation, any mention of row_sep is probably moot, so this might be worth investigating again. How's progress on FastCSV going?

pezholio commented 9 years ago

I've just tried a dumb drop in replacement on our current master branch (adding FastCSV to the bundle, requiring, then replacing CSV.parse_line with FastCSV.parse_line at https://github.com/theodi/csvlint.rb/blob/master/lib/csvlint/validate.rb#L135) and benchmarked like so:

require 'csvlint'
require 'benchmark'

url = "https://dl.dropboxusercontent.com/u/135665/LGADataTest.csv"

@benchmarks = []

6.times do
  benchmark = Benchmark.measure {
    validator = Csvlint::Validator.new(url)
  }
  puts benchmark
  @benchmarks << benchmark.real
end

average = @benchmarks.inject{ |sum, el| sum + el }.to_f / @benchmarks.size

puts "Average time taken - #{average}"

Without FastCSV I get the following:

 13.040000   0.280000  13.320000 ( 17.225819)
 13.940000   0.270000  14.210000 ( 17.271949)
 13.240000   0.240000  13.480000 ( 16.326915)
 12.320000   0.160000  12.480000 ( 14.888282)
 12.760000   0.190000  12.950000 ( 19.387910)
 13.180000   0.180000  13.360000 ( 16.071267)
Average time taken - 16.86202366666667

With FastCSV I get a further reduction in performance, which I didn't expect:

 13.370000   0.560000  13.930000 ( 17.802677)
 13.230000   0.490000  13.720000 ( 16.509611)
 13.170000   0.500000  13.670000 ( 17.079451)
 14.520000   0.640000  15.160000 ( 24.219318)
 13.290000   0.530000  13.820000 ( 16.274699)
 13.610000   0.540000  14.150000 ( 16.667552)
Average time taken - 18.092218

I could of course, be doing something dumb, any ideas?

jpmckinney commented 9 years ago

Using a URL means your tests will be thrown off by differences in download speeds. I instead used path = 'LGADataTest.csv' and validator = Csvlint::Validator.new(File.new(path)), but I get invalid byte sequence in UTF-8.

Also, looking at the code, you are doing some CSV parsing in Ruby (the parse_line method). This is the sort of thing that FastCSV made fast.

Instead of substituting FastCSV at this point, I recommend using stackprof to find out where the real bottlenecks are.

jpmckinney commented 9 years ago

As you can see in the call graph in #157, csvlint spends only 5% of its time in CSV#shift, which is the operation that FastCSV optimizes, so you won't see a dramatic improvement until the rest of csvlint is optimized.