Data-Liberation-Front / csvlint.rb

The gem behind http://csvlint.io
MIT License
283 stars 86 forks source link

Limit lines #101

Closed Hoedic closed 9 years ago

Hoedic commented 9 years ago

Currently csvlint.rb reads all the lines of the CSV files which can become very long for large files while most of the issues will be discovered in the first 10,000 or 100,000 lines (in files with potentially millions of records).

The proposed change add a supported parameter (limitLines) in the configuration dialect to limit the maximum number of lines evaluated, allowing to control the execution duration. (could even be useful for the web instance csvlint.io where most users will think that no output will come.

coveralls commented 9 years ago

Coverage Status

Coverage remained the same when pulling fdf7a0153737bf6e7475fc398483d5ca46f9004d on Hoedic:limit_lines into f926eb5ff5114aa1b6a4dcf20b2dc8112ced139a on theodi:master.

pezholio commented 9 years ago

I'm going to leave this for now. Most of the time we'll want to analyse all of the CSV, as some errors like inconsistent_values could appear anywhere within the file. That said, we do have an ambition to take http://csvlint.io to beta at some point in the new year, so I'll leave this open for reinvestigation when we get time.

Thanks for the PR though, great that people are finding this useful! :smile: :+1:

jpmckinney commented 9 years ago

@pezholio What's the harm in adding an option to limit the number of lines? If the option is not set, then the whole file is analyzed.

While it's possible for a CSV to have all its bad rows in the last 10% of the file, bad rows in real CSVs tend to be spread throughout a file.

Floppy commented 9 years ago

I've finally got to looking at this, and it's a great pile of stuff, thanks!

Yes, adding a limit is perfectly fine as an option, as it won't affect existing users unless they choose to use it.

Only thing is, the dialect option mirrors CSVDDF, so I think I'd rather not add it to that, but instead to a separate options hash at the end of the initialize parameter list. Does that make sense?

I think we should:

I will pull across the changes from your branch and open separate PRs for those, then get it all merged.

Floppy commented 9 years ago

Ah, my mistake, I see the speed fixes are already in, but obviously this branch just needs a merge from master.

Floppy commented 9 years ago

I'll merge this, and then change to a separate option.

Hoedic commented 9 years ago

Thanks @Floppy.

Sorry for having the speed improvement already in my branch. I initially planned to only have the limit change in my branch but I was already working with the speed improvement and noticed it only when the PR was already sent.

Floppy commented 9 years ago

No worries, sorry I got confused - I hadn't had coffee yet :)