Closed Hoedic closed 9 years ago
I'm going to leave this for now. Most of the time we'll want to analyse all of the CSV, as some errors like inconsistent_values
could appear anywhere within the file. That said, we do have an ambition to take http://csvlint.io to beta at some point in the new year, so I'll leave this open for reinvestigation when we get time.
Thanks for the PR though, great that people are finding this useful! :smile: :+1:
@pezholio What's the harm in adding an option to limit the number of lines? If the option is not set, then the whole file is analyzed.
While it's possible for a CSV to have all its bad rows in the last 10% of the file, bad rows in real CSVs tend to be spread throughout a file.
I've finally got to looking at this, and it's a great pile of stuff, thanks!
Yes, adding a limit is perfectly fine as an option, as it won't affect existing users unless they choose to use it.
Only thing is, the dialect
option mirrors CSVDDF, so I think I'd rather not add it to that, but instead to a separate options hash at the end of the initialize parameter list. Does that make sense?
I think we should:
I will pull across the changes from your branch and open separate PRs for those, then get it all merged.
Ah, my mistake, I see the speed fixes are already in, but obviously this branch just needs a merge from master.
I'll merge this, and then change to a separate option.
Thanks @Floppy.
Sorry for having the speed improvement already in my branch. I initially planned to only have the limit change in my branch but I was already working with the speed improvement and noticed it only when the PR was already sent.
No worries, sorry I got confused - I hadn't had coffee yet :)
Currently csvlint.rb reads all the lines of the CSV files which can become very long for large files while most of the issues will be discovered in the first 10,000 or 100,000 lines (in files with potentially millions of records).
The proposed change add a supported parameter (limitLines) in the configuration dialect to limit the maximum number of lines evaluated, allowing to control the execution duration. (could even be useful for the web instance csvlint.io where most users will think that no output will come.