Stephen-Gates opened this issue 9 years ago
Our plan is to work on scalability and robustness mostly, so we can deal with larger datasets, and more of them. There are also a bunch of bugs to fix as well. We know what needs doing, our only problem is we need to find some funding to spend time doing it!
How much funding do you need?
That's a good question! We're putting together the proposal now, so hopefully we'll know that soon :)
We find Csvlint really helpful, but it does choke on big files. Let me know if you need specific things tested, as I have two staff playing with Csvlint, schemas, and data packaging now and into the new year. We plan to use them as part of our standard publishing process.
Yeah, the issue with large files is certainly something we want to find a fix for as a priority.
We're about to throw some big CSVs at CSVLint. Are you aware of a hard limit on file size?
Is there any advice you can offer on what file sizes csvlint can currently handle, and should be able to handle in the future? If it helps, I've done a small amount of testing and found that on my PC, as soon as a file gets over 750 KB the wait times start to blow out. Under that size, wait times are a very respectable 10-20 seconds. For larger files the checker just keeps running; I've let it go for well over an hour in one case.
I would say that, currently, CSVLint will struggle with anything over 2 MB. This is because the CSV is saved directly into MongoDB during processing, which we've found is not performant. Once we identify some funding, fixing this will probably be first on our wishlist.
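For contrast, a streaming approach validates one row at a time and keeps memory use flat regardless of file size, rather than persisting the whole document to a datastore before validating. This is a hypothetical sketch in Python (it is not CSVLint's actual code; the function name and column-count check are illustrative assumptions):

```python
import csv
import io

def validate_stream(fileobj, expected_cols):
    """Validate a CSV one row at a time, never loading the whole file.

    Memory use stays constant with file size, unlike an approach
    that stores the full document before validating it.
    """
    errors = []
    reader = csv.reader(fileobj)
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    if len(header) != expected_cols:
        errors.append(f"header has {len(header)} columns, expected {expected_cols}")
    # enumerate from 2: row 1 is the header line
    for line_no, row in enumerate(reader, start=2):
        if len(row) != expected_cols:
            errors.append(f"row {line_no}: {len(row)} columns, expected {expected_cols}")
    return errors

# Example: a 3-column file with one ragged row
sample = io.StringIO("a,b,c\n1,2,3\n4,5\n")
print(validate_stream(sample, 3))
```

The same loop works unchanged on a multi-gigabyte file opened from disk, because only one row is in memory at a time.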
Honestly, as currently implemented we don't find CSVLint useful, although it has tons of promise. One of the key limitations is that the tool doesn't report row numbers with data errors. "The data in column 7 is inconsistent with other values in the same column" is too vague when you may have hundreds of rows.
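Attaching a row number to each error is cheap to do once the check iterates over rows. A hypothetical Python sketch (not CSVLint's code; the majority-type heuristic and function name are illustrative assumptions) of a consistency check that pinpoints the offending rows:

```python
import csv
import io

def check_column_consistency(fileobj, col_index):
    """Return the 1-based row numbers whose value in col_index
    doesn't match the column's majority type (number vs string).

    Unlike "column 7 is inconsistent", this tells the user exactly
    which rows to fix.
    """
    rows = list(csv.reader(fileobj))
    values = [row[col_index] for row in rows[1:]]  # skip header row

    def kind(v):
        try:
            float(v)
            return "number"
        except ValueError:
            return "string"

    kinds = [kind(v) for v in values]
    majority = max(set(kinds), key=kinds.count)
    # Row 1 is the header, so data rows start at row 2
    return [i + 2 for i, k in enumerate(kinds) if k != majority]

sample = io.StringIO("id,amount\n1,10\n2,ten\n3,30\n")
print(check_column_consistency(sample, 1))  # rows whose 'amount' isn't numeric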
I just ran a 17-record test CSV containing 17 different errors that violate my schema; CSVLint reported two warnings. And this GitHub repo seems to be stagnant. I love the schema format and the regex support, but I'm not a programmer so I can't help :( I'll post my observations in a separate thread.
Does anyone have any alternative resources or tools to suggest for CSV validating?
Thanks for commenting! We've not had a chance to develop this for a while, I admit, but we should be able to get some time to work on it very soon! Please create tickets for anything you think needs improving, with links to the validations that didn't work as you expected - that would be really helpful :)
@scrybbler we've moved onto Good Tables and FME:
@scrybbler I've added comments to your other issues. csvlint does try to report on all errors with detailed diagnostics, except for when the schema fails to parse and load! I've added an issue #186 to address this, it should resolve some frustrations when using the service.
@Stephen-Gates what were your reasons for migrating - was it related to file sizes, or were there other problems or limitations? As @Floppy says, it'd be useful for us to know as we plan any further work.
File size and lack of any visible progress.
Thank you, Stephen and Leigh! Yes, file size is definitely an issue for us too. Our biggest datasets are 30,000+ records.
Just as an update, we've finally managed to get some time (and money) to spend on CSVlint, so I'm hoping that you'll see some progress in the next few weeks on a lot of this stuff.
Ok cool - I'll keep watching
What are your criteria for moving from Alpha to Beta?