Data-Liberation-Front / csvlint.io

Check that your CSV files are valid
http://csvlint.io
MIT License

Alpha to Beta? #174

Open Stephen-Gates opened 9 years ago

Stephen-Gates commented 9 years ago

What are your criteria for moving from Alpha to Beta?

Floppy commented 9 years ago

Our plan is to work mostly on scalability and robustness, so we can deal with larger datasets, and more of them. There are also a bunch of bugs to fix. We know what needs doing; our only problem is that we need to find some funding to spend time doing it!

Stephen-Gates commented 9 years ago

How much funding do you need?

Floppy commented 9 years ago

That's a good question! We're putting together the proposal now, so hopefully we'll know that soon :)

Stephen-Gates commented 9 years ago

We find CSVlint really helpful, but it does choke on big files. Let me know if you need specific things tested, as I have 2 staff playing with CSVlint, schemas and data packaging now and early in the new year. We plan to use them as part of our standard publishing process.

pezholio commented 9 years ago

Yeah, the issue with large files is certainly something we want to find a fix for as a priority.

Stephen-Gates commented 9 years ago

We're about to throw some big CSVs at CSVlint. Are you aware of a physical limit for file sizes?

TacoSandwich commented 9 years ago

Is there any advice you can offer on what file sizes CSVlint can currently handle and should be able to handle in the future? If it helps, I've done a small bit of testing and found that on my PC, as soon as the file gets over 750 KB in size, the wait times start to blow out. Under that size, wait times are a very respectable 10-20 seconds. For the larger files the checker just continues to run, and I've let it go for well over an hour in one case.

pezholio commented 9 years ago

I would say that, currently, CSVlint will struggle with anything over 2 MB. This is to do with the way the CSV is processed by saving it directly into MongoDB, which we've found is not performant. Once we identify some funding, this will probably be the first thing on our wishlist to fix.
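
As a rough illustration only (this is not CSVlint's code, and not a description of the planned fix), here is a minimal Python sketch of checking a CSV one row at a time, so memory use stays flat regardless of file size. It uses the standard `csv` module; the function name `check_row_widths` and the sample data are made up for the example.

```python
import csv
import io


def check_row_widths(stream):
    """Lazily yield (row_number, message) for rows whose width differs from the header.

    Hypothetical helper for illustration; only one row is held in memory at a time.
    """
    reader = csv.reader(stream)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to check
    expected = len(header)
    # The header is row 1, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        if len(row) != expected:
            yield row_number, f"expected {expected} columns, found {len(row)}"


# Made-up sample data; a real file would be passed as open(path, newline="").
sample = io.StringIO("id,name\n1,Ann\n2\n3,Cy,extra\n")
problems = list(check_row_widths(sample))
for row_number, message in problems:
    print(f"row {row_number}: {message}")
```

Because the function is a generator, a caller can stop after the first few problems instead of waiting for the whole file to be read.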

scrybbler commented 9 years ago

Honestly as currently implemented we don't find CSVLint useful, although it has tons of promise. One of the key limitations is that the tool doesn't report row numbers on data errors. "The data in column 7 is inconsistent with others values in the same column" is too vague when you may have hundreds of rows.

I just ran a 17-record test CSV with 17 different errors that violate my schema. CSVlint reported two warnings. And this GitHub repository seems to be stagnant. I love the schema format and regex support, but I'm not a programmer so I can't help :( I'll post my observations in a separate thread.

Does anyone have any alternative resources or tools to suggest for CSV validation?
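
To make the row-number point concrete, here is a small Python sketch of the kind of report being asked for: each error names the row, the column and the offending value. It is an assumption-laden example, not CSVlint's schema format or output; `SCHEMA`, `validate_rows` and the sample data are hypothetical.

```python
import csv
import io
import re

# Hypothetical schema: column name -> regular expression the whole value must match.
SCHEMA = {"id": r"\d+", "email": r"[^@\s]+@[^@\s]+"}


def validate_rows(stream, schema):
    """Return a list of (row_number, column, value) for every value that fails its pattern."""
    patterns = {name: re.compile(pattern) for name, pattern in schema.items()}
    errors = []
    reader = csv.DictReader(stream)
    # The header is row 1, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        for column, pattern in patterns.items():
            value = row.get(column) or ""  # missing cells are treated as empty strings
            if not pattern.fullmatch(value):
                errors.append((row_number, column, value))
    return errors


# Made-up sample data with one bad id and one bad email.
sample = io.StringIO("id,email\n1,a@example.org\nx,b@example.org\n3,not-an-email\n")
errors = validate_rows(sample, SCHEMA)
for row_number, column, value in errors:
    print(f"row {row_number}, column '{column}': unexpected value {value!r}")
```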

Floppy commented 9 years ago

Thanks for commenting! We've not had a chance to develop this for a while, I admit, but we should be able to get some time to work on it very soon! Please create tickets for anything you think needs to be improved, with links to the validations that didn't work as you expected. That would be really helpful :)

Stephen-Gates commented 9 years ago

@scrybbler we've moved onto Good Tables and FME:

ldodds commented 9 years ago

@scrybbler I've added comments to your other issues. CSVlint does try to report on all errors with detailed diagnostics, except when the schema fails to parse and load! I've added issue #186 to address this; it should resolve some frustrations when using the service.

ldodds commented 9 years ago

@Stephen-Gates what were your reasons for migrating? Was it related to file sizes, or were there other problems or limitations? As @Floppy says, it'd be useful for us to know as we plan for any further work.

Stephen-Gates commented 9 years ago

File size and lack of any visible progress.

scrybbler commented 9 years ago

Thank you, Stephen and Leigh! Yes, file size is definitely an issue for us too. Our biggest data sets are 30,000+ records.

Floppy commented 9 years ago

Just as an update, we've finally managed to get some time (and money) to spend on CSVlint, so I'm hoping that you'll see some progress in the next few weeks on a lot of this stuff.

Stephen-Gates commented 9 years ago

Ok cool - I'll keep watching