dataproofer / Dataproofer

A proofreader for your data
http://dataproofer.org/

Handle large files without causing app to break #61

Closed newsroomdev closed 8 years ago

newsroomdev commented 8 years ago

Summary

Right now, we have an issue when anyone uploads a large CSV: the app freezes and then crashes. @chriszs pointed this out after he tried to upload a 170MB file.

Steps to reproduce

This error can be reproduced if you use larget_tweet_data.csv

Expected behavior

Tests should run and results should still render, or a message should appear explaining that we can't handle large files yet.

Possible fixes?

We're still in beta testing and aren't sure how many users will want to run this on significantly large spreadsheets.

Stephen-Gates commented 8 years ago

I just attempted to run tests on a large CSV (~168MB, 150,000 rows). It worked, but a visual indicator (a progress bar?) to show it's working would be nice.

ejfox commented 8 years ago

@Stephen-Gates That's great and really interesting feedback. Do you know roughly how long it took, and what type of computer you're on?

I think for the time being, we are simply going to process the first 1024 rows as a sample. Reworking the app to use Electron workers isn't feasible given our current launch deadline, but I would like to revisit this really soon!
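
For illustration, a minimal sketch of that kind of sampling, assuming Papa Parse's `preview` option; `processSample` and `runTests` are hypothetical names, not the actual Dataproofer code:

```js
var Papa = require('papaparse');

var SAMPLE_SIZE = 1024;

// Parse only the first SAMPLE_SIZE rows so the renderer never holds the
// whole file in memory, then hand the sample to the test runner.
function processSample(file, runTests) {
  Papa.parse(file, {
    header: true,          // first line is column names
    preview: SAMPLE_SIZE,  // stop after the first 1024 data rows
    complete: function (results) {
      // results.data holds at most SAMPLE_SIZE row objects
      runTests(results.data, results.meta.fields);
    }
  });
}
```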

Stephen-Gates commented 8 years ago

@ejfox It took about 45 seconds running all tests except the Geographic Data Tests, using the setup shown in the screenshot below.

[screenshot 2016-04-07 21 12 51]

Currently we use Open Knowledge's Good Tables (repo) or the ODI's CSVLint (repo) to check our CSV files, but Dataproofer has some different tests that are useful. It would be really nice to see a combined set of tests.

enjalot commented 8 years ago

Some thoughts on how we could proceed

Short-term mitigations

Real solutions

We need to think carefully about streaming rows (Papa Parse); it seems like a harder rearchitecting with more fallout than streaming tests (running one test at a time and updating the results). Streaming rows would break the way we have tests set up now, as they expect the entire list of rows (which is necessary for things like outlier tests that need to know the median/mean of the whole dataset).
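
A rough sketch of that tension, assuming current Papa Parse where the `step` callback's `results.data` is the single parsed row when `header: true` (function names are hypothetical): per-row checks could run inside the callback, but an outlier test still can't flag anything until every value has been seen, so the rows pile up anyway or the file has to be read twice.

```js
var Papa = require('papaparse');

// Stream rows with Papa Parse. Per-row checks (e.g. "is this cell empty?")
// could run inside `step`, but the outlier check below needs the mean and
// standard deviation of the whole column, so the values are still buffered.
function streamOutlierCheck(file, column, done) {
  var values = [];
  Papa.parse(file, {
    header: true,
    step: function (results) {
      var row = results.data;             // the single parsed row object
      var value = Number(row[column]);
      if (!isNaN(value)) values.push(value); // buffered for the whole-dataset stat
    },
    complete: function () {
      var mean = values.reduce(function (a, b) { return a + b; }, 0) / values.length;
      var variance = values.reduce(function (a, v) {
        return a + Math.pow(v - mean, 2);
      }, 0) / values.length;
      var std = Math.sqrt(variance);
      // Flag values more than 3 standard deviations from the mean
      var outliers = values.filter(function (v) {
        return Math.abs(v - mean) > 3 * std;
      });
      done(outliers);                     // only available after the last row
    }
  });
}
```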

I'm also in favor of moving the file processing & test running to the node process. This will make it much more natural to extend to an online webapp, meaning we could reuse 90% of the front-end code (as well as the existing test-running engine).
An important consideration with this change, though, would be the way we manage the tests in the local suite (they are hot-loaded in the browser from a text string rather than being required like normal node modules).
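
To illustrate the difference (the names here are hypothetical, not Dataproofer's actual loader): a test shipped as a node module is simply `require()`d, while a hot-loaded test arrives as source text and has to be compiled at runtime, e.g. with `new Function`.

```js
// A test shipped as a normal node module would just be required:
// var checkEmptyCells = require('./tests/check-empty-cells');

// A hot-loaded test arrives as a string of source code instead, so it has
// to be turned into a callable at runtime. The node process would need an
// equivalent of this step (or the local tests converted into modules).
function loadTestFromString(source) {
  // The compiled function receives the parsed rows and column names.
  return new Function('rows', 'columns', source);
}

// Hypothetical usage:
var testSource = 'return rows.filter(function (r) { return !r.id; }).length;';
var countMissingIds = loadTestFromString(testSource);
// countMissingIds(rows, columns) -> number of rows with an empty "id" cell
```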

Another note re: 1000 rows: when we build in support for larger files, we should still always show the sample first, and then have UI that indicates the progress of processing the full dataset.
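
One possible shape for that, sketched only (callback names are hypothetical, and a browser File object with a `.size` property is assumed): parse the sample with `preview`, show its results immediately, then re-parse the full file with Papa Parse's `chunk` callback and use `meta.cursor` (the byte offset reached so far) to drive a progress bar.

```js
var Papa = require('papaparse');

// Run the quick sample first, then process the whole file while reporting
// progress. onProgress receives a fraction between 0 and 1 for the UI's
// progress bar; onDone receives the full set of rows for the test suite.
function processWithProgress(file, onSample, onProgress, onDone) {
  Papa.parse(file, {
    header: true,
    preview: 1024,
    complete: function (sample) {
      onSample(sample.data);             // show sample results right away

      var rows = [];
      Papa.parse(file, {
        header: true,
        chunk: function (results) {
          rows = rows.concat(results.data);
          onProgress(results.meta.cursor / file.size); // bytes read so far
        },
        complete: function () {
          onDone(rows);                  // full dataset finished
        }
      });
    }
  });
}
```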