dataproofer / Dataproofer

A proofreader for your data
http://dataproofer.org/

Handle large files without causing app to break #61

Closed newsroomdev closed 8 years ago

newsroomdev commented 8 years ago

Summary

Right now, we have an issue when anyone uploads a large CSV: the app freezes and then crashes. @chriszs pointed this out after he tried to upload a 170MB file.

Steps to reproduce

This error can be reproduced if you use larget_tweet_data.csv

Expected behavior

Tests should run and results should still render, or a message should appear explaining that we can't handle large files yet.

Possible fixes?

We're still in beta testing and aren't sure how many users will want to run this on significantly large spreadsheets.

Stephen-Gates commented 8 years ago

I just attempted to run tests on a large CSV (~168MB, 150,000 rows). It worked, but a visual indicator (a progress bar?) to show it's working would be nice.

ejfox commented 8 years ago

@Stephen-Gates That's great and really interesting feedback. Do you know roughly how long it took, and what type of computer you're on?

I think for the time being, we are simply going to process the first 1024 rows as a sample. Reworking the app to use Electron workers isn't feasible given our current launch deadline, but I would like to revisit this really soon!
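
For illustration, a minimal sketch of that kind of sampling, assuming Papa Parse's `preview` option; `processSample` and `runTests` are hypothetical names, not the actual Dataproofer code:

```js
var Papa = require('papaparse');

var SAMPLE_SIZE = 1024;

// Parse only the first SAMPLE_SIZE rows so the renderer never holds the
// whole file in memory, then hand the sample to the test runner.
function processSample(file, runTests) {
  Papa.parse(file, {
    header: true,          // first line is column names
    preview: SAMPLE_SIZE,  // stop after the first 1024 data rows
    complete: function (results) {
      // results.data holds at most SAMPLE_SIZE row objects
      runTests(results.data, results.meta.fields);
    }
  });
}
```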

Stephen-Gates commented 8 years ago

@ejfox It took about 45 seconds running all tests except the Geographic Data Tests, using the setup shown in the screenshot below.

[screenshot 2016-04-07 21 12 51]

Currently we use Open Knowledge's Good Tables (repo) or the ODI's CSVLint (repo) to check our CSV files, but Dataproofer has some different tests that are useful. It would be really nice to see a combined set of tests.

enjalot commented 8 years ago

Some thoughts on how we could proceed

Short-term mitigations

Real solutions

We need to think carefully about streaming rows (Papa Parse); it seems like a harder rearchitecting with more fallout than streaming tests (running one test at a time and updating the results). Streaming rows would break the way we have tests set up now, as they expect the entire list of rows (which is necessary for things like outlier tests that need to know the median/mean of the whole dataset).
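
A rough sketch of that tension, assuming current Papa Parse where the `step` callback's `results.data` is the single parsed row when `header: true` (function names are hypothetical): per-row checks could run inside the callback, but an outlier test still can't flag anything until every value has been seen, so the rows pile up anyway or the file has to be read twice.

```js
var Papa = require('papaparse');

// Stream rows with Papa Parse. Per-row checks (e.g. "is this cell empty?")
// could run inside `step`, but the outlier check below needs the mean and
// standard deviation of the whole column, so the values are still buffered.
function streamOutlierCheck(file, column, done) {
  var values = [];
  Papa.parse(file, {
    header: true,
    step: function (results) {
      var row = results.data;             // the single parsed row object
      var value = Number(row[column]);
      if (!isNaN(value)) values.push(value); // buffered for the whole-dataset stat
    },
    complete: function () {
      var mean = values.reduce(function (a, b) { return a + b; }, 0) / values.length;
      var variance = values.reduce(function (a, v) {
        return a + Math.pow(v - mean, 2);
      }, 0) / values.length;
      var std = Math.sqrt(variance);
      // Flag values more than 3 standard deviations from the mean
      var outliers = values.filter(function (v) {
        return Math.abs(v - mean) > 3 * std;
      });
      done(outliers);                     // only available after the last row
    }
  });
}
```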

I'm also in favor of moving the file processing & test running to the node process. This will make it much more natural to extend to an online webapp, meaning we could reuse 90% of the front-end code (as well as the existing test-running engine).
An important consideration with this change, though, would be the way we manage the tests in the local suite (they are hot-loaded in the browser from a text string rather than being required like normal node modules).
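
To illustrate the difference (the names here are hypothetical, not Dataproofer's actual loader): a test shipped as a node module is simply `require()`d, while a hot-loaded test arrives as source text and has to be compiled at runtime, e.g. with `new Function`.

```js
// A test shipped as a normal node module would just be required:
// var checkEmptyCells = require('./tests/check-empty-cells');

// A hot-loaded test arrives as a string of source code instead, so it has
// to be turned into a callable at runtime. The node process would need an
// equivalent of this step (or the local tests converted into modules).
function loadTestFromString(source) {
  // The compiled function receives the parsed rows and column names.
  return new Function('rows', 'columns', source);
}

// Hypothetical usage:
var testSource = 'return rows.filter(function (r) { return !r.id; }).length;';
var countMissingIds = loadTestFromString(testSource);
// countMissingIds(rows, columns) -> number of rows with an empty "id" cell
```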

Another note re: 1000 rows: when we build in support for larger files, we should still always show the sample first, and then have UI that indicates the progress of processing the full dataset.
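
One possible shape for that, sketched only (callback names are hypothetical, and a browser File object with a `.size` property is assumed): parse the sample with `preview`, show its results immediately, then re-parse the full file with Papa Parse's `chunk` callback and use `meta.cursor` (the byte offset reached so far) to drive a progress bar.

```js
var Papa = require('papaparse');

// Run the quick sample first, then process the whole file while reporting
// progress. onProgress receives a fraction between 0 and 1 for the UI's
// progress bar; onDone receives the full set of rows for the test suite.
function processWithProgress(file, onSample, onProgress, onDone) {
  Papa.parse(file, {
    header: true,
    preview: 1024,
    complete: function (sample) {
      onSample(sample.data);             // show sample results right away

      var rows = [];
      Papa.parse(file, {
        header: true,
        chunk: function (results) {
          rows = rows.concat(results.data);
          onProgress(results.meta.cursor / file.size); // bytes read so far
        },
        complete: function () {
          onDone(rows);                  // full dataset finished
        }
      });
    }
  });
}
```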