18F / open-data-maker

make it easy to turn a lot of potentially large csv files into easily accessible open data

read CSV as stream, process rows in parallel chunks #295

Closed pkarman closed 8 years ago

pkarman commented 8 years ago

Based off #294 and #94

This PR cuts indexing time significantly on machines with multiple processors, at the expense of higher memory usage. To mitigate the additional memory that comes with parallel (forked) processes, CSV files are read as an IO stream, one row at a time, rather than being slurped entirely into memory and parsed as a String.
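
A minimal sketch of the general approach (not the PR's exact code): stream the CSV a row at a time and hand fixed-size chunks to forked worker processes. The NPROCS and CHUNK_SIZE names mirror the settings shown below; index_rows is a hypothetical stand-in for the real indexing step.

require 'csv'

NPROCS     = (ENV['NPROCS'] || 1).to_i
CHUNK_SIZE = (ENV['CHUNK_SIZE'] || 500).to_i

# placeholder for the real per-chunk indexing work
def index_rows(rows)
end

def import(path)
  chunk = []
  pids  = []
  CSV.foreach(path, headers: true) do |row|  # streamed, never slurped whole
    chunk << row
    next if chunk.size < CHUNK_SIZE
    pids << fork { index_rows(chunk) }       # each full chunk goes to a forked worker
    chunk = []
    Process.wait(pids.shift) if pids.size >= NPROCS  # cap concurrent workers
  end
  pids << fork { index_rows(chunk) } unless chunk.empty?
  pids.each { |pid| Process.wait(pid) }
end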

For reference, the current (single-process) import consumes about 250MB.

Example stats and usage:

# for 1000 rows from 4 files (using college-choice-data)

NPROCS=1 rake import
# 812 sec (memory 150MB)

NPROCS=4 CHUNK_SIZE=50 rake import
# 396 sec (memory 180MB x 4 = 720MB)

NPROCS=5 CHUNK_SIZE=100 rake import
# 268 sec (memory 230MB x 5 = 1150MB)

pkarman commented 8 years ago

An important caveat with reading CSV as a stream: the :force_utf8 feature does not work as currently implemented. My recommendation is to require that all CSV files be UTF-8 encoded prior to import, possibly with simple documentation or a script for converting encodings.
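
As an illustration of the kind of conversion helper the docs could point to, the sketch below re-encodes a file to UTF-8. The ISO-8859-1 source encoding is only an assumption and would need to match the actual file; iconv -f ISO-8859-1 -t UTF-8 does the same job from the shell.

# usage: ruby to_utf8.rb input.csv output.csv
src, dst = ARGV
File.open(src, 'r:ISO-8859-1') do |inp|   # assumed source encoding
  File.open(dst, 'w:UTF-8') do |out|
    inp.each_line { |line| out.write(line.encode('UTF-8')) }
  end
end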

pkarman commented 8 years ago

Going to open a new PR against a branch directly on this repo.