18F / open-data-maker

make it easy to turn a lot of potentially large csv files into easily accessible open data

Elasticsearch bulk index #297

Closed: ultrasaurus closed this 8 years ago

ultrasaurus commented 8 years ago

Re-implements #94; supersedes #295 and #296 (rebased on dev).

This PR cuts indexing time significantly on machines with multiple processors, at the expense of higher memory usage. To mitigate the additional memory that comes with parallel (forked) processes, CSV files are read via an IO stream, one row at a time, rather than being slurped entirely into memory and parsed as a String.
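
For illustration only, here is a minimal Ruby sketch of the row-at-a-time idea (not the PR's actual code; the index name, file path, and bulk_index helper are hypothetical): rows come off an IO stream and are shipped to Elasticsearch in fixed-size bulk requests, so no file is ever held in memory whole.

# Hypothetical sketch: stream a CSV one row at a time and index it in
# fixed-size bulk requests, instead of slurping the whole file into a String.
require 'csv'
require 'elasticsearch'

def bulk_index(client, index_name, csv_path, chunk_size = 100)
  batch = []
  CSV.foreach(csv_path, headers: true) do |row|   # reads via IO stream, row by row
    batch << { index: { _index: index_name, data: row.to_hash } }
    if batch.size >= chunk_size
      client.bulk(body: batch)                    # one bulk request per chunk
      batch.clear
    end
  end
  client.bulk(body: batch) unless batch.empty?    # flush the final partial chunk
end

client = Elasticsearch::Client.new                # defaults to localhost:9200
bulk_index(client, 'city-data', 'data/cities.csv', 50)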

For reference, the current single-process import consumes about 250MB.

Example stats and usage:

# for 1000 rows from 4 files (using college-choice-data)

NPROCS=1 rake import
# 812 sec (memory 150MB)

NPROCS=4 CHUNK_SIZE=50 rake import
# 396 sec (memory 180MB x 4 = 720MB)

NPROCS=5 CHUNK_SIZE=100 rake import
# 268 sec (memory 230MB x 5 = 1150MB)
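
A rough, hypothetical sketch of how the NPROCS and CHUNK_SIZE knobs could map onto forked workers (using the parallel gem; the data directory and index name are made up, and this is not the PR's actual code):

require 'csv'
require 'elasticsearch'
require 'parallel'

nprocs     = (ENV['NPROCS'] || 1).to_i
chunk_size = (ENV['CHUNK_SIZE'] || 100).to_i
files      = Dir.glob('data/*.csv')                 # hypothetical data directory

Parallel.each(files, in_processes: nprocs) do |path|
  client = Elasticsearch::Client.new                # one client per forked worker
  CSV.foreach(path, headers: true).each_slice(chunk_size) do |rows|
    body = rows.map { |row| { index: { _index: 'city-data', data: row.to_hash } } }
    client.bulk(body: body)                         # CHUNK_SIZE rows per bulk request
  end
end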

An important caveat with reading CSV as a stream: the :force_utf8 feature does not work as currently implemented. My recommendation is to require that all CSV files be UTF-8 encoded prior to import, possibly offering simple docs or a script to convert encodings.

ultrasaurus commented 8 years ago

@pkarman Agreed on the simple approach: update the docs to explain how to run iconv (or similar) to convert files to UTF-8.
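
For example, that doc could boil down to a one-liner (the source encoding here is just an assumed example and would vary per file):

iconv -f ISO-8859-1 -t UTF-8 original.csv > original-utf8.csv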