re-implements #94
supersedes #295 and #296 (rebased on dev)
This PR cuts indexing time significantly on machines with multiple processors, at the cost of higher memory usage. To offset the extra memory used by parallel (forked) processes, CSV files are now read as an IO stream, one row at a time, rather than being slurped entirely into memory and parsed as a String.
For reference, the current import consumes about 250MB in its single process.
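To illustrate the streaming-plus-fork idea, here is a minimal sketch. It is not the code in this PR: `each_chunk` and `import_parallel` are hypothetical names, the caller-supplied block stands in for whatever actually indexes the rows, and the `nprocs`/`chunk_size` arguments only mirror the `NPROCS` and `CHUNK_SIZE` variables used in the usage examples.

```ruby
require "csv"

# Hypothetical helper (not from this PR): reads the CSV one row at a time
# and yields arrays of at most chunk_size rows, so only one chunk is held
# in memory at any moment instead of the whole file.
def each_chunk(path, chunk_size)
  return enum_for(:each_chunk, path, chunk_size) unless block_given?

  CSV.foreach(path, headers: true).each_slice(chunk_size) do |rows|
    yield rows.map(&:to_h)
  end
end

# Hypothetical helper (not from this PR): hands each chunk to a forked
# child process, keeping at most nprocs children alive at once.
def import_parallel(path, nprocs:, chunk_size:, &worker)
  pids = []
  each_chunk(path, chunk_size) do |rows|
    # When the pool is full, block until any one child exits.
    pids.delete(Process.wait) if pids.size >= nprocs
    pids << fork { worker.call(rows) }
  end
  # Reap the children that are still running.
  pids.each { |pid| Process.wait(pid) }
end
```

Under these assumptions, something like `import_parallel("data.csv", nprocs: 4, chunk_size: 50) { |rows| index(rows) }` would correspond to `NPROCS=4 CHUNK_SIZE=50`, where `index` is a placeholder for the real indexing call. Each child inherits a copy of the parent's memory, which is consistent with total memory growing roughly in proportion to the number of processes.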
Example stats and usage:
# for 1000 rows from 4 files (using college-choice-data)
NPROCS=1 rake import
# 812 sec (memory 150MB)
NPROCS=4 CHUNK_SIZE=50 rake import
# 396 sec (memory 180MB x 4 = 720MB)
NPROCS=5 CHUNK_SIZE=100 rake import
# 268 sec (memory 230MB x 5 = 1150MB)
An important caveat with reading CSV as a stream: the :force_utf8 feature does not work as currently implemented. I recommend requiring all CSV files to be in UTF-8 before import, possibly with some simple docs or a script for converting encodings.
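A conversion script along those lines could be quite small. The sketch below is only an illustration: `convert_to_utf8` is a hypothetical name, and the default source encoding of ISO-8859-1 is an assumption that would need to match the actual files.

```ruby
# Hypothetical helper (not part of this PR): rewrites a file as UTF-8.
# The source encoding is an assumption and must be supplied correctly.
def convert_to_utf8(src, dest, from: "ISO-8859-1")
  # "r:EXTERNAL:INTERNAL" makes Ruby transcode each line as it is read,
  # so the file is converted line by line rather than loaded whole.
  File.open(src, "r:#{from}:UTF-8") do |input|
    File.open(dest, "w:UTF-8") do |output|
      input.each_line { |line| output.write(line) }
    end
  end
end
```

Running this once per file before `rake import` would keep encoding handling out of the streaming reader entirely.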