18F / open-data-maker

make it easy to turn a lot of potentially large csv files into easily accessible open data

parallel indexing, read CSV as stream #296

Closed pkarman closed 8 years ago

pkarman commented 8 years ago

supersedes #295

based on #294 so merge that first please

re-implements #94

This PR cuts indexing time significantly on machines with multiple processors, at the expense of more memory usage. To mitigate the additional memory use that comes with parallel (forked) processes, CSV files are read via an IO stream, one row at a time, rather than being slurped entirely into memory and parsed as a single String.
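The streaming approach can be sketched with Ruby's stdlib CSV; the sample file and data below are hypothetical stand-ins, not this repo's importer code:

```ruby
require 'csv'
require 'tempfile'

# A hypothetical sample file standing in for a large import CSV.
file = Tempfile.new(['sample', '.csv'])
file.write("id,name\n1,alpha\n2,beta\n")
file.close

# Slurping: the whole file becomes one String, then an Array of rows in memory.
all_rows = CSV.parse(File.read(file.path), headers: true)

# Streaming: CSV.foreach reads from an IO and yields one row at a time,
# so only the current row (plus IO buffers) is resident in memory.
streamed = []
CSV.foreach(file.path, headers: true) { |row| streamed << row['name'] }

file.unlink
```

For a multi-gigabyte file, the difference between the two is the difference between holding the whole file in RAM and holding one row.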

For reference, the current import consumes about 250MB for the single process.

Example stats and usage:

```
# for 1000 rows from 4 files (using college-choice-data)

NPROCS=1 rake import
# 812 sec (memory 150MB)

NPROCS=4 CHUNK_SIZE=50 rake import
# 396 sec (memory 180MB x 4 = 720MB)

NPROCS=5 CHUNK_SIZE=100 rake import
# 268 sec (memory 230MB x 5 = 1150MB)
```
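A minimal sketch of the forked, chunked pattern the NPROCS / CHUNK_SIZE knobs suggest; the row data and the per-chunk work are hypothetical stand-ins for the real indexing step:

```ruby
rows = (1..20).to_a   # stand-in for CSV rows
chunk_size = 5        # cf. CHUNK_SIZE
nprocs = 2            # cf. NPROCS
results = []

# Slice rows into chunks, then run up to nprocs chunks at a time,
# one forked child per chunk. Each child gets its own copy-on-write
# heap, which is why peak memory scales roughly with NPROCS.
rows.each_slice(chunk_size).each_slice(nprocs) do |batch|
  readers = batch.map do |chunk|
    reader, writer = IO.pipe
    fork do
      reader.close
      writer.puts(chunk.sum) # stand-in for the real per-chunk indexing
      writer.close
      exit!(0)
    end
    writer.close
    reader
  end
  readers.each { |r| results << r.read.to_i; r.close }
  Process.waitall
end

total = results.sum # sums from all chunks; 1..20 totals 210
```

Reading each child's result back over a pipe keeps the parent from needing any shared state.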

An important caveat with reading CSV as a stream: the :force_utf8 feature does not work as currently implemented. My recommendation is to require that all CSV files be UTF-8 encoded prior to import, possibly offering some simple docs or a script to convert encodings.

pkarman commented 8 years ago

@ultrasaurus I think the main obstacle here is the lack of :force_utf8 and whether this project needs to remain committed to that fix-up feature. Dealing with character encodings can be daunting for some folks, and I am not sure the target audience for this project can or should be expected to understand enough about encodings to fix their CSV data before using this code. I would be happy to add a documentation patch to this PR detailing how to check for and convert non-UTF-8 encodings, if you think that is wise.
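For illustration, a check-and-convert pass along those lines might look like the sketch below. The Windows-1252 fallback is an assumption (a common encoding for US data exports), not something the project prescribes, and `ensure_utf8` is a hypothetical helper name:

```ruby
# Given raw bytes, return a valid UTF-8 String. If the bytes are not
# already valid UTF-8, assume Windows-1252 and transcode (assumption!).
def ensure_utf8(bytes)
  text = bytes.dup.force_encoding('UTF-8')
  return text if text.valid_encoding?
  bytes.dup.force_encoding('Windows-1252').encode('UTF-8')
end

utf8_bytes   = "caf\u00e9".b # valid UTF-8, read as binary
cp1252_bytes = "caf\xE9".b   # the same word in Windows-1252

ensure_utf8(utf8_bytes)   # => "café"
ensure_utf8(cp1252_bytes) # => "café"
```

A docs patch could equally point users at command-line tools, but the relabel-then-transcode distinction above is the part people usually trip over.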

ultrasaurus commented 8 years ago

hey @pkarman I rebased this and am testing now

ultrasaurus commented 8 years ago

I think it would be fine to say we have to run some tool on the files ahead of time to make sure they are OK. But right now with the code, I'm getting this error:

```
read_from_s3 failed: ed-college-choice-dev MERGED2013_PP.csv with Encoding::UndefinedConversionError: "\xEF" from ASCII-8BIT to UTF-8
```
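For context, `\xEF` is the first byte of the UTF-8 byte order mark (`EF BB BF`), and `String#encode` raises exactly this error when bytes tagged ASCII-8BIT (binary) are transcoded to UTF-8. A standalone sketch of the failure and one possible fix (not this repo's `read_from_s3` code):

```ruby
# Bytes fetched from S3 arrive tagged ASCII-8BIT; this sample starts
# with the UTF-8 byte order mark, like the failing CSV above.
data = "\xEF\xBB\xBFid,name\n".b

msg = nil
begin
  data.encode('UTF-8') # transcoding binary -> UTF-8 raises on \xEF
rescue Encoding::UndefinedConversionError => e
  msg = e.message # the same "\xEF" from ASCII-8BIT to UTF-8 error
end

# One fix: relabel the bytes as UTF-8 instead of transcoding them,
# then strip the leading BOM character.
fixed = data.force_encoding('UTF-8').sub(/\A\uFEFF/, '')
```

This only works when the underlying bytes really are UTF-8; genuinely non-UTF-8 files still need conversion ahead of time.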

I will push the rebased branch

ultrasaurus commented 8 years ago

@pkarman let's move conversation over to: https://github.com/18F/open-data-maker/pull/297