pkarman closed this 8 years ago
@ultrasaurus I think the main obstacle here is the lack of :force_utf8, and whether this project needs to remain committed to that fix-up feature. I know that contemplating encodings can be prohibitive for some folks, and I'm not sure the target audience for this project can or should be expected to understand enough about encodings to fix their CSV data before using this code. If you think it wise, I'd be happy to add a documentation patch to this PR detailing how to check for and convert non-UTF-8 encodings.
Hey @pkarman, I rebased this and am testing now.
I think it would be fine to say we have to run some tool on the files ahead of time to make sure they are OK. But right now with the code, I'm getting this error:

`read_from_s3 failed: ed-college-choice-dev MERGED2013_PP.csv with Encoding::UndefinedConversionError: "\xEF" from ASCII-8BIT to UTF-8`
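For context, that error class comes from calling String#encode on binary-tagged data: `"\xEF"` is the first byte of a UTF-8 BOM, which can't be transcoded from ASCII-8BIT. A minimal sketch reproducing the behavior (the sample data is hypothetical, not this PR's code):

```ruby
# Bytes read in binary mode are tagged ASCII-8BIT; String#encode cannot
# map bytes >= 0x80 from that encoding to UTF-8, so it raises.
data = "\xEF\xBB\xBFunitid,instnm".dup.force_encoding(Encoding::ASCII_8BIT)

begin
  data.encode(Encoding::UTF_8)
rescue Encoding::UndefinedConversionError => e
  puts e.message  # "\xEF" from ASCII-8BIT to UTF-8
end

# If the bytes are in fact already valid UTF-8 (as a BOM is), re-tagging
# the string -- rather than transcoding it -- avoids the error:
utf8 = data.force_encoding(Encoding::UTF_8)
puts utf8.valid_encoding?  # true
```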
I will push the rebased branch
@pkarman let's move conversation over to: https://github.com/18F/open-data-maker/pull/297
supersedes #295
based on #294, so please merge that first
re-implements #94
This PR cuts indexing time significantly when run on a machine with multiple processors, at the expense of more memory usage. To mitigate the additional memory use that comes with parallel (forked) processes, CSV files are read via an IO stream, one row at a time, rather than being slurped entirely into memory and parsed as a single String.
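The row-at-a-time approach described above is essentially what Ruby's stdlib `CSV.foreach` provides. A self-contained sketch (the sample file and its contents are hypothetical):

```ruby
require 'csv'
require 'tempfile'

# Build a small sample file so the sketch is runnable on its own.
file = Tempfile.new(['sample', '.csv'])
file.write("name,value\nalpha,1\nbeta,2\n")
file.close

# CSV.foreach parses and yields one row at a time, so memory use stays
# roughly constant regardless of file size -- unlike
# CSV.parse(File.read(path)), which slurps the whole file first.
rows = []
CSV.foreach(file.path, headers: true) do |row|
  rows << row.to_h
end

puts rows.length         # 2
puts rows.first['name']  # alpha
```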
For reference, the current import consumes about 250MB in a single process.
Example stats and usage:
An important caveat with reading CSV as a stream: the :force_utf8 feature does not work as currently implemented. My recommendation is to require that all CSV files be in UTF-8 prior to import, possibly offering simple docs or a script to convert encodings.