PalmBeachPost / postgeo

Geocode CSVs and jitter overlapping points
MIT License

Consider subprocesses to vastly speed up geocoding #25

Open stucka opened 7 years ago

stucka commented 7 years ago

If we're paying for geocoding, we should be able to get more results faster. Should.

This depends on the cache bug being fixed, among other things, in case we have a crash. The robustness of the crash protection will need to be tested.

stucka commented 7 years ago

Branch created; performance looks to be about 10x. Still need to:
-- Clean up.
-- Set file handle flush and disk writes to happen maybe every 100 rows or so, to decrease I/O lag. Grab both at the same time.
-- Fix creds.py to default to the current directory.
-- Test, test, test, test. The 3.6 compatibility fixes may have broken 2.7. Still not sure where this leaves us with the similar issues on https://github.com/PalmBeachPost/postgeo/issues/7 on the parallel project, which I thought was a Mac difference and may really be a Python 3 v. Python 2 difference.
-- Testing specifics include: Do we see the same effects in Python 2 as in Python 3? Does it work from scratch in Linux and Windows? Can we find someone with a Mac?
-- Seek testers from PythonJournos, maybe?
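For the flush item, a minimal sketch of batching the flush and the disk sync, assuming Python 3. The function name `write_rows_batched` and the `flush_every` parameter are made up for illustration and are not from the postgeo code:

```python
import csv
import os


def write_rows_batched(path, rows, flush_every=100):
    """Write rows to a CSV, flushing and syncing every `flush_every` rows.

    Hypothetical helper, not part of postgeo. Returns the number of rows written.
    """
    written = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for row in rows:
            writer.writerow(row)
            written += 1
            if written % flush_every == 0:
                handle.flush()             # push Python's buffer to the OS
                os.fsync(handle.fileno())  # ask the OS to commit it to disk
        # Final flush and sync so a partial last batch is not left in a buffer.
        handle.flush()
        os.fsync(handle.fileno())
    return written
```

With this shape, a crash loses at most the last `flush_every` rows, which is the trade-off against the I/O lag of syncing on every row.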

stucka commented 7 years ago

Shared variables are probably going to need locks for safety: http://effbot.org/zone/thread-synchronization.htm
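A rough sketch of what the locking could look like, assuming worker threads rather than separate processes. `SafeCounter` and `count_in_threads` are hypothetical names for illustration; the real shared state in postgeo may look quite different:

```python
import threading


class SafeCounter:
    """A counter that several threads can increment safely (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment(self):
        # The lock makes the read-modify-write a single step for other threads.
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def count_in_threads(n_threads=8, per_thread=1000):
    """Have n_threads each increment a shared counter per_thread times."""
    counter = SafeCounter()

    def work():
        for _ in range(per_thread):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value
```

The same `with lock:` pattern would apply to anything the workers share, such as a rows-processed count or an in-memory cache dict.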

Instead of writing to the -geo and cache files directly with csvwriter, we probably ought to add the rows to a Queue.Queue (queue.Queue in Python 3) and dump them to the CSVs periodically, maybe every couple hundred rows, then flush and force the disk writes.
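One way the queue idea might look, as a sketch only, assuming Python 3 and a single writer thread that owns the file. `writer_loop`, `run_demo`, `_DONE` and the batch size of 200 are all made up for illustration:

```python
import csv
import os
import queue
import threading

_DONE = object()  # sentinel telling the writer thread to stop


def writer_loop(row_queue, path, batch_size=200):
    """Pull rows off the queue and dump them to the CSV in batches."""
    batch = []
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)

        def dump():
            writer.writerows(batch)
            handle.flush()             # push Python's buffer to the OS
            os.fsync(handle.fileno())  # ask the OS to commit it to disk
            batch.clear()

        while True:
            item = row_queue.get()
            if item is _DONE:
                break
            batch.append(item)
            if len(batch) >= batch_size:
                dump()
        if batch:
            dump()  # write whatever is left over at shutdown


def run_demo(path, n_rows=450):
    """Feed n_rows fake rows through the queue; real workers would geocode."""
    row_queue = queue.Queue()
    writer_thread = threading.Thread(target=writer_loop, args=(row_queue, path))
    writer_thread.start()
    for i in range(n_rows):
        row_queue.put([i, "address %d" % i])
    row_queue.put(_DONE)
    writer_thread.join()
```

Because only the writer thread touches the file, the workers never need a lock around the CSV itself; `queue.Queue` handles its own locking.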

Weeeeee.