datakind / new-america-housing-loss-public


Efficiency enhancement for zip-to-tract lookup #16

Closed manusharma50 closed 2 years ago

manusharma50 commented 2 years ago

This commit substantially improves the efficiency of the probabilistic zip-to-tract lookup. The lookup uses the HUD crosswalk API to get a list of all GEOIDs contained in or overlapping a given zipcode, then generates a random number to select among these candidates, weighting each GEOID by the proportion of the zipcode's addresses that lie in it.
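For illustration, the weighted selection step might look like the sketch below. It is not the repository's code: the function name `pick_geoid` and the `(geoid, ratio)` pair format are assumptions made for this example, and the real API response would need to be reshaped into that form first.

```python
import random

def pick_geoid(candidates, rng=random):
    """Pick one GEOID at random, weighted by address share.

    `candidates` is assumed to be a list of (geoid, ratio) pairs, where
    ratio is the share of the zipcode's addresses in that GEOID.
    """
    geoids = [geoid for geoid, _ in candidates]
    weights = [ratio for _, ratio in candidates]
    # random.choices performs the weighted draw; k=1 returns a 1-item list.
    return rng.choices(geoids, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, which may help when writing a unit test for this step.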

The original code called the API and performed this calculation for every dataset row that the Census Geocoder could not geocode. As a result, each unique zipcode could be looked up more than once (and sometimes WAY more than once), and the API call overhead made the process very slow.

This enhancement looks up only the unique zipcodes among the dataset rows missing GEOIDs and stores the API results for each zipcode in a dictionary. The GEOID selection calculation is still done once per row, to maintain the probabilistic assignment of GEOIDs within a zipcode, but because the repeated API calls are eliminated, the efficiency gain is dramatic (on the order of 10-20x faster).
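A minimal sketch of this caching pattern follows. The names are hypothetical and not taken from the commit: `fetch_crosswalk` stands in for whatever function wraps the HUD crosswalk API call, and it is assumed to return a list of `(geoid, ratio)` pairs for a zipcode.

```python
import random

def assign_geoids(zipcodes, fetch_crosswalk, rng=random):
    """Assign a GEOID to each zipcode with one API lookup per unique zipcode.

    `fetch_crosswalk` is a hypothetical callable wrapping the API request;
    it is assumed to return a list of (geoid, ratio) pairs.
    """
    # One lookup per unique zipcode, stored in a dictionary.
    cache = {z: fetch_crosswalk(z) for z in set(zipcodes)}

    assigned = []
    for z in zipcodes:
        candidates = cache[z]
        if not candidates:
            # No crosswalk entries for this zipcode: leave it unassigned.
            assigned.append(None)
            continue
        geoids = [geoid for geoid, _ in candidates]
        weights = [ratio for _, ratio in candidates]
        # The weighted draw is repeated per row so that rows sharing a
        # zipcode can still land in different GEOIDs.
        assigned.append(rng.choices(geoids, weights=weights, k=1)[0])
    return assigned
```

With this shape, the number of API calls depends on the number of distinct zipcodes rather than the number of rows, which is where the speedup described above would come from.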

TODO: Need to update the unit test for this piece of functionality.