datakind / new-america-housing-loss-public

MIT License

Zip-To-Census-Tract lookup looks up same zip multiple times #14

Status: Closed (closed by jtanwk 2 years ago)

jtanwk commented 2 years ago

Hello! When I ran the FEAT tool recently, I noticed the Zip-To-Census-Tract lookup repeated the lookup process for the same zip codes multiple times (truncated output below):

Using Zip-To-Census-Tract lookup for additional geocoding...
    ⌦  ERROR in Zip-Tract mapping for zip 32898, Status code 404
    ⌦  ERROR in Zip-Tract mapping for zip 00000, Status code 404
    ⌦  ERROR in Zip-Tract mapping for zip 00000, Status code 404
    ⌦  ERROR in Zip-Tract mapping for zip 32010, Status code 404
    ⌦  ERROR in Zip-Tract mapping for zip 32879, Status code 404
    ⌦  ERROR in Zip-Tract mapping for zip 32834, Status code 404
    ⌦  ERROR in Zip-Tract mapping for zip 00000, Status code 404
    ⌦  ERROR in Zip-Tract mapping for zip 32705, Status code 404
    ⌦  ERROR in Zip-Tract mapping for zip 32705, Status code 404
    ⌦  ERROR in Zip-Tract mapping for zip 33822, Status code 404
    ⌦  ERROR in Zip-Tract mapping for zip 33822, Status code 404
        ...

Each lookup takes a while (at least a few minutes per zip code). From reading append_zip_to_tract_data(), it looks like the lookup runs once for every row without a GEOID, so a zip code that appears in several rows is requested several times.

It might be more efficient if we got a list of all unique zip codes from those rows and only looked each up once, then merged the result back afterwards.
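
A minimal sketch of that idea, in plain Python rather than the tool's actual code. The names here are assumptions for illustration only: `lookup_tract` is a hypothetical stand-in for the per-zip request, and rows are assumed to be dicts with `zip` and `GEOID` keys.

```python
def fill_geoids_by_unique_zip(rows, lookup_tract):
    """Look up each distinct zip once, then copy the result back onto the rows.

    `lookup_tract` is a hypothetical callable: zip code -> tract ID, or None
    when the lookup fails (for example on a 404).
    """
    # Collect distinct zips only from rows that still lack a GEOID.
    missing_zips = {row["zip"] for row in rows if not row.get("GEOID")}

    # One lookup per distinct zip. Failed lookups (None) are stored too,
    # so a bad zip such as 00000 is requested only once.
    zip_to_tract = {z: lookup_tract(z) for z in sorted(missing_zips)}

    # Merge the results back; rows that already have a GEOID are left alone.
    return [
        row if row.get("GEOID") else {**row, "GEOID": zip_to_tract[row["zip"]]}
        for row in rows
    ]
```

With the truncated output above, this would issue one request for 00000 instead of three, and one each for 32705 and 33822 instead of two.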

manusharma50 commented 2 years ago

Hi Jonathan,

Thanks for the feedback. Yes, you are correct: we could definitely make it more efficient, even though it works fine for most use cases (the amount of data geocoded this way per city/county has not been huge so far). Also, since the assignment of census tract by zip code is probabilistic rather than deterministic, we'll have to rejigger some other parts of the code. Again, that is not complex by any means, and it is likely something that can be addressed in the next release of the tool (New America and/or DataKind can shed more light on when that might be).
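
To illustrate why a probabilistic assignment needs a bit more than a plain merge, here is a rough sketch, not the tool's code. It assumes a hypothetical `lookup_candidates` callable that returns a list of `(tract, weight)` pairs for a zip code; how the tool actually weights tracts is not stated in this thread. The request is cached per zip, while the random draw still happens per row.

```python
import random


def assign_tracts(zips, lookup_candidates, seed=None):
    """Assign one tract per row, requesting each distinct zip only once.

    `lookup_candidates` is a hypothetical callable: zip code -> list of
    (tract, weight) pairs, or an empty list when the lookup fails.
    """
    rng = random.Random(seed)  # seeded so that runs can be reproduced
    cache = {}                 # zip -> candidate list, filled lazily
    assigned = []
    for zip_code in zips:
        if zip_code not in cache:
            cache[zip_code] = lookup_candidates(zip_code)  # one request per zip
        candidates = cache[zip_code]
        if not candidates:
            assigned.append(None)  # failed lookup: leave the row ungeocoded
            continue
        tracts = [tract for tract, _ in candidates]
        weights = [weight for _, weight in candidates]
        # Draw per row, so rows sharing a zip can land in different tracts.
        assigned.append(rng.choices(tracts, weights=weights, k=1)[0])
    return assigned
```

The design point is that only the expensive part (the request) is deduplicated; the sampling step stays row-level, so the distribution of rows across tracts is unchanged.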

jtanwk commented 2 years ago

@manusharma50 thanks for the quick response! Yes, this is definitely in the nice-to-have category with respect to urgency. But anything that reduces processing time on a local machine is a win in my book.

manusharma50 commented 2 years ago

Ok my friend @jtanwk, your wish has been granted, as per this PR: https://github.com/datakind/new-america-housing-loss-public/pull/16. :)

jtanwk commented 2 years ago

@manusharma50 super exciting! My DK project team will be very excited by this.