Refactor the geocode utility to use a disk caching utility

cmu-delphi / covidcast-indicators

Back end for producing indicators and loading them into the COVIDcast API.

https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html

MIT License

Refactor the geocode utility to use a disk caching utility #282

Closed dshemetov closed 3 years ago

dshemetov commented 4 years ago

We load crosswalk files on demand in the geocode utility because they can be large and slow to load. This functionality could likely be outsourced to a memoize-to-disk library like joblib, which would remove code complexity such as manually tracking the filenames of the crosswalk files.

This could also simplify adding fast crosswalks: for example, to convert from jhu_uid to zip, we currently need to crosswalk jhu_uid to fips and then fips to zip, whereas we could store the jhu_uid to zip mapping directly after computing it once.
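A minimal sketch of what this might look like with joblib's `Memory` class. The function names, the cache directory, and the column names (`jhu_uid`, `fips`, `zip`) are assumptions for illustration, not the utility's actual code, and real crosswalks may carry extra columns (such as weights) that this ignores.

```python
import tempfile
from pathlib import Path

import pandas as pd
from joblib import Memory

# Hypothetical cache directory; a real utility would pick a stable location.
CACHE_DIR = Path(tempfile.mkdtemp(prefix="geo_cache_"))
memory = Memory(location=str(CACHE_DIR), verbose=0)


@memory.cache
def load_crosswalk(csv_path: str) -> pd.DataFrame:
    """Read one crosswalk CSV; codes are kept as strings to preserve leading zeros."""
    return pd.read_csv(csv_path, dtype=str)


@memory.cache
def jhu_uid_to_zip(jhu_fips_path: str, fips_zip_path: str) -> pd.DataFrame:
    """Compose jhu_uid -> fips -> zip once; repeat calls read the cached result from disk."""
    jhu_fips = load_crosswalk(jhu_fips_path)
    fips_zip = load_crosswalk(fips_zip_path)
    merged = jhu_fips.merge(fips_zip, on="fips")
    return merged[["jhu_uid", "zip"]].drop_duplicates()
```

One caveat with this approach: joblib keys the cache on the function arguments (here, the path strings), not on the file contents, so the cache would need to be cleared if a crosswalk file changes.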

chinandrew commented 3 years ago

I looked at this over the weekend, and the entire data/ folder is 6.5mb (and once loaded into memory should be a bit less). Is there a good reason not to just load everything into memory on initialization and cut out a ton of complexity from the module?

dshemetov commented 3 years ago

> I looked at this over the weekend, and the entire data/ folder is 6.5mb (and once loaded into memory should be a bit less). Is there a good reason not to just load everything into memory on initialization and cut out a ton of complexity from the module?

@chinandrew ah, tbh, I inherited the delayed loading functionality and didn't think to test whether it was really needed. Since it's just 6.5mb, that should be quickly loadable for sure!
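For illustration, loading everything on initialization could be as simple as the sketch below. The class name, the `get_crosswalk` method, and the `<from>_<to>.csv` filename convention are hypothetical and not taken from the module.

```python
from pathlib import Path

import pandas as pd


class EagerGeoMapper:
    """Hypothetical mapper that reads every crosswalk CSV up front."""

    def __init__(self, data_dir: str):
        # Assumed convention: one CSV per crosswalk, named "<from>_<to>.csv".
        self.crosswalks = {
            path.stem: pd.read_csv(path, dtype=str)
            for path in sorted(Path(data_dir).glob("*.csv"))
        }

    def get_crosswalk(self, from_code: str, to_code: str) -> pd.DataFrame:
        """Return the preloaded table; raises KeyError if no such crosswalk exists."""
        return self.crosswalks[f"{from_code}_{to_code}"]
```

With this shape there is no per-file bookkeeping: the dictionary of tables replaces the lazy-loading logic, at the cost of reading tables that a given indicator may never use.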

chinandrew commented 3 years ago

Turns out it actually becomes bigger in memory due to pandas weirdness, but I think the difference between loading them all and loading only 3-4 for the usual geos (basically the addition of the zips for non-zip indicators) is still in the tens of MB.
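A rough way to check this kind of difference is to compare the CSV's size in bytes with `DataFrame.memory_usage(deep=True)`. The helper name and the synthetic rows below are made up for illustration; they are not real crosswalk data, and the exact numbers depend on the pandas version and string dtype.

```python
import io

import pandas as pd


def csv_vs_memory_bytes(csv_text: str) -> tuple[int, int]:
    """Return (size of the CSV text in bytes, size of the loaded DataFrame in bytes)."""
    csv_bytes = len(csv_text.encode("utf-8"))
    frame = pd.read_csv(io.StringIO(csv_text), dtype=str)
    # deep=True counts the string payloads, not just the column pointers.
    memory_bytes = int(frame.memory_usage(deep=True).sum())
    return csv_bytes, memory_bytes


# Synthetic five-digit codes standing in for fips and zip values.
rows = "\n".join(f"{i:05d},{i + 30000:05d}" for i in range(1000))
csv_bytes, memory_bytes = csv_vs_memory_bytes("fips,zip\n" + rows + "\n")
print(f"CSV: {csv_bytes} bytes, in memory: {memory_bytes} bytes")
```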

dshemetov commented 3 years ago

Oh, interesting about pandas. But yeah, memory size isn't as much of a concern as speed, and even the speed seems fast enough. The main time loss will be in loading the big tables for indicators that don't need them, but loading a 1.4 MB CSV is essentially instant, so I can get on board with prioritizing clarity.