Closed by dshemetov 3 years ago
I looked at this over the weekend, and the entire data/ folder is 6.5 MB (and once loaded into memory should be a bit less). Is there a good reason not to just load everything into memory on initialization and cut a ton of complexity out of the module?
@chinandrew ah, tbh, I inherited the delayed loading functionality and didn't think to test whether it was really needed. Since it's just 6.5 MB, that should load quickly for sure!
Turns out it actually becomes bigger in memory due to pandas overhead, but I think the difference between loading them all and loading only 3-4 for the usual geos (basically the addition of the zips for non-zip indicators) is still in the tens of MB.
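For anyone who wants to check the on-disk versus in-memory gap themselves, here is a rough sketch. The table is a made-up stand-in for a crosswalk file, not real data, and reading the codes as Python strings (`dtype=object`) is an assumption about how such a file would be loaded.

```python
import io

import pandas as pd

# Hypothetical stand-in for a crosswalk file: two string columns, 1,000 rows.
csv_text = (
    "zip,fips\n"
    + "\n".join(f"{i:05d},{i % 100:05d}" for i in range(1000))
    + "\n"
)

# Read the codes as Python strings (object dtype) so leading zeros survive.
df = pd.read_csv(io.StringIO(csv_text), dtype=object)

# Size of the CSV text versus the DataFrame's footprint, both in bytes.
on_disk = len(csv_text.encode("utf-8"))
in_memory = int(df.memory_usage(deep=True).sum())
print(f"CSV bytes: {on_disk}, DataFrame bytes: {in_memory}")
```

`memory_usage(deep=True)` counts the Python string objects themselves, which is usually where the extra size comes from for string-heavy tables.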
Oh, interesting about pandas. But yeah, memory size isn't as much of a concern as speed, and even the speed seems fast enough. The main time loss will be in loading the big tables for indicators that don't need them, but loading a 1.4 MB CSV is essentially instant, so I can get on board with prioritizing clarity.
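A minimal sketch of what the "load everything on initialization" approach could look like. `EagerGeoMapper`, `load_all_crosswalks`, and the one-CSV-per-crosswalk layout are hypothetical names and assumptions for illustration, not the module's actual API.

```python
from pathlib import Path

import pandas as pd


def load_all_crosswalks(data_dir):
    """Read every CSV in data_dir into a dict keyed by file stem."""
    return {
        path.stem: pd.read_csv(path, dtype=str)
        for path in sorted(Path(data_dir).glob("*.csv"))
    }


class EagerGeoMapper:
    """Hypothetical mapper that loads all crosswalk tables at construction."""

    def __init__(self, data_dir):
        # No per-file bookkeeping: everything is read once, up front.
        self.crosswalks = load_all_crosswalks(data_dir)

    def get_crosswalk(self, name):
        """Return an already-loaded table; raises KeyError if it is missing."""
        return self.crosswalks[name]
```

The tradeoff is the one discussed above: a slightly slower constructor in exchange for dropping the lazy-loading state.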
We load crosswalk files on demand in the geocode utility because they can be large and slow to load. This functionality could likely be outsourced to a memoize-to-disk library like joblib, which would remove code complexity such as manually tracking the filenames of the crosswalk files.
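As a rough sketch of the joblib idea, using `joblib.Memory` and its `cache` decorator. The `load_crosswalk` function and the temporary cache directory are assumptions for illustration; a real module would pick a stable cache location.

```python
import tempfile

import pandas as pd
from joblib import Memory

# Cache directory is an assumption; a temp dir keeps the sketch self-contained.
memory = Memory(tempfile.mkdtemp(), verbose=0)


@memory.cache
def load_crosswalk(path):
    """Read one crosswalk CSV; joblib stores the result on disk, keyed on `path`."""
    return pd.read_csv(path, dtype=str)
```

One caveat: the cache is keyed on the function's arguments (and its code), not on the file's contents, so a changed CSV at the same path would need the cache cleared.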
This could also simplify extending to more fast crosswalks: for example, to convert from jhu_uid to zip, we currently need to crosswalk jhu_uid to fips and then fips to zip, whereas we could store the jhu_uid to zip mapping directly after computing it once.
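The two-step conversion can be sketched as a single join whose result is then stored. `compose_crosswalks` is a hypothetical helper, the rows are made up, and weight columns (if the real crosswalks carry them) are ignored here.

```python
import pandas as pd


def compose_crosswalks(first, second, via):
    """Join two crosswalks on the shared `via` column, then drop that column."""
    return first.merge(second, on=via, how="inner").drop(columns=via)


# Made-up rows for illustration only.
jhu_to_fips = pd.DataFrame(
    {"jhu_uid": ["A1", "A2"], "fips": ["01001", "01003"]}
)
fips_to_zip = pd.DataFrame(
    {"fips": ["01001", "01001", "01003"], "zip": ["36003", "36006", "36507"]}
)

# Direct jhu_uid -> zip table, computed once and reusable afterwards.
jhu_to_zip = compose_crosswalks(jhu_to_fips, fips_to_zip, via="fips")
```

Wrapping a function like this in a disk cache would give the "compute once, then read directly" behavior described above.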