Overview

The analysis already looks on S3 for cached census data files (jobs and blocks), but there's no facility for actually filling that cache. This change makes the analysis scripts that download and use those files also push them up to the S3 cache so they'll be there for the next analysis run. The files don't change (new versions would be released as new files, with a newer year in the filenames), so we don't need to worry about rotating or invalidating the cache.
This is in place of PR #809, which has the advantage of providing a way to load the cache with all the files for all states, but the disadvantage of adding a lot more code--two Python scripts that are similar but not the same, plus two driver scripts. This PR achieves the same end as far as what the analysis does--using cached files if they're there and uploading them if they're not--within the existing bash scripts.
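The cache-then-upload pattern described above could be sketched roughly like this. This is a hedged illustration, not the actual code from the bash scripts: the function name, the `PFB_S3_STORAGE_BUCKET` variable, and the key layout are placeholders.

```shell
# Illustrative sketch of the caching pattern; identifiers are hypothetical.
set -euo pipefail

# Fetch a census file, preferring the S3 cache. On a cache miss, download
# from the source and upload the result so the next run finds it cached.
fetch_census_file() {
    local filename="$1"    # e.g. a jobs (LODES) or blocks archive name
    local source_url="$2"  # the census.gov download URL
    # Hypothetical bucket variable and key prefix:
    local cache_uri="s3://${PFB_S3_STORAGE_BUCKET}/data/${filename}"

    if aws s3 ls "${cache_uri}" > /dev/null 2>&1; then
        echo "Using cached ${filename} from S3"
        aws s3 cp "${cache_uri}" "${filename}"
    else
        echo "Downloading ${filename} from census.gov"
        curl -sSfL -o "${filename}" "${source_url}"
        # Push to the cache for future runs. The files are immutable
        # (new census years get new filenames), so no invalidation needed.
        aws s3 cp "${filename}" "${cache_uri}"
    fi
}
```

The key property is that a cache miss repairs itself: the first analysis run in a state pays the census.gov download cost once, and every later run reads from S3.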
Just caching the files after first use will make a big difference to the rate of analysis failures from census.gov errors, but it would leave the door open for the first few runs in any given state to have problems. So there's value in preloading the cache. But not necessarily enough value to justify writing another script to do it. So I took what seemed like the path of least resistance, even though it's an odd road: I ran the scripts from PR #809 (scripts/run-blocks-cache-update and scripts/run-lodes-cache-update), which saved all the files to my development S3 bucket, then I downloaded them from there and uploaded them to the production bucket. So the production cache is loaded, but if we want to do it again someday we'll have to either check out the branch with those scripts (https://github.com/azavea/pfb-network-connectivity/tree/feature/kak/cache-census-s3%23786) or write a new script to do it.
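The one-off preload described above (pull everything down from the development bucket, push it up to production) amounts to something like the following sketch. Profile names, bucket names, and the staging directory are placeholders, not the actual values used:

```shell
# Illustrative sketch of the manual cache preload; names are hypothetical.
preload_production_cache() {
    local staging_dir="./census-cache"
    mkdir -p "${staging_dir}"
    # Pull every cached census file down from the development bucket...
    aws --profile pfb-dev s3 sync "s3://DEV-pfb-storage-us-east-1/data/" "${staging_dir}/"
    # ...then push the lot up to the production bucket.
    aws --profile pfb s3 sync "${staging_dir}/" "s3://PROD-pfb-storage-us-east-1/data/"
}
```

A direct bucket-to-bucket `aws s3 sync` would also work if the same credentials can read the dev bucket and write the production one; the local staging step just avoids cross-account permission setup.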
Resolves #786
Testing Instructions
Check that your cache directory is empty: `aws --profile pfb s3 ls s3://USERNAME-pfb-storage-us-east-1/data/` (or at least see what's in it so you'll know what changes)
Run an analysis
You can do a small one, or you can cancel it once it's done with the jobs and blocks import steps, to avoid tying up resources with a long analysis run.
If you do a South Dakota or Alaska one, you'll see the "fall back to 2016" behavior for the LODES files.
Check that the three files used by that analysis are now in your cache directory on S3.
Run another analysis for the same state (same neighborhood is fine) and confirm that it gets the Census files from S3 rather than downloading them again from the source.