The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
We have several repositories that download the PUDL DB from the nightly build outputs for use in CI. Currently they have no way of knowing whether they actually need to download the DB. We can pull from the AWS Open Data buckets and that makes it free, but it would be faster just to not download if we don't need to (especially as the DB grows).
If the nightly builds generated a couple of caching keys alongside the DB, this would be easy to do. Maybe just a couple of text files or a YAML file containing:
The commit hash associated with the nightly build (would allow the cached DB to update only if the commit has changed)
The sha256 hash of pudl.sqlite (would allow the cached DB to update any time the DB has changed at all, which could happen due to changes in our code or in our dependencies)
This could be done by adding some shell commands to our gcp_pudl_etl.sh script that's run in the Docker container for the nightly builds (and maybe also the local_pudl_etl.sh script for testing / development).
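As a rough illustration of the idea, here is a minimal Python sketch of both halves: writing the caching keys next to the DB, and checking them before downloading. It is not what the nightly build does today. The function names (`sha256_of_file`, `write_cache_keys`, `needs_download`) and the key file layout are hypothetical, and it writes JSON rather than the text or YAML files suggested above only so the example stays within the standard library. The same thing could be done with a few shell commands in gcp_pudl_etl.sh.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large database is never read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_cache_keys(db_path: Path, commit: str, out_path: Path) -> dict:
    """Write the two proposed caching keys alongside the DB (hypothetical layout).

    `commit` is passed in by the caller; in a build script it could come from
    `git rev-parse HEAD`.
    """
    keys = {"commit": commit, "sha256": sha256_of_file(db_path)}
    out_path.write_text(json.dumps(keys, indent=2) + "\n")
    return keys


def needs_download(remote_keys: dict, cached_keys_path: Path) -> bool:
    """Return True if there is no cached key file or the DB hash has changed."""
    if not cached_keys_path.exists():
        return True
    cached = json.loads(cached_keys_path.read_text())
    return cached.get("sha256") != remote_keys.get("sha256")
```

Comparing on the sha256 key picks up any change to the DB, including changes caused by dependency updates. Comparing on the commit key instead would refresh the cache only when the code changes.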
Then we would need to modify the caching step in the tox-pytest workflows in the repositories that download the nightly PUDL DB outputs to use these caching keys to determine whether a new DB should be downloaded. Right now these repos include at least: