catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Prioritize GCS Cache over Zenodo API during CI #1679

Closed by zaneselvans 2 years ago

zaneselvans commented 2 years ago

Direct access to the Zenodo API is pretty slow and flaky. We have a cache of the raw data stored on GCS, and the Datastore can prioritize accessing that instead, while creating a local cache on the GitHub runner, which will stick around for 2 weeks. We may be able to make that cache last longer too. Switching to the GCS cached version of our raw inputs would incur a small data egress expense whenever the GitHub runner cache expires and has to be repopulated, but it would make our tests much less likely to fail because of fragile external APIs.

In combination with the EIA API issues outlined in #1343 these Zenodo API failures are now resulting in CI failing the majority of the time, without anything being wrong with our code, which seriously degrades the usefulness of running CI.

Tasks

zaneselvans commented 2 years ago

As far as I can tell from the GitHub Actions caching documentation, the cache eviction policy is hard-coded:

GitHub will remove any cache entries that have not been accessed in over 7 days. There is no limit on the number of caches you can store, but the total size of all caches in a repository is limited to 10 GB.

bendnorman commented 2 years ago

This issue was closed by #1858.