@zaneselvans @bendnorman was this sufficiently addressed by https://github.com/catalyst-cooperative/pudl/pull/2150?
I think this was partially solved by #2150. For Read the Docs to authenticate with GCP, we need to pass the service account credentials to RTD in an env var and then write the contents of that var to a file that gcloud can authenticate with. Another option is to create an S3 bucket for our Zenodo cache (I need to create an issue for this), because AWS credentials are passed around as env vars instead of a JSON key file.
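For reference, a minimal sketch of that env-var-to-file approach. The `GOOGLE_CREDS_JSON` name is hypothetical; the important part is that `google.auth.default()` picks up whatever file `GOOGLE_APPLICATION_CREDENTIALS` points at:

```python
import os
from pathlib import Path

# GOOGLE_CREDS_JSON is a hypothetical RTD env var holding the raw
# service account key JSON.
creds_path = Path("~/.config/gcloud/rtd-sa-key.json").expanduser()
creds_path.parent.mkdir(parents=True, exist_ok=True)
creds_path.write_text(os.environ["GOOGLE_CREDS_JSON"])
# google.auth.default() checks this env var when resolving credentials.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(creds_path)
```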
I know there are better ways of authenticating to cloud services, like workload identity federation or Okta, but I don't understand how they work. @jdangerx do you have any experience with this type of authentication?
We could also maybe use `service_account.Credentials.from_service_account_info()` (docs, example). So, in `resource_cache.py:138`, where we have:
```python
# Excerpt from resource_cache.py (imports added for context).
from pathlib import Path
from typing import Any
from urllib.parse import urlparse

import google.auth
from google.cloud import storage


class GoogleCloudStorageCache(AbstractCache):
    """Implements file cache backed by Google Cloud Storage bucket."""

    def __init__(self, gcs_path: str, **kwargs: Any):
        """Constructs new cache that stores files in Google Cloud Storage.

        Args:
            gcs_path (str): path to where the data should be stored. This
                should be in the form of gs://{bucket-name}/{optional-path-prefix}
        """
        super().__init__(**kwargs)
        parsed_url = urlparse(gcs_path)
        if parsed_url.scheme != "gs":
            raise ValueError(f"gcs_path should start with gs:// (found: {gcs_path})")
        self._path_prefix = Path(parsed_url.path)
        # Get GCP credentials and billing project id.
        # A billing project is now required because zenodo-cache is requester pays.
        credentials, project_id = google.auth.default()
        self._bucket = storage.Client(credentials=credentials).bucket(
            parsed_url.netloc, user_project=project_id
        )
```
We could replace that with a check for an env var like `RTD_GOOGLE_CREDS` and do something like:
```python
import json
import os

import google.auth
from google.oauth2 import service_account

...

if creds_json := os.getenv("RTD_GOOGLE_CREDS"):
    credentials = service_account.Credentials.from_service_account_info(
        json.loads(creds_json)
    )
    project_id = "catalyst-cooperative-pudl"
else:
    credentials, project_id = google.auth.default()
```
I couldn't find anything about RTD being able to use workload identity federation to log into gcloud. I guess in theory we could muck around with our own OAuth thing within the RTD build environment? Seems like way more effort.
It does seem like we have appetite for a Zenodo cache in S3, though - given the whole free hosting thing. So maybe this all becomes moot if that ends up happening, though I'm not sure where in our priorities that would lie...
I think there's a common pattern of base64-encoding JSON and binary credentials into an env var or GitHub secret, which we could maybe use.
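Something like this sketch, where `GCP_SA_KEY_B64` is a hypothetical secret holding the output of `base64 < service-account-key.json`:

```python
import base64
import json
import os

from google.oauth2 import service_account

# GCP_SA_KEY_B64 is a hypothetical secret containing the base64-encoded
# service account key JSON.
key_info = json.loads(base64.b64decode(os.environ["GCP_SA_KEY_B64"]))
credentials = service_account.Credentials.from_service_account_info(key_info)
```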
If we automate the Zenodo archiving process, and split out the datastore as a usable standalone CLI, it would be extremely nice if we could cache all that data in the free AWS buckets. I do think we would hit the 100GB limit before too long, and need to ask for more space. There would be a lot of duplicated data since every DOI contains a full copy, even if the files aren't different from previous versions (Zenodo is smart about deduplicating this information on their backend and only storing a reference to the older version if the checksum hasn't changed). Unless there's some kind of automatic file-level deduplication setting that we could turn on for the bucket.
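As far as I know S3 doesn't have an automatic file-level dedup setting, but we could get the same effect at the application level by keying objects on their checksum. A rough sketch (the bucket name and key layout are made up):

```python
import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def cache_file(path: str, bucket: str = "pudl-zenodo-cache") -> str:
    """Upload a file keyed on its SHA-256, skipping files already cached."""
    with open(path, "rb") as f:
        key = f"blobs/{hashlib.sha256(f.read()).hexdigest()}"
    try:
        s3.head_object(Bucket=bucket, Key=key)  # raises ClientError on 404
    except ClientError:
        s3.upload_file(path, bucket, key)
    return key
```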
The docs builds have been extremely reliable since #2150 and the RTD credentials fix seems kind of annoying, so I think I'm going to close this and we can revisit if/when we have our raw inputs cached in the open data bucket on S3, or RTD gets flaky.
Currently when we build the documentation, we are still hitting Zenodo to download the `datapackage.json` files so we can see what data is covered by the data sources, but Zenodo is slow and flaky. If available, we should use local data first, then the GCS cache, and only then fall back to Zenodo.

The problematic call is in `pudl.metadata.classes.DataSource.add_datasource_metadata()`: that `Datastore()` instantiation needs to somehow get passed the right `local_cache_path` and `gcs_cache_path`. This method is being called by `DataSource.to_rst()`, which is being called by the `data_sources_metadata_to_rst()` function which we have defined in `docs/conf.py`.

Ideally, we would be able to get it the right credentials both in our CI / testing environment on GitHub and also on ReadTheDocs, so neither of them flakes out because of lost connections to Zenodo.
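For what it's worth, a sketch of what the wiring in `docs/conf.py` might look like, assuming `Datastore` lives in `pudl.workspace.datastore` and the paths arrive via hypothetical environment variables:

```python
import os
from pathlib import Path

from pudl.workspace.datastore import Datastore

# PUDL_DATASTORE_DIR and PUDL_GCS_CACHE are hypothetical env var names;
# the real configuration mechanism is TBD.
local_cache = os.getenv("PUDL_DATASTORE_DIR")
datastore = Datastore(
    local_cache_path=Path(local_cache) if local_cache else None,
    gcs_cache_path=os.getenv("PUDL_GCS_CACHE"),  # e.g. a gs:// bucket URL
)
```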