catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Documentation build should not rely on Zenodo metadata #2149

Closed: zaneselvans closed this issue 1 year ago

zaneselvans commented 1 year ago

Currently, when we build the documentation, we still hit Zenodo to download the datapackage.json files so we can see what data is covered by each data source, but Zenodo is slow and flaky. We should use local data first if it's available, then the GCS cache, and only then fall back to Zenodo.

The problematic call is in pudl.metadata.classes.DataSource.add_datastore_metadata():

def add_datastore_metadata(self) -> None:
    """Get source file metadata from the datastore."""
    dp_desc = Datastore(sandbox=False).get_datapackage_descriptor(self.name)
    partitions = dp_desc.get_partitions()
    if "year" in partitions:
        partitions["years"] = partitions["year"]
    elif "year_month" in partitions:
        partitions["year_month"] = max(partitions["year_month"])
    self.source_file_dict["source_years"] = self.get_temporal_coverage(partitions)
    self.source_file_dict["download_size"] = dp_desc.get_download_size()

That Datastore() instantiation somehow needs to be passed the right local_cache_path and gcs_cache_path.
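
As a rough sketch, something like this could let the docs build pick up cache locations from the environment and hand them to the datastore (the env var names are hypothetical, and it assumes the Datastore in pudl.workspace.datastore accepts local_cache_path and gcs_cache_path keyword arguments, as described above):

import os
from pathlib import Path

from pudl.workspace.datastore import Datastore

# Hypothetical env vars; whatever CI / RTD actually sets would go here.
local_cache = os.getenv("PUDL_LOCAL_CACHE_PATH")
gcs_cache = os.getenv("PUDL_GCS_CACHE_PATH")

datastore = Datastore(
    sandbox=False,
    local_cache_path=Path(local_cache) if local_cache else None,
    gcs_cache_path=gcs_cache,
)
dp_desc = datastore.get_datapackage_descriptor("eia860")  # any data source name

Those same paths would then need to be threaded through add_datastore_metadata() instead of hard-coding Datastore(sandbox=False).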

This method is called by DataSource.to_rst(), which is in turn called by the data_sources_metadata_to_rst() function we define in docs/conf.py.

Ideally, we would be able to get the right credentials to it both in our CI / testing environment on GitHub and on Read the Docs, so that neither of them flakes out because of lost connections to Zenodo.

jdangerx commented 1 year ago

@zaneselvans @bendnorman was this sufficiently addressed by https://github.com/catalyst-cooperative/pudl/pull/2150?

bendnorman commented 1 year ago

I think this was partially solved by #2150. For Read the Docs to authenticate with GCP, we need to pass the service account credentials to RTD as an env var and then write the contents of that var to a file gcloud can authenticate with. Another option is to create an S3 bucket for our Zenodo cache (I need to create an issue for this), since AWS credentials are passed around as env vars rather than as a JSON file.
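
For the first option, something roughly like this could run at the start of the RTD build (just a sketch; the env var name and key file path are hypothetical, not anything we've settled on):

import os
from pathlib import Path

# Hypothetical RTD env var holding the service account JSON.
creds_json = os.environ["PUDL_GCP_SERVICE_ACCOUNT_KEY"]
key_file = Path("gcp-service-account.json")
key_file.write_text(creds_json)

# google.auth.default() looks for a key file path in GOOGLE_APPLICATION_CREDENTIALS,
# so pointing it at the file we just wrote lets the GCS client authenticate.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(key_file)

After that, the existing google.auth.default() call should pick up the key file without any code changes.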

I know there are better ways of authenticating with cloud services, like workload identity federation or Okta, but I don't understand how they work. @jdangerx do you have any experience with this type of authentication?

jdangerx commented 1 year ago

We could also maybe use service_account.Credentials.from_service_account_info() (docs, example). In resource_cache.py:138 we currently have:

class GoogleCloudStorageCache(AbstractCache):
    """Implements file cache backed by Google Cloud Storage bucket."""

    def __init__(self, gcs_path: str, **kwargs: Any):
        """Constructs new cache that stores files in Google Cloud Storage.

        Args:
            gcs_path (str): path to where the data should be stored. This should
              be in the form of gs://{bucket-name}/{optional-path-prefix}
        """
        super().__init__(**kwargs)
        parsed_url = urlparse(gcs_path)
        if parsed_url.scheme != "gs":
            raise ValueError(f"gcs_path should start with gs:// (found: {gcs_path})")
        self._path_prefix = Path(parsed_url.path)
        # Get GCP credentials and billing project id.
        # A billing project is now required because zenodo-cache is requester pays.
        credentials, project_id = google.auth.default()  # <-- the call in question
        self._bucket = storage.Client(credentials=credentials).bucket(
            parsed_url.netloc, user_project=project_id
        )

We could replace that with a check for an env var like RTD_GOOGLE_CREDS and do something like

import json
import os

import google.auth
from google.oauth2 import service_account

...

if creds_json := os.getenv("RTD_GOOGLE_CREDS"):
    credentials = service_account.Credentials.from_service_account_info(
        json.loads(creds_json)
    )
    project_id = "catalyst-cooperative-pudl"
else:
    credentials, project_id = google.auth.default()

I couldn't find anything that would let RTD use workload identity federation to log into gcloud. I guess in theory we could muck around with our own OAuth thing within the RTD build environment? That seems like way more effort.

jdangerx commented 1 year ago

It does seem like we have an appetite for a Zenodo cache in S3, though, given the free hosting. So maybe this all becomes moot if that ends up happening, though I'm not sure where it would fall in our priorities...

zaneselvans commented 1 year ago

I think there's a common pattern of base64-encoding JSON and binary credentials into an env var or GitHub secret, which we could maybe use.
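
A rough sketch of that pattern (the secret name is hypothetical): store the base64-encoded service account JSON in a secret, then decode it at build time before constructing credentials.

import base64
import json
import os

from google.oauth2 import service_account

# Hypothetical secret / env var holding base64(service-account.json).
encoded = os.environ["GCP_SA_KEY_B64"]
info = json.loads(base64.b64decode(encoded))
credentials = service_account.Credentials.from_service_account_info(info)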

If we automate the Zenodo archiving process and split out the datastore as a usable standalone CLI, it would be extremely nice if we could cache all that data in the free AWS buckets. I do think we would hit the 100 GB limit before too long and need to ask for more space. There would be a lot of duplicated data, since every DOI contains a full copy of the data, even if the files aren't different from previous versions (Zenodo is smart about deduplicating this on their backend, storing only a reference to the older version if the checksum hasn't changed). Unless there's some kind of automatic file-level deduplication setting that we could turn on for the bucket.

zaneselvans commented 1 year ago

The docs builds have been extremely reliable since #2150, and the RTD credentials fix seems kind of annoying, so I think I'm going to close this. We can revisit if/when we have our raw inputs cached in the open data bucket on S3, or if RTD gets flaky.