catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Improve and Automate raw data archiving/access #1418

Closed bendnorman closed 1 month ago

bendnorman commented 2 years ago

Description

This epic tracks updates to the data archiving and access processes. Previously, creating a new archive meant first running the scraper to download new data locally, then using the archiver to upload that data to Zenodo and create a new archive version. This manual process makes updating archives difficult and depends on someone noticing upstream updates, so archives often go stale. Combining the archiver and the scrapers will simplify the process and make it much easier to automate.
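As a rough illustration of what an automated run would need to decide, here is a minimal sketch that compares a fresh download against the checksums recorded for the previous archive version. It assumes change detection by SHA-256 checksum; `ArchiveDiff`, `checksum`, and `diff_against_archive` are hypothetical names for illustration, not part of the archiver, and the actual upload to Zenodo is left out.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ArchiveDiff:
    """Hypothetical summary of how a fresh download differs from the last archive."""

    added: list[str] = field(default_factory=list)
    changed: list[str] = field(default_factory=list)
    removed: list[str] = field(default_factory=list)

    @property
    def needs_new_version(self) -> bool:
        # Any difference at all means a new archive version should be created.
        return bool(self.added or self.changed or self.removed)


def checksum(content: bytes) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(content).hexdigest()


def diff_against_archive(
    previous: dict[str, str], downloaded: dict[str, bytes]
) -> ArchiveDiff:
    """Compare downloaded files with the previous version's {filename: checksum} map."""
    diff = ArchiveDiff()
    for name, content in sorted(downloaded.items()):
        if name not in previous:
            diff.added.append(name)
        elif previous[name] != checksum(content):
            diff.changed.append(name)
    # Files present in the old archive but missing from the new download.
    diff.removed = sorted(set(previous) - set(downloaded))
    return diff
```

A scheduled job could run the download, call `diff_against_archive`, and only publish a new version when `needs_new_version` is true, which removes the need for someone to watch for upstream updates by hand.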

Once new data archives are created, there is still no easy way to access the raw archives outside of PUDL, because the Datastore that PUDL uses to access them is embedded within PUDL. Making the Datastore a standalone software package would let both client projects and individual users access these archives directly.
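To make the standalone idea concrete, here is a minimal sketch of a cache-backed accessor that has no dependency on PUDL. `StandaloneDatastore` and its `fetch` callable are hypothetical and do not reflect the existing Datastore's interface; the download step is injected so that the sketch makes no assumptions about how archives are retrieved.

```python
from pathlib import Path
from typing import Callable


class StandaloneDatastore:
    """Hypothetical cache-backed accessor for raw archive files."""

    def __init__(self, cache_dir: Path, fetch: Callable[[str, str], bytes]):
        # `fetch(dataset, filename)` is an assumed callable that downloads one file.
        self.cache_dir = Path(cache_dir)
        self.fetch = fetch

    def get(self, dataset: str, filename: str) -> bytes:
        """Return a file's bytes, downloading it only if it is not cached locally."""
        path = self.cache_dir / dataset / filename
        if path.exists():
            return path.read_bytes()
        content = self.fetch(dataset, filename)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(content)
        return content
```

Keeping the download logic behind a plain callable means a client project could reuse the caching behaviour while supplying its own retrieval code.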

Scope

- How do we know when we are done? This epic is done when dataset archives are updated automatically.
- What is out of scope? Integrating specific datasets.

Tasks

Archiver

PUDL Integration

Create standalone Datastore

jdangerx commented 1 year ago

We had mentioned maybe using "Try adding a new dataset and see if our automation picks it up and archives it" as the final definition of done. What do you think, @zschira? Or is that just part of catalyst-cooperative/pudl-archiver#2?

zaneselvans commented 1 year ago

Kick off nightly build to detect problems stemming from new data

I feel like there are 2 ways we could approach this.

bendnorman commented 1 month ago

Can this be closed?

zaneselvans commented 1 month ago

We should probably carve out the unfinished work in another issue or issues.

jdangerx commented 1 month ago

I've carved those out, minus the datastore thing, which is a persistent large thing we've been thinking about.

catalyst-cooperative/pudl-archiver#346 catalyst-cooperative/pudl-archiver#347

3639

Closing!