catalyst-cooperative / pudl-archiver

A tool for capturing snapshots of public data sources and archiving them on Zenodo for programmatic use.

Create github action for running the scraping/archiving process at desired frequencies #2

Closed: zschira closed this issue 1 year ago

zschira commented 1 year ago

@zschira commented on Tue Sep 13 2022

Once the archiver/scraper repos have been combined, and we have high level scripts for managing the process, it should be very easy to create GitHub Actions for automating the archiving process. New data is released at different frequencies for the different data sources incorporated in PUDL, so we can create multiple actions that run on schedules matching each source's release cadence.
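A minimal sketch of the idea, assuming the `pudl_archiver eia860`-style CLI mentioned later in this thread; the dispatcher, dataset names, and cadence values here are hypothetical and only meant to illustrate "one schedule per data source":

```python
"""Hypothetical dispatcher a scheduled job could call with a dataset name."""
import subprocess
import sys

# Each data source gets its own schedule; these cadences are illustrative only.
RELEASE_CADENCE = {
    "eia860": "annual",
    "eia923": "monthly",
    "epacems": "quarterly",
}


def run_archiver(dataset: str) -> None:
    if dataset not in RELEASE_CADENCE:
        raise SystemExit(f"Unknown dataset: {dataset}")
    # Shell out to the archiver CLI exactly as a scheduled workflow step would.
    subprocess.run(["pudl_archiver", dataset], check=True)


if __name__ == "__main__":
    run_archiver(sys.argv[1])
```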


@zaneselvans commented on Tue Sep 13 2022

I am so excited for this to finally happen!

jdangerx commented 1 year ago

OK, so my understanding of this is that when we run pudl_archiver eia860:

If so, then what we basically need to do is:

@zschira - is my understanding of the archiver flow correct? And also - does the action plan sound reasonable here?

zschira commented 1 year ago

we create a new major version in the Zenodo concept

The terminology here is a little confusing (and we should probably provide better docs), but the concept DOI will always point to the latest version of a dataset, while a deposition refers to a single version.
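To make the distinction concrete, here is a rough Python sketch against Zenodo's documented REST API: the "newversion" action on a published deposition returns a fresh draft deposition that shares the old deposition's concept DOI. The token and deposition id below are placeholders, and error handling is omitted; treat the details as a sketch rather than the archiver's actual code.

```python
import requests

ZENODO = "https://zenodo.org/api"
TOKEN = "..."            # personal access token (placeholder)
OLD_DEPOSITION_ID = 123  # id of the previously published deposition (placeholder)

# Each published version is its own deposition; the "newversion" action creates
# a draft deposition for the next version under the same concept record.
resp = requests.post(
    f"{ZENODO}/deposit/depositions/{OLD_DEPOSITION_ID}/actions/newversion",
    params={"access_token": TOKEN},
)
resp.raise_for_status()
draft_url = resp.json()["links"]["latest_draft"]

# The draft is a distinct deposition (new id, new version DOI once published),
# but its concept DOI matches the old deposition's and keeps resolving to
# whichever version is newest.
draft = requests.get(draft_url, params={"access_token": TOKEN}).json()
print(draft["id"], draft["conceptdoi"])
```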

the new version already has everything from the old version, so we look at the files in the old version and compare with our freshly-downloaded set:

  • anything deleted? delete it
  • anything added? add it
  • anything changed via checksum? update it

then, if nothing changed, we abandon the update (do we need to discard the draft somehow?); otherwise, we tell Zenodo to actually publish the new version

This is correct. Ideally we would discard the draft; however, I've found the Zenodo API to behave unexpectedly when trying to do that, so instead we just reuse the draft during the next run.
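For reference, a minimal sketch (not the archiver's actual implementation) of the add/delete/update comparison described above, assuming the previous version's files are available as a filename-to-md5 mapping, since Zenodo reports md5 checksums for deposition files:

```python
import hashlib
from pathlib import Path


def md5sum(path: Path) -> str:
    """md5 of a local file, matching the checksum Zenodo reports for uploads."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(2**20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_changes(remote: dict[str, str], local_dir: Path) -> dict[str, list[str]]:
    """Compare the previous version's files (filename -> md5) with fresh downloads."""
    local = {p.name: md5sum(p) for p in local_dir.iterdir() if p.is_file()}
    return {
        "delete": sorted(set(remote) - set(local)),
        "add": sorted(set(local) - set(remote)),
        "update": sorted(
            name for name in set(local) & set(remote) if local[name] != remote[name]
        ),
    }


# If all three lists are empty, nothing changed: the draft is left unpublished
# and reused on the next run, per the comment above. Otherwise the changes are
# applied to the draft and the new version is published.
```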