catalyst-cooperative / pudl-archiver

A tool for capturing snapshots of public data sources and archiving them on Zenodo for programmatic use.

Create github action for running the scraping/archiving process at desired frequencies #2

Closed: zschira closed this issue 1 year ago

zschira commented 1 year ago

@zschira commented on Tue Sep 13 2022

Once the archiver/scraper repos have been combined, and we have high level scripts for managing the process, it should be very easy to create GitHub Actions for automating the archiving process. New data is released at different frequencies for the different data sources incorporated in PUDL, so we can create multiple actions that run on schedules matching each source's release cadence.
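A minimal sketch of the idea, assuming the `pudl_archiver eia860`-style CLI mentioned later in this thread; the dispatcher, dataset names, and cadence values here are hypothetical and only meant to illustrate "one schedule per data source":

```python
"""Hypothetical dispatcher a scheduled job could call with a dataset name."""
import subprocess
import sys

# Each data source gets its own schedule; these cadences are illustrative only.
RELEASE_CADENCE = {
    "eia860": "annual",
    "eia923": "monthly",
    "epacems": "quarterly",
}


def run_archiver(dataset: str) -> None:
    if dataset not in RELEASE_CADENCE:
        raise SystemExit(f"Unknown dataset: {dataset}")
    # Shell out to the archiver CLI exactly as a scheduled workflow step would.
    subprocess.run(["pudl_archiver", dataset], check=True)


if __name__ == "__main__":
    run_archiver(sys.argv[1])
```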


@zaneselvans commented on Tue Sep 13 2022

I am so excited for this to finally happen!

jdangerx commented 1 year ago

OK, so my understanding of this is that when we run pudl_archiver eia860:

If so, then what we basically need to do is:

@zschira - is my understanding of the archiver flow correct? And also - does the action plan sound reasonable here?

zschira commented 1 year ago

we create a new major version in the Zenodo concept

The terminology here is a little confusing (and we should probably provide better docs), but the concept DOI will always point to the latest version of a dataset, while a deposition refers to a single version.
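To make the distinction concrete, here is a rough Python sketch against Zenodo's documented REST API: the "newversion" action on a published deposition returns a fresh draft deposition that shares the old deposition's concept DOI. The token and deposition id below are placeholders, and error handling is omitted; treat the details as a sketch rather than the archiver's actual code.

```python
import requests

ZENODO = "https://zenodo.org/api"
TOKEN = "..."            # personal access token (placeholder)
OLD_DEPOSITION_ID = 123  # id of the previously published deposition (placeholder)

# Each published version is its own deposition; the "newversion" action creates
# a draft deposition for the next version under the same concept record.
resp = requests.post(
    f"{ZENODO}/deposit/depositions/{OLD_DEPOSITION_ID}/actions/newversion",
    params={"access_token": TOKEN},
)
resp.raise_for_status()
draft_url = resp.json()["links"]["latest_draft"]

# The draft is a distinct deposition (new id, new version DOI once published),
# but its concept DOI matches the old deposition's and keeps resolving to
# whichever version is newest.
draft = requests.get(draft_url, params={"access_token": TOKEN}).json()
print(draft["id"], draft["conceptdoi"])
```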

the new version already has everything from the old version, so we look at the files in the old version and compare with our freshly-downloaded set:

  • anything deleted? delete it
  • anything added? add it
  • anything changed via checksum? update it

then, if nothing changed, we abandon the update (do we need to discard the draft somehow?); otherwise, we tell Zenodo to actually publish the new version

This is correct. Ideally we would discard the draft; however, I've found the Zenodo API to behave unexpectedly when trying to do that, so instead we just reuse the draft during the next run.
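For reference, a minimal sketch (not the archiver's actual implementation) of the add/delete/update comparison described above, assuming the previous version's files are available as a filename-to-md5 mapping, since Zenodo reports md5 checksums for deposition files:

```python
import hashlib
from pathlib import Path


def md5sum(path: Path) -> str:
    """md5 of a local file, matching the checksum Zenodo reports for uploads."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(2**20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_changes(remote: dict[str, str], local_dir: Path) -> dict[str, list[str]]:
    """Compare the previous version's files (filename -> md5) with fresh downloads."""
    local = {p.name: md5sum(p) for p in local_dir.iterdir() if p.is_file()}
    return {
        "delete": sorted(set(remote) - set(local)),
        "add": sorted(set(local) - set(remote)),
        "update": sorted(
            name for name in set(local) & set(remote) if local[name] != remote[name]
        ),
    }


# If all three lists are empty, nothing changed: the draft is left unpublished
# and reused on the next run, per the comment above. Otherwise the changes are
# applied to the draft and the new version is published.
```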