catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Write a data release script #425

Closed zaneselvans closed 4 years ago

zaneselvans commented 4 years ago

We're publishing / archiving our data releases at Zenodo, and they provide a RESTful API that we can access from within Python to automate the release process and to populate the Zenodo metadata fields, many of which we already compile for our own metadata.
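
For reference, the Zenodo deposition API is plain HTTP + JSON, so the release script could drive it directly with requests -- roughly something like this sketch (token handling and the metadata values are just placeholders, not our actual fields):

```python
# Sketch only: create a Zenodo deposition and fill in its metadata.
# The token handling and metadata values are placeholders for illustration.
import os

import requests

ZENODO_API = "https://zenodo.org/api/deposit/depositions"
token = os.environ["ZENODO_TOKEN"]  # hypothetical env var holding the API token

# Create an empty deposition so Zenodo assigns an ID (and eventually a DOI).
resp = requests.post(ZENODO_API, params={"access_token": token}, json={})
resp.raise_for_status()
deposition = resp.json()

# Populate the Zenodo metadata from fields we already compile for our own metadata.
metadata = {
    "metadata": {
        "title": "PUDL Data Release (example)",
        "upload_type": "dataset",
        "description": "Data packages generated by the PUDL ETL.",
        "creators": [{"name": "Catalyst Cooperative"}],
    }
}
resp = requests.put(
    f"{ZENODO_API}/{deposition['id']}",
    params={"access_token": token},
    json=metadata,
)
resp.raise_for_status()
```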

In addition, we need to make sure that we generate the archived data in a controlled and reproducible way -- using a particular published release of the catalystcoop.pudl package, in a well-defined Python environment.

This will all be much more reliable and repeatable (and easier) if we have a script that does it for us the same way every time. What all would such a script need to do?

zaneselvans commented 4 years ago

Another thing to maybe discuss here: at what granularity do we want to archive this data and assign DOIs?

I've been leaning toward releasing all the data packages in the same Zenodo archive, so they are obviously linked to each other, and so there's less overhead in doing a data release, and less duplication in what's getting archived. In this scenario all the data packages would share the single DOI that's assigned to the overall Zenodo archive... but is that wrong? Is that an okay way to use the DOIs?

The alternative would be to generate all the data packages at the same time, and then push each of them up to their own Zenodo archives, with their own DOIs. If we have a script that can do that in an automated way, maybe it's not much more overhead, but it seems like it might be more confusing for users to navigate.

zaneselvans commented 4 years ago

@lwinfree I'm curious if you have thoughts on the correct way to use DOIs in connection with the published data packages, as discussed above.

zaneselvans commented 4 years ago

Inputs that we need to feed into this script:

The ETL settings file should allow us to automatically:

If there's additional metadata that we need to populate at Zenodo, where should that live? Should we make sure that all of it is available within the datapackage metadata? Should it be stored in the ETL settings file? Should it be in its own YAML or JSON file?

lwinfree commented 4 years ago

Hi @zaneselvans! OK, I think I am leaning towards:

> I've been leaning toward releasing all the data packages in the same Zenodo archive, so they are obviously linked to each other, and so there's less overhead in doing a data release, and less duplication in what's getting archived. In this scenario all the data packages would share the single DOI that's assigned to the overall Zenodo archive... but is that wrong? Is that an okay way to use the DOIs?

Breaking it down, I think the main archive should definitely have a DOI. I think it makes sense to have all the data packages in the same archive. I understand that if all the datapackages have the same DOI (that is, the archive DOI), then it could be confusing. But, could you generate a separate UUID for each datapackage during the ETL? In the ID field? So that the whole archive has a DOI but each datapackage is identifiable with its UUID?

zaneselvans commented 4 years ago

Right now the UUID identifies the ETL run that generated the data packages -- every run gets its own UUID, which is stamped into all the data packages it outputs. This UUID is used to verify inter-package compatibility any time the data packages are used together -- like when all the resources are combined into a single data package and that package is loaded into a database -- since the ETL parameters could change between runs, and we don't want packages which were generated under different circumstances to be accidentally used together.
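
Roughly, the compatibility check amounts to something like this (a sketch only -- the metadata key and helper are illustrative, not the actual PUDL code):

```python
# Sketch of the inter-package compatibility check described above.
# The "datapkg-bundle-uuid" key and this helper are illustrative only.
import json
from pathlib import Path


def check_bundle_uuid(datapkg_json_paths):
    """Confirm all datapackages were produced by the same ETL run."""
    uuids = set()
    for path in datapkg_json_paths:
        with Path(path).open() as f:
            descriptor = json.load(f)
        uuids.add(descriptor.get("datapkg-bundle-uuid"))
    if len(uuids) != 1:
        raise ValueError(f"Datapackages come from different ETL runs: {uuids}")
    return uuids.pop()
```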

In the case where each data package (EPA CEMS, FERC Form 1, EIA 860/923) is archived as its own resource on Zenodo, there would be no "main" archive -- each of them would have their own master DOI, and their own lineage of versioned DOIs, and the only things linking particular versions of those packages together would be the ETL UUID, and any Zenodo level metadata pointing at other specific versioned DOIs for the other packages which are compatible with an individual package.

I guess another option would be to do both of these things -- have independent CEMS, FERC1, and EIA data packages, with their own DOIs, and also have a master PUDL Data Release archive, which has its own DOI, and which contains another copy of each of the independent data packages, with its own personal DOI stamped inside it. But that level of duplication seems kinda silly.

Yet another option (which is maybe what you were suggesting) is that the individual data packages could just... not have DOIs in them at all, and instead be identified by UUIDs alone. There could be a UUID in the id field, and another UUID in the datapkg-bundle-id field (which indicates compatibility). But then someone who finds themselves with just the data package doesn't have a clear reference pointing to the authoritative source for the package, like a DOI would provide...

zaneselvans commented 4 years ago

Another snag!

We want the data that we're publishing to be reproducible, which means it needs to be processed by a published release of the PUDL software (v0.2.1) that can be referenced in an environment.yml file or some other record of the software environment, and reproduced by a user later. We also want the data to be validated before we publish it -- if ever there were a time when the data validation matters, it's when we're publishing data for others to use.

However, at the moment these two things are at odds: the data validation can only be run using pytest and tox, and the mechanics of running it are encoded in the tests/validation modules rather than in the distributed package. We also don't want to get into a situation where, in order to attempt a data release, we need to release a new version of PUDL, then debug data-release-related issues, then release another new version of PUDL, etc. for several cycles.

I think the right solution probably goes along with issue #400, moving the data validation test-case specifications into package_data, and any required functions into pudl.validate if they aren't already there, in such a way that:

However, in the meantime, it seems like it might be best to just get something rough published in the same form that we expect future data releases to take, so that people can at least play with the data without having to run the whole ETL pipeline, and we can get the DOI established and start playing with the Zenodo API a little...
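
For concreteness, the package_data approach from #400 might eventually look something like this (a sketch only -- the file name, its location, and the case format are hypothetical):

```python
# Hypothetical sketch of running packaged validation cases outside pytest/tox.
# The validation_cases.yml file and its schema are made up for illustration.
import importlib.resources

import yaml

import pudl.validate


def run_release_validations():
    """Load validation cases shipped in package_data and run each one."""
    with importlib.resources.open_text("pudl.package_data", "validation_cases.yml") as f:
        cases = yaml.safe_load(f)
    for case in cases:
        # Each case names a function in pudl.validate plus its keyword arguments.
        check = getattr(pudl.validate, case["check"])
        check(**case.get("kwargs", {}))
```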

lwinfree commented 4 years ago

Hi @zaneselvans!

For the DOI question, have you come to a conclusion? I'm happy to jump on a call to discuss if that would help (we can chat at our normal call on Monday, but I'm happy to chat earlier too). My thoughts in a nutshell: the large Zenodo archive (containing datapackages all generated by the same ETL, which should all be compatible*) gets one DOI. The datapackages inside that archive get 2 IDs - one that is the UUID for the datapackage, and one that is called something like "master_archive_id" that is the same as the archive's DOI. That way the single datapackages can be linked back to the master archive. Does that make sense? Am I missing something? I had to draw out a diagram to think this through 🙃

*assuming that all datapackages from the same ETL should be compatible

For the validation question, I'm going to tag @roll, but I think your idea here is correct:

> I think the right solution probably goes along with issue #400, moving the data validation test-case specifications into package_data, and any required functions into pudl.validate if they aren't already there

zaneselvans commented 4 years ago

Yeah, I think without changes to the metadata specification, that's probably the best thing to do. We already have a datapkg-bundle-uuid field at the package level, so I guess we can just add a parallel datapkg-bundle-doi field alongside it for the bundles that get archived, and set the id field to a generic UUID so it can definitely be uniquely identified if it's found in the wild somewhere.
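
Concretely, the package-level identifiers would look something like this (a sketch only -- the package name and DOI values are placeholders; the field names are the ones discussed in this thread):

```python
# Illustrative sketch of the package-level identifiers discussed above.
# The package name and DOI values are placeholders.
import uuid

# One UUID per ETL run, shared by every datapackage in the bundle:
datapkg_bundle_uuid = str(uuid.uuid4())

datapkg_descriptor = {
    "name": "pudl-example",                    # placeholder package name
    "id": str(uuid.uuid4()),                   # generic UUID unique to this package
    "datapkg-bundle-uuid": datapkg_bundle_uuid,
    # Only present for bundles that actually get archived on Zenodo:
    "datapkg-bundle-doi": "10.5281/zenodo.0000000",  # placeholder DOI
}
```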