catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Write a data release script #425

Closed zaneselvans closed 4 years ago

zaneselvans commented 4 years ago

We're publishing / archiving our data releases at Zenodo, and they provide a RESTful API that we can access from within Python to automate the release process and to populate the Zenodo metadata fields, many of which we already compile for our own metadata.
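
For reference, the Zenodo deposition API is plain HTTP + JSON, so the release script could drive it directly with requests -- roughly something like this sketch (token handling and the metadata values are just placeholders, not our actual fields):

```python
# Sketch only: create a Zenodo deposition and fill in its metadata.
# The token handling and metadata values are placeholders for illustration.
import os

import requests

ZENODO_API = "https://zenodo.org/api/deposit/depositions"
token = os.environ["ZENODO_TOKEN"]  # hypothetical env var holding the API token

# Create an empty deposition so Zenodo assigns an ID (and eventually a DOI).
resp = requests.post(ZENODO_API, params={"access_token": token}, json={})
resp.raise_for_status()
deposition = resp.json()

# Populate the Zenodo metadata from fields we already compile for our own metadata.
metadata = {
    "metadata": {
        "title": "PUDL Data Release (example)",
        "upload_type": "dataset",
        "description": "Data packages generated by the PUDL ETL.",
        "creators": [{"name": "Catalyst Cooperative"}],
    }
}
resp = requests.put(
    f"{ZENODO_API}/{deposition['id']}",
    params={"access_token": token},
    json=metadata,
)
resp.raise_for_status()
```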

In addition, we need to make sure that we generate the archived data in a controlled and reproducible way -- using a particular published release of the catalystcoop.pudl package, in a well-defined Python environment.

This will all be much more reliable and repeatable (and easier) if we have a script that does it for us the same way every time. What all would such a script need to do?

zaneselvans commented 4 years ago

Another thing to maybe discuss here: at what granularity do we want to archive this data and assign DOIs?

I've been leaning toward releasing all the data packages in the same Zenodo archive, so they are obviously linked to each other, and so there's less overhead in doing a data release, and less duplication in what's getting archived. In this scenario all the data packages would share the single DOI that's assigned to the overall Zenodo archive... but is that wrong? Is that an okay way to use the DOIs?

The alternative would be to generate all the data packages at the same time, and then push each of them up to their own Zenodo archives, with their own DOIs. If we have a script that can do that in an automated way, maybe it's not much more overhead, but it seems like it might be more confusing for users to navigate.

zaneselvans commented 4 years ago

@lwinfree I'm curious if you have thoughts on the correct way to use DOIs in connection with the published data packages, as discussed above.

zaneselvans commented 4 years ago

Inputs that we need to feed into this script:

The ETL settings file should allow us to automatically:

If there's additional metadata that we need to populate at Zenodo, where should that live? Should we make sure that all of it is available within the datapackage metadata? Should it be stored in the ETL settings file? Should it be in its own YAML or JSON file?

lwinfree commented 4 years ago

Hi @zaneselvans! OK, I think I am leaning towards:

> I've been leaning toward releasing all the data packages in the same Zenodo archive, so they are obviously linked to each other, and so there's less overhead in doing a data release, and less duplication in what's getting archived. In this scenario all the data packages would share the single DOI that's assigned to the overall Zenodo archive... but is that wrong? Is that an okay way to use the DOIs?

Breaking it down, I think the main archive should definitely have a DOI. I think it makes sense to have all the data packages in the same archive. I understand that if all the datapackages have the same DOI (that is, the archive DOI), then it could be confusing. But, could you generate a separate UUID for each datapackage during the ETL? In the ID field? So that the whole archive has a DOI but each datapackage is identifiable with its UUID?

zaneselvans commented 4 years ago

Right now the UUID identifies the ETL run that generated the data packages -- every run gets its own UUID, which is stamped into all the data packages it outputs. This UUID is used to verify inter-package compatibility any time the data packages are used together -- like when all the resources are combined into a single data package and that package is loaded into a database -- since the ETL parameters could change between runs, and we don't want packages which were generated under different circumstances to be accidentally used together.
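
Roughly, the compatibility check amounts to something like this (a sketch only -- the metadata key and helper are illustrative, not the actual PUDL code):

```python
# Sketch of the inter-package compatibility check described above.
# The "datapkg-bundle-uuid" key and this helper are illustrative only.
import json
from pathlib import Path


def check_bundle_uuid(datapkg_json_paths):
    """Confirm all datapackages were produced by the same ETL run."""
    uuids = set()
    for path in datapkg_json_paths:
        with Path(path).open() as f:
            descriptor = json.load(f)
        uuids.add(descriptor.get("datapkg-bundle-uuid"))
    if len(uuids) != 1:
        raise ValueError(f"Datapackages come from different ETL runs: {uuids}")
    return uuids.pop()
```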

In the case where each data package (EPA CEMS, FERC Form 1, EIA 860/923) is archived as its own resource on Zenodo, there would be no "main" archive -- each of them would have their own master DOI, and their own lineage of versioned DOIs, and the only things linking particular versions of those packages together would be the ETL UUID, and any Zenodo level metadata pointing at other specific versioned DOIs for the other packages which are compatible with an individual package.

I guess another option would be to do both of these things -- have independent CEMS, FERC1, and EIA data packages, with their own DOIs, and also have a master PUDL Data Release archive, which has its own DOI, and which contains another copy of each of the independent data packages, with its own personal DOI stamped inside it. But that level of duplication seems kinda silly.

Yet another option (which is maybe what you were suggesting) is that the individual data packages could just... not have DOIs in them at all, and instead be identified by UUIDs alone. There could be a UUID in the id field, and another UUID in the datapkg-bundle-id field (which indicates compatibility). But then someone who finds themselves with just the data package doesn't have a clear reference pointing to the authoritative source for the package, like a DOI would provide...

zaneselvans commented 4 years ago

Another snag!

We want the data that we're publishing to be reproducible, which means it needs to be processed by a published release of the PUDL software (v0.2.1) that can be referenced in an environment.yml file or some other record of the software environment, and reproduced by a user later. We also want the data to be validated before we publish it -- if ever there were a time when the data validation matters, it's when we're publishing data for others to use.

However, at the moment these two things are at odds: the data validation can only be run using pytest and tox, and the mechanics of running it are encoded in the tests/validation modules rather than in the distributed package. We also don't want to get into a situation where, in order to attempt a data release, we need to release a new version of PUDL, then debug data-release-related issues, then release another new version of PUDL, etc. for several cycles.

I think the right solution probably goes along with issue #400, moving the data validation test-case specifications into package_data, and any required functions into pudl.validate if they aren't already there, in such a way that:

However, in the meantime, it seems like it might be best to just get something rough published in the same form that we expect future data releases to take, so that people can at least play with the data without having to run the whole ETL pipeline, and we can get the DOI established and start playing with the Zenodo API a little...
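
For concreteness, the package_data approach from #400 might eventually look something like this (a sketch only -- the file name, its location, and the case format are hypothetical):

```python
# Hypothetical sketch of running packaged validation cases outside pytest/tox.
# The validation_cases.yml file and its schema are made up for illustration.
import importlib.resources

import yaml

import pudl.validate


def run_release_validations():
    """Load validation cases shipped in package_data and run each one."""
    with importlib.resources.open_text("pudl.package_data", "validation_cases.yml") as f:
        cases = yaml.safe_load(f)
    for case in cases:
        # Each case names a function in pudl.validate plus its keyword arguments.
        check = getattr(pudl.validate, case["check"])
        check(**case.get("kwargs", {}))
```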

lwinfree commented 4 years ago

Hi @zaneselvans!

For the DOI question, have you come to a conclusion? I'm happy to jump on a call to discuss if that would help (we can chat at our normal call on Monday, but I'm happy to chat earlier too). My thoughts in a nutshell: the large Zenodo archive (containing datapackages all generated by the same ETL, which should all be compatible*) gets one DOI. The datapackages inside that archive get 2 IDs - one that is the UUID for the datapackage, and one that is called something like "master_archive_id" that is the same as the archive's DOI. That way the single datapackages can be linked back to the master archive. Does that make sense? Am I missing something? I had to draw out a diagram to think this through 🙃

*assuming that all datapackages from the same ETL should be compatible

For the validation question, I'm going to tag @roll, but I think your idea here is correct:

> I think the right solution probably goes along with issue #400, moving the data validation test-case specifications into package_data, and any required functions into pudl.validate if they aren't already there

zaneselvans commented 4 years ago

Yeah, I think without changes to the metadata specification, that's probably the best thing to do. We already have a datapkg-bundle-uuid field at the package level, so I guess we can just add a parallel datapkg-bundle-doi field alongside it for the bundles that get archived, and set the id field to a generic UUID so it can definitely be uniquely identified if it's found in the wild somewhere.
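
Concretely, the package-level identifiers would look something like this (a sketch only -- the package name and DOI values are placeholders; the field names are the ones discussed in this thread):

```python
# Illustrative sketch of the package-level identifiers discussed above.
# The package name and DOI values are placeholders.
import uuid

# One UUID per ETL run, shared by every datapackage in the bundle:
datapkg_bundle_uuid = str(uuid.uuid4())

datapkg_descriptor = {
    "name": "pudl-example",                    # placeholder package name
    "id": str(uuid.uuid4()),                   # generic UUID unique to this package
    "datapkg-bundle-uuid": datapkg_bundle_uuid,
    # Only present for bundles that actually get archived on Zenodo:
    "datapkg-bundle-doi": "10.5281/zenodo.0000000",  # placeholder DOI
}
```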