review sandbox zenodo metadata

cmgosnell commented 3 years ago

A variety of metadata fields can be associated with each deposition at Zenodo; these are defined in the developer documentation.

Currently we appear to be populating:

[x] title: currently "Dataset Source", e.g. "Eia860 Source". Can we change this to use appropriate capitalization and potentially be more informative?
[x] publication_date: same as upload date.
[x] doi: as assigned by Zenodo.
[x] keywords: the keywords associated with the data source (e.g. eia861) in our existing PUDL metadata.
[x] license: CC0 for US Govt. data, would need to be different for e.g. ISO or derived datasets.
[x] version: What versioning scheme are we using for this version number? Does it follow the semantic version of the archived data package? Or are we just assigning a new major version number to each subsequent archive which is published? I notice that the sandbox version of the EPA CEMS data has multiple DOIs associated with the same version number (1.0.0). Is there any circumstance under which that is appropriate and expected?
[x] upload_type: dataset
[x] access_right: open (everything is open for now)
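For reference, the fields listed above could be assembled into a deposition payload roughly as follows. This is only a sketch: the function name `build_current_metadata` is hypothetical, the date and keyword values are placeholders, and `cc-zero` is assumed to be the identifier Zenodo expects for CC0. The `doi` field is left out because Zenodo assigns it.

```python
import json


def build_current_metadata(title, publication_date, version, keywords):
    """Assemble the currently populated fields for one deposition (sketch)."""
    return {
        "title": title,
        "upload_type": "dataset",
        "publication_date": publication_date,  # ISO 8601 date, YYYY-MM-DD
        "version": version,
        "keywords": list(keywords),
        "license": "cc-zero",  # assumption: Zenodo's identifier for CC0
        "access_right": "open",  # everything is open for now
    }


metadata = build_current_metadata(
    title="Eia860 Source",  # the current title format from the list above
    publication_date="2020-01-01",  # placeholder date
    version="1.0.0",
    keywords=["eia860"],  # placeholder keyword
)

# Deposition metadata is sent wrapped in a top-level "metadata" object.
payload = json.dumps({"metadata": metadata}, indent=2)
print(payload)
```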

Required fields which we should populate with meaningful information:

[x] publication_type: other
[x] creators: an array of creator objects including individual names (Family, Given), affiliations (Catalyst Cooperative) and creator IDs if available. Who should we associate with these archives?
[ ] description: currently we have just a one-liner. Can we add some more context here? E.g.
- [ ] Note about relation to the PUDL project and Catalyst
- [ ] Motivation for creating a persistent, versioned archive of public data that isn't otherwise persistent, versioned, or programmatically accessible.
- [ ] Pointer to documentation about the Data Package Standard and the included datapackage.json file for programmatic access to the archived data.
- [ ] Pointer to the PUDL software and pudl_datastore.py script for an existing implementation of programmatic access.

Optional fields which we can populate with meaningful information:

[ ] communities: Catalyst Cooperative (this is an archive of all the archives we have published, so folks can see everything we've uploaded to Zenodo). Not sure how one specifies this: the docs say to use the "community identifier", but the Catalyst community is here. I'd guess it's probably catalyst-cooperative.
[ ] language: English
[ ] related_identifiers:
- [ ] isCompiledBy: reference to the software that was used to collect and archive this data, by URL (GitHub repo) or DOI (if we have a Zenodo archive of that repo). Ideally this would include the particular release or commit ID that was used to generate this particular archive.
- [ ] isSupplementTo: Should refer to the DOI of any data release that we make which is based on this archive. We can add these references by hand when we make a new data release, but we could also set it up to be an automatic part of the data release process.
[ ] dates: ISO8601 formatted (YYYY-MM-DD) start and end dates indicating what range of time the data covers. This is relevant to most of our datasets.
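To make the proposal concrete, here is a rough sketch of how the optional fields above might look in the metadata. Several things are assumptions rather than settled: the `catalyst-cooperative` community identifier is only the guess made above, `eng` is assumed to be the language code Zenodo wants for English, `Collected` is one of the date types the Zenodo docs list and may not be the best fit, and the function name, software URL and release DOI are hypothetical placeholders.

```python
from datetime import date


def build_optional_metadata(software_ref, start, end, release_doi=None,
                            community_id="catalyst-cooperative",
                            date_type="Collected"):
    """Build the optional fields proposed above (sketch, not a spec)."""
    # fromisoformat raises ValueError if a date is not YYYY-MM-DD.
    if date.fromisoformat(start) > date.fromisoformat(end):
        raise ValueError("start date must not be after end date")

    related = [{"relation": "isCompiledBy", "identifier": software_ref}]
    if release_doi is not None:
        # Added by hand or by the release process once a data release exists.
        related.append({"relation": "isSupplementTo", "identifier": release_doi})

    return {
        "communities": [{"identifier": community_id}],
        "language": "eng",
        "related_identifiers": related,
        "dates": [{"start": start, "end": end, "type": date_type}],
    }


optional = build_optional_metadata(
    software_ref="https://github.com/example-org/example-archiver",  # placeholder
    start="2009-01-01",  # placeholder coverage range
    end="2018-12-31",
    release_doi="10.5072/zenodo.123456",  # placeholder DOI
)
```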

URLs for the current round of archives. (The https://doi.org/ based URLs do not resolve... is that just because it's the sandbox server?)

[x] eia860: https://sandbox.zenodo.org/record/504556
[x] eia861: https://sandbox.zenodo.org/record/504558
[x] eia923: https://sandbox.zenodo.org/record/504560
[x] epaipm: https://sandbox.zenodo.org/record/602953
[x] epacems: https://sandbox.zenodo.org/record/638878
[x] ferc1: https://sandbox.zenodo.org/record/504562

zaneselvans commented 3 years ago

I looked through all the Zenodo metadata fields in the developer documentation, and all of our current archives, and listed the fields that I think we should be populating in the issue description above. The ones I checked off are the ones we're currently populating that seem fine. The ones that aren't checked off I think need some work, or at least discussion as to whether and how we want to populate them.

ptvirgo commented 3 years ago

Blockers

I'm going to risk posting the comment as best I can here. Please remember that just because my style is terse does not mean I'm trying to hurt feelings.

Title: By definition, I already filled out the titles with what I consider to be appropriate capitalization, punctuation, and detail. If you want a different title, be more informative. You need to give me a spec, not a series of essay questions and guessing games.
Version: The version numbers can be specified manually or, by default, assume an incompatible change and jump accordingly. This was discussed at length and largely at your request. The sandbox has test data and experiments in it, as should be expected.
Creators: My name does not go in. My suggestion is to either use the agency name or leave it alone, but again, spec.
Description: Looks like an essay question instead of a spec to me. Pass.

The other parts are probably manageable, but if I were genuinely trying to get some power plants closed I wouldn't devote more than a few minutes to it.
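For readers following along, the versioning rule described above (a manually specified version wins, otherwise assume an incompatible change) could be sketched like this. It is an illustration only, not the actual archiver code: the function name `resolve_version` is hypothetical, and "jump accordingly" is assumed to mean a semantic-versioning major bump.

```python
def resolve_version(previous, manual=None):
    """Pick the version for a new archive (hypothetical helper)."""
    if manual is not None:
        # A manually specified version always takes precedence.
        return manual
    # Default: treat the new archive as an incompatible change,
    # so bump the major number and reset minor and patch.
    major = int(previous.split(".")[0])
    return f"{major + 1}.0.0"


default_version = resolve_version("1.0.0")
manual_version = resolve_version("1.4.2", manual="1.5.0")
```

Under this rule, two archives can share a version number such as 1.0.0 only if that version was specified manually for both of them.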

ptvirgo commented 3 years ago

"publication_type" is a controlled vocabulary intended for cases where the "upload_type" is a "publication." Our "upload_type" is a "dataset."

ptvirgo commented 3 years ago

"isCompiledBy" is intended to reference software compilation, as in a binary pointing to its compiler. https://0-www-crossref-org.biblio.url.edu/education/content-registration/structural-metadata/relationships/
