catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Clean up and finish compiling PUDL Metadata #416

Closed: zaneselvans closed this issue 4 years ago

zaneselvans commented 4 years ago

Along with #399 (removing extraneous / unused metadata), we need to update and clean up the Megadata JSON file: get rid of all the "idfk" and other placeholders, and fill it with good information before we publish it.

Also need to add @gschivley as a contributor.

cmgosnell commented 4 years ago

Some of the cleanup will happen in Issue #399, but in terms of adding things, I think it would be good to add keywords and version. For keywords, I assume we can have PUDL-level keywords and dataset-level keywords and squish them together according to which datasets are in the package. Should these be stored in constants or in the megadata file?

Anything else?
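As a rough illustration of the keyword "squishing" idea above, here is a minimal Python sketch. The keyword values, the `PUDL_KEYWORDS` and `DATASET_KEYWORDS` names, and the `package_keywords` helper are all hypothetical; nothing here reflects where PUDL actually stores keywords (constants vs. the megadata file is exactly the open question).

```python
# Hypothetical keyword tables; the real PUDL keywords are not given in this thread.
PUDL_KEYWORDS = ["energy", "electricity", "utilities"]
DATASET_KEYWORDS = {
    "eia860": ["eia", "eia860", "generators"],
    "eia923": ["eia", "eia923", "fuel"],
    "ferc1": ["ferc", "form 1"],
}


def package_keywords(datasets):
    """Combine PUDL-level keywords with those of each dataset in the package."""
    merged = list(PUDL_KEYWORDS)
    for dataset in datasets:
        # Unknown datasets simply contribute no extra keywords.
        merged.extend(DATASET_KEYWORDS.get(dataset, []))
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(merged))
```

Calling `package_keywords(["eia860", "eia923"])` would yield the PUDL-level keywords followed by the de-duplicated dataset keywords, so a shared term like "eia" appears only once.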

zaneselvans commented 4 years ago

General thoughts...

Other package level metadata:

Resource / Schema level metadata

zaneselvans commented 4 years ago

Need to break this out into several issues:

cmgosnell commented 4 years ago

Okay. The id and the version can and will be added via #419 and #426. I'd like to say that all of the ENUM/constants mess should not be considered part of this issue.

I've added start_date and end_date into the sources. From my understanding and experience, we can add any additional fields into the metadata. The sources are associated with the data package and with the resources. I also extracted the start_date and end_date from the sources and associated them with the resources as well. This may be too much.

Also, worth noting, now the bundle_id_pudl is the uuid, which is used internally to check if multiple data packages were generated as a part of the same bundle of packages.
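The bundle check described above might look something like the following sketch. It assumes each data package descriptor is a plain dict carrying a `bundle_id_pudl` key; the helper names (`new_bundle_id`, `stamp_bundle`, `same_bundle`) are made up for illustration and are not PUDL functions.

```python
import uuid


def new_bundle_id():
    """Create one random UUID to share across a bundle of data packages."""
    return str(uuid.uuid4())


def stamp_bundle(descriptors, bundle_id):
    """Return copies of the descriptors with the shared bundle ID attached."""
    return [{**descriptor, "bundle_id_pudl": bundle_id} for descriptor in descriptors]


def same_bundle(descriptors):
    """True only if every descriptor carries the same, non-missing bundle ID."""
    ids = {descriptor.get("bundle_id_pudl") for descriptor in descriptors}
    return len(ids) == 1 and None not in ids
```

The ID is generated once per run and stamped on every package, so two packages from different runs would fail the `same_bundle` check even if their contents were identical.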

If we extract all of these sub-issues, I think this issue should be closed now? @zaneselvans what do you think?

zaneselvans commented 4 years ago

I think having the start_date and end_date for the data in the resource only associated with the resource is plenty, and it's usually better to just have one authoritative location.
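To make the "one authoritative location" point concrete, here is a small sketch in which the dates live only on the resources and any package-wide range is derived on demand rather than stored. The resource names, the date values, and the `package_date_range` helper are invented for illustration; the exact placement of `start_date` and `end_date` inside a real descriptor is an assumption.

```python
from datetime import date

# Made-up resources; the dates are placeholders, not real PUDL coverage.
RESOURCES = [
    {"name": "plants", "start_date": "2011-01-01", "end_date": "2018-12-31"},
    {"name": "generation", "start_date": "2009-01-01", "end_date": "2017-12-31"},
]


def package_date_range(resources):
    """Derive the overall date span from resource-level dates instead of storing it twice."""
    starts = [date.fromisoformat(resource["start_date"]) for resource in resources]
    ends = [date.fromisoformat(resource["end_date"]) for resource in resources]
    return min(starts), max(ends)
```

Because nothing is copied up to the package level, there is no second copy to drift out of sync when a resource's coverage changes.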

Other than that, yeah I think all the other stuff is now covered in the other listed issues.

zaneselvans commented 4 years ago

Oh but I do still owe you some keywords for the various data sources.