Clean up and finish compiling PUDL Metadata

zaneselvans commented 4 years ago

Along with #399 (removing extraneous / unused metadata) we need to update and clean up the megadata JSON file: get rid of all the "idfk" and other placeholders, and fill it in with good information before we publish it.

Also need to add @gschivley as a contributor.

cmgosnell commented 4 years ago

Some of the clean up will happen in Issue #399, but in terms of adding things, I think it would be good to add in keywords and version. For keywords, I assume we can have pudl level keywords and dataset level keywords and squish them together according to what datasets are in the package. Should these be stored in constants or in the megadata file?
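A rough sketch of the "squish them together" idea, assuming the keywords live in plain Python structures. The names `PUDL_KEYWORDS`, `DATASET_KEYWORDS` and `compile_keywords` are hypothetical, and so are the keyword values; nothing here reflects where the lists actually end up being stored.

```python
# Hypothetical keyword lists; they could equally be read from the megadata file.
PUDL_KEYWORDS = ["us", "electricity", "open data"]
DATASET_KEYWORDS = {
    "eia860": ["electricity", "generator", "capacity", "eia860"],
    "epacems": ["emissions", "hourly", "us", "epacems"],
}


def compile_keywords(datasets):
    """Combine PUDL-level and dataset-level keywords for one data package."""
    combined = list(PUDL_KEYWORDS)
    for dataset in datasets:
        combined.extend(DATASET_KEYWORDS[dataset])
    # dict.fromkeys drops repeats while keeping first-seen order.
    return list(dict.fromkeys(combined))


keywords = compile_keywords(["eia860", "epacems"])
```

Deduplicating at compile time means the per-dataset lists can overlap freely without the package ending up with repeated keywords.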

Anything else?

zaneselvans commented 4 years ago

General thoughts...

Should we remove CPI from the contributors list here and also in LICENSE.txt?
For published data packages I think we should use the id field to store the Zenodo DOI for the package, which can be reserved in advance of publication through the Zenodo API.
The UUID that we use to identify data packages that are part of the same bundle should probably be given its own separate field, like pudl-bundle-id since it's not something that can be looked up in any other registry.
What is the unit of data that we are going to be archiving? Is it the bundle of data packages, or is it the individual data packages? If it's the individual data packages, then they will each need their own separate DOI, and they'll each get their own stream of archives on Zenodo, but will be linked to each other by the UUID, which indicates they are mutually compatible. We want folks to be able to download individual data packages and not just the entire bundle, but I think we can include multiple individual files within the Zenodo archive. We should look into how one can programmatically access those individual files, and design our archiving to make it easy to access individual data packages that way.
In the sources list, are we actually allowed to have fields other than title, path and/or email? Right now it includes the ETL parameters for each of the sources as a dictionary as well (which I agree is good metadata to include -- just wondering where it should best go)
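To make the id / bundle ID / sources questions concrete, here is a minimal sketch of what the package-level descriptor might look like. The package name, the DOI (an obvious placeholder) and the `parameters` key are all assumptions for illustration, not settled field names.

```python
import json
import uuid

# One UUID per ETL run, shared by every data package in the bundle.
bundle_id = str(uuid.uuid4())


def make_descriptor(name, doi, bundle_id, sources):
    """Build a package-level descriptor with the DOI and bundle ID kept separate."""
    return {
        "name": name,
        "id": doi,  # Zenodo DOI, reserved before publication
        "pudl-bundle-id": bundle_id,  # only meaningful within PUDL
        "sources": sources,
    }


descriptor = make_descriptor(
    name="pudl-eia860",  # hypothetical package name
    doi="10.5281/zenodo.0000000",  # placeholder, not a real DOI
    bundle_id=bundle_id,
    sources=[
        {
            "title": "eia860",
            "path": "https://www.eia.gov/electricity/data/eia860/",
            # Extra key holding the ETL parameters; the key name is an assumption.
            "parameters": {"years": [2011, 2012]},
        }
    ],
)
print(json.dumps(descriptor, indent=2))
```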

Other package level metadata:

Version of the PUDL software which was used to generate the data package (and for every data release, this should correspond to an archived software release, rather than a development version from git). This would be in its own (extra) field, maybe pudl-version?
Version of the data package: how should we do this versioning? Frictionless has these guidelines, which follow semantic versioning rules. The versions on the individual data packages may change differently over time.
Start Date and End Date for the data that's included in the package, in ISO-8601 format (YYYY-MM-DD) or maybe a Date Range -- something that's machine parsable and indicates what the temporal extent of the data is. If it's possible for this extent to be different for individual resources within the data package, then maybe it should be stored at the tabular data resource level instead?
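A sketch of how these package-level fields could be checked before they are written out. The function name and the `start_date` / `end_date` keys are assumptions, and the version pattern is deliberately simplified to plain MAJOR.MINOR.PATCH, which has the side effect of rejecting development versions.

```python
import re
from datetime import date

# Simplified check: MAJOR.MINOR.PATCH only, no pre-release or dev suffixes.
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")


def package_metadata(pudl_version, data_version, start, end):
    """Validate and assemble the version and temporal-extent fields."""
    for version in (pudl_version, data_version):
        if not SEMVER.match(version):
            raise ValueError(f"not a release version: {version!r}")
    # fromisoformat raises ValueError on anything that is not YYYY-MM-DD.
    start_date = date.fromisoformat(start)
    end_date = date.fromisoformat(end)
    if start_date > end_date:
        raise ValueError("start date is after end date")
    return {
        "pudl-version": pudl_version,
        "version": data_version,
        "start_date": start_date.isoformat(),
        "end_date": end_date.isoformat(),
    }


meta = package_metadata("0.3.0", "1.0.0", "2011-01-01", "2018-12-31")
```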

Resource / Schema level metadata

As mentioned above, temporal extent/coverage for data in a resource, if that's the right place for it to be stored (rather than at the package level -- e.g. in the case of epacems, the eia860 data is only 2011 and on, while the hourly-emissions data goes back to 1995)
Do we want to specify additional constraints on any of the fields?
Do we want to change any of the small glue-like tables into ENUMs?
Do we want to store re-used ENUMs outside of the megadata so they aren't duplicated?
Should we include official units of measure for fields where they are well defined? What is the best canonical set of units as strings for machine parsing out there?
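For the constraints / ENUM / units questions, a sketch of what field-level metadata could look like. Table Schema does define `constraints` with `enum` and `minimum`; the `unit` key, the field names and the fuel list are assumptions made up for this example.

```python
# A re-usable ENUM defined once, outside any single table definition (hypothetical values).
FUEL_TYPES = ["coal", "gas", "nuclear", "solar", "wind"]


def enum_field(name, values):
    """Build a string field restricted to a shared list of allowed values."""
    return {"name": name, "type": "string", "constraints": {"enum": list(values)}}


schema = {
    "fields": [
        enum_field("fuel_type", FUEL_TYPES),
        # "unit" is not part of the Table Schema spec; it is an extra key here.
        {"name": "capacity_mw", "type": "number", "unit": "MW",
         "constraints": {"minimum": 0}},
    ]
}


def violations(row, schema):
    """Return the names of fields in row that break their constraints."""
    bad = []
    for field in schema["fields"]:
        value = row[field["name"]]
        constraints = field.get("constraints", {})
        if "enum" in constraints and value not in constraints["enum"]:
            bad.append(field["name"])
        elif "minimum" in constraints and value < constraints["minimum"]:
            bad.append(field["name"])
    return bad
```

Defining the ENUM once and referencing it from each field is one way to avoid duplicating the allowed values across tables.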

zaneselvans commented 4 years ago

Need to break this out into several issues:

Issue #52: Catalog WTF is actually being stored in constants.py so we can have a conversation about where and how to store the ENUMs and other small data structures, which aren't really code.
Issue #419: Use the Zenodo API to reserve a DOI for data packages pre-publication.
Issue #425: Write a data package publishing/release script which makes use of the Zenodo API to publish our metadata directly rather than requiring it to be done by hand.
Issue #426: Enrich the individual Field level metadata with additional constraints, machine readable units, re-usable ENUMs, etc.
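For #419, a rough sketch of the DOI reservation step using only the standard library. It assumes, from my reading of the Zenodo REST API docs, that POSTing an empty JSON object to the depositions endpoint creates a draft whose response carries a pre-reserved DOI under `metadata.prereserve_doi.doi`; the endpoint and field names should be confirmed against the Zenodo sandbox before relying on them. The function names are hypothetical.

```python
import json
import urllib.request

ZENODO_DEPOSITIONS = "https://zenodo.org/api/deposit/depositions"


def build_reservation_request(token):
    """Build (but do not send) the POST that creates an empty draft deposition."""
    return urllib.request.Request(
        ZENODO_DEPOSITIONS,
        data=json.dumps({}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def reserved_doi(deposition):
    """Pull the pre-reserved DOI out of a deposition response (assumed layout)."""
    return deposition["metadata"]["prereserve_doi"]["doi"]


def reserve_doi(token):
    """Create the draft deposition and return its reserved DOI (needs network access)."""
    with urllib.request.urlopen(build_reservation_request(token)) as response:
        return reserved_doi(json.load(response))
```

Splitting the request building and response parsing away from the network call keeps both pieces testable without a Zenodo token.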

cmgosnell commented 4 years ago

Okay. The id and the version can and will be added via #419 and #426. I'd like to say that all of the ENUM/constants mess should be considered not a part of this issue.

I've added start_date and end_date into the sources... from my understanding and experience, we can add any additional fields into the metadata. The sources are associated with the data package and with the resources. I also extracted the start_date and end_date from the sources and associated them with the resources as well. This may be too much.

Also, worth noting, now the bundle_id_pudl is the uuid, which is used internally to check if multiple data packages were generated as a part of the same bundle of packages.
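Roughly, the internal check amounts to something like the following. This is a sketch rather than the actual implementation, and `same_bundle` is a hypothetical name.

```python
def same_bundle(descriptors):
    """True if every descriptor carries the same, non-missing bundle_id_pudl."""
    bundle_ids = {d.get("bundle_id_pudl") for d in descriptors}
    return len(bundle_ids) == 1 and None not in bundle_ids
```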

If we extract all of these sub-issues, I think this issue should be closed now? @zaneselvans what do you think?

zaneselvans commented 4 years ago

I think having the start_date and end_date for the data in the resource only associated with the resource is plenty, and it's usually better to just have one authoritative location.
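If a package-level extent is ever needed, it could be derived from the resources on demand instead of being stored a second time. A minimal sketch, with hypothetical resource names and end dates:

```python
# Resource names and end dates are made up; the start years follow the epacems example above.
resources = [
    {"name": "plants_eia860", "start_date": "2011-01-01", "end_date": "2018-12-31"},
    {"name": "hourly_emissions_epacems", "start_date": "1995-01-01", "end_date": "2018-12-31"},
]


def package_extent(resources):
    """Earliest start and latest end across resources (ISO dates sort as strings)."""
    return (
        min(r["start_date"] for r in resources),
        max(r["end_date"] for r in resources),
    )


extent = package_extent(resources)
```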

Other than that, yeah I think all the other stuff is now covered in the other listed issues.

zaneselvans commented 4 years ago

Oh but I do still owe you some keywords for the various data sources.

eia860: electricity, electric, boiler, generator, plant, utility, fuel, coal, natural gas, prime mover, us, eia860, retirement, capacity, planned, proposed, energy, hydro, solar, wind, nuclear, form 860, eia, annual, gas, ownership, steam, turbine, combustion, combined cycle, energy information administration
eia923: fuel, boiler, generator, plant, utility, cost, price, natural gas, coal, eia923, us, energy, electricity, form 923, receipts, generation, net generation, monthly, annual, gas, fuel consumption, MWh, energy information administration, eia, mercury, sulfur, ash, lignite, bituminous, subbituminous, heat content
ferc1: electricity, electric, utility, plant, steam, generation, cost, expense, price, heat content, ferc, form 1, federal energy regulatory commission, capital, accounting, depreciation, finance, plant in service, hydro, coal, natural gas, gas, opex, capex, accounts, investment, us, capacity
epacems: epa, us, emissions, pollution, ghg, so2, co2, sox, nox, load, utility, electricity, plant, generator, unit, generation, capacity, output, power, heat content, mmbtu, steam, cems, continuous emissions monitoring system, environmental protection agency, ampd, air markets program data, hourly
epaipm: @gschivley do you have any keywords that would make sense here?
