Helmholtz Kernel Information Profile integration

OpenEnergyPlatform / oemetadata

Repository for the Open Energy Family metadata. Contains metadata templates, examples and schemas. For metadata conversion see https://github.com/OpenEnergyPlatform/omi

https://openenergyplatform.github.io/oemetadata/

MIT License


Helmholtz Kernel Information Profile integration #138

Open christian-rli opened 6 months ago

christian-rli commented 6 months ago

Description of the issue

As pointed out by @carstenhoyerklick there should be a way to handle the Helmholtz Kernel Information Profile in oemetadata or to at least map it.

Ideas of solution

New field? Reference in an existing one? Please discuss.

Workflow checklist

[x] I am aware of the workflow in CONTRIBUTING.md

carstenhoyerklick commented 6 months ago

In chapter 3 there are a number of data fields which should be part of the metadata. We should look at the keys there and check which ones we are still missing.

e.g.
• underEmbargoUntil

For most of them we have an equivalent in OEMetadata; we should list a mapping there.

carstenhoyerklick commented 4 months ago

I started a mapping between the Helmholtz KIP and OE Metadata.

https://docs.google.com/spreadsheets/d/1Q0tWNRujw3taKw4f-jUVjlFl2anZe38WtVpcD9IlO2s/edit?usp=sharing

The legend is roughly:

White: Existing in OEMetadata
Yellow: not existing in OEMetadata
Orange: Maybe we can use information from the databus Metadata

jh-RLI commented 4 months ago

Great, that's a very helpful first step. I will provide a full example of the new oemetadata version proposal in advance of our meeting next week.

christian-rli commented 2 months ago

Thank you for a helpful start @carstenhoyerklick . Do I understand correctly that the proposed course of action is to implement all the fields highlighted in yellow and ignore the orange ones, because they are covered by the databus? Or should we implement the orange ones as well, so that the information can be shown in the regular metadata? Either way it's quite a big extension of the current standard. Do we agree that all of the fields should show up there? If yes, I'm happy to implement them in the example files and schemas.

carstenhoyerklick commented 2 months ago

My personal preference would be to implement them all and to use the metadata string as a master source to populate the databus. On the other hand, if you have information in two places there is the danger of contradicting information. That again might be a reason to have it all in the metadata string, as an authoritative source.

christian-rli commented 2 months ago

@jh-RLI and I agree. We will implement them in the next version. The resulting list of fields will be quite long and intimidating. Therefore we also decided that the tooling will take that into account. The conversion and export software will return the metadata with only the populated fields by default - empty fields are provided optionally.
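The "populated fields only" default could look roughly like the sketch below. It is only an illustration under assumptions: the function names `prune_empty` and `export_metadata` and the `include_empty` flag are hypothetical and not taken from the existing conversion or export tooling, and "empty" is assumed to mean an empty string, `None`, an empty list or an empty object.

```python
from typing import Any

# Values treated as "not populated" (assumption for this sketch).
_EMPTY = ("", None, [], {})


def prune_empty(value: Any) -> Any:
    """Recursively drop empty strings, None, empty lists and empty dicts."""
    if isinstance(value, dict):
        pruned = {key: prune_empty(item) for key, item in value.items()}
        return {key: item for key, item in pruned.items() if item not in _EMPTY}
    if isinstance(value, list):
        pruned_items = [prune_empty(item) for item in value]
        return [item for item in pruned_items if item not in _EMPTY]
    return value


def export_metadata(metadata: dict, include_empty: bool = False) -> dict:
    """Return only populated fields by default; keep everything on request."""
    return metadata if include_empty else prune_empty(metadata)
```

Pruning runs bottom-up, so a nested object whose fields are all empty disappears as well, which keeps a long list of optional keys from showing up as a block of empty strings.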

christian-rli commented 2 months ago

@jh-RLI and I sorted through the new tags and came up with a structure. We thought it made sense to group almost all new keys together on resource level.

"trace": {
  "alternateOf": "",
  "checksum": "",
  "dateModified": "",
  "digitalObjectLocationAccessProtocol": "",
  "digitalObjectType": "",
  "hadPrimarySource": "",
  "hasMetadata": "",
  "isMetadataFor": "",
  "policy": "",
  "provenanceGraph": "",
  "specializationOf": "",
  "version": "",
  "wasDerivedFrom": "",
  "wasGeneratedBy": "",
  "wasRevisionOf": "",
  "wasQuotedFrom": "",
  "contributors": [
    {
      "title": "John Doe",
      "email": "contact@example.com",
      "date": "2016-06-16",
      "object": "data and metadata",
      "comment": "Fix typo in the title."
    }
  ]
},

The key name is open for debate. We were looking for something that encompasses things you would need for provenance and reproducibility. Other candidates were 'track', 'trail', 'linked data' or 'provenance'. Currently we like 'trace', but feel free to convince us otherwise.

Other notes:

I understand "isMetadataFor" such that by default it would describe the resource on the OEP. In other words the key would be a duplicate "id" most of the time. Therefore on the OEP it should basically be hidden virtually all the time.

There is no explanation for "locationPreview". Can you help out @carstenhoyerklick ?

"underEmbargoUntil" can go next to date. It's a bit awkward to implement, because one turns into the other, ideally, but if it's not actually published on the planned date there has to be a logic on the OEP to deal with that.
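A rough sketch of the kind of logic this would need, assuming both values are ISO 8601 date strings (like the "2016-06-16" in the example above) and that an empty string means "not set". The function name `embargo_status`, the `publication_date` parameter and the four status labels are made up for illustration and are not part of the OEP.

```python
from datetime import date


def embargo_status(under_embargo_until: str, publication_date: str, today: date) -> str:
    """Classify a resource from its embargo date and actual publication date.

    Both date arguments are ISO 8601 strings ("YYYY-MM-DD"); "" means not set.
    """
    if publication_date:
        # The resource was actually published, so the embargo no longer matters.
        return "published"
    if not under_embargo_until:
        return "unpublished"
    if date.fromisoformat(under_embargo_until) > today:
        return "embargoed"
    # The planned date has passed but nothing was published: needs attention.
    return "overdue"
```

The "overdue" case is the awkward one described above: the embargo date has passed without a publication date being set, so the platform cannot simply turn one field into the other.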

@Carsten can you maybe elaborate on the "wasQuotedFrom" field? What's the difference between sources? Does this concern the entire dataset (i.e. this whole table is actually a quote from another resource) or is it meant to reflect sources for parts of the data. Maybe another key within sources or a redefinition to a URI would help here. I assume it's not a "quotedBy" that lists where the resource has been quoted.

carstenhoyerklick commented 2 months ago

> I understand "isMetadataFor" such that by default it would describe the resource on the OEP. In other words the key would be a duplicate "id" most of the time. Therefore on the OEP it should basically be hidden virtually all the time.

I think we should think beyond the OEP here. On the OEP it duplicates the id, but on other repositories it may not. I think it is fair to hide it on the OEP.

> There is no explanation for "locationPreview". Can you help out @carstenhoyerklick ?

According to the HMC document (HMC Kernel Information Profile, page 22), it is a web-resolvable pointer to a preview, e.g. a low-resolution image of the object referenced. It comes from an RDA recommendation.

This may be relevant for non-tabular data, e.g. GIS data sets, which can be connected to a preview.

> "underEmbargoUntil" can go next to date. It's a bit awkward to implement, because one turns into the other, ideally, but if it's not actually published on the planned date there has to be a logic on the OEP to deal with that.

I think it is safe to ignore it on the OEP, as it takes only published data. But it may be relevant for other platforms.

> @carsten can you maybe elaborate on the "wasQuotedFrom" field? What's the difference between sources? Does this concern the entire dataset (i.e. this whole table is actually a quote from another resource) or is it meant to reflect sources for parts of the data. Maybe another key within sources or a redefinition to a URI would help here. I assume it's not a "quotedBy" that lists where the resource has been quoted.

What it means is that the documented data set is quoted in another data set. It is also an RDA recommendation. The documented data set could be a subset of a larger data set which has been divided. wasQuotedFrom could then point to an umbrella data set which references this data set as a subset. It is a kind of backpointer.

carstenhoyerklick commented 2 months ago

> @jh-RLI and I sorted through the new tags and came up with a structure. We thought it made sense to group almost all new keys together on resource level.

I thought about it for a while and I think we have to consider this carefully.

Some of the keys, such as alternateOf, checksum, digitalObjectLocationAccessProtocol or digitalObjectType, may belong more in the general part.

We have to think about what sources are and what revisions are. In general, if a data set is revised, the original data set is a source. But you could also think of sources as the data sets that were used to produce the data set: the new data set has been created by a fusion/modeling process, and these are the data sources. These sources may have very different characteristics from the target data set.

Revisions are a bit different. The characteristics of the data stay basically the same. A revision may also change some of the structures of the data.

The Helmholtz Kernel Information Profile differentiates between different types of sources. wasDerivedFrom is probably closest to the sources we have. specializationOf could be a subset of a larger data set, or something similar which makes this data set more specific than the original, or a data set that has been specifically enriched. wasRevisionOf is probably more of an update of the data set. The characteristics come from RDA or PROV-DM (the PROV Data Model). Therefore I think we cannot ignore these. But we have to find a way to handle the different source-target relations which come from the PROV Data Model.
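One possible way to handle these relations, given only as a sketch: read the three relation keys named above from a "trace" block and return them as typed (relation, target) pairs, so that tooling can tell a revision from a derivation. The function name `source_relations` is hypothetical, and the values are assumed to be single URI strings as in the draft structure earlier in this thread.

```python
from typing import Dict, List, Tuple

# The three PROV-style relation keys discussed in this comment.
PROV_RELATION_KEYS = ("wasDerivedFrom", "specializationOf", "wasRevisionOf")


def source_relations(trace: Dict[str, object]) -> List[Tuple[str, str]]:
    """Collect the populated source-target relations from a "trace" block."""
    relations: List[Tuple[str, str]] = []
    for key in PROV_RELATION_KEYS:
        value = trace.get(key)
        # Skip missing keys, empty strings and non-string values.
        if isinstance(value, str) and value:
            relations.append((key, value))
    return relations
```

Keeping the relation type next to each target would let the existing sources list stay as it is, while revisions and specializations are reported separately.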