DataONEorg / api-documentation


harmonize ORE, schema.org, and CodeMeta in DataONE packages #11

Open mbjones opened 4 years ago

mbjones commented 4 years ago

This issue is to discuss and decide on an architectural approach to include CodeMeta metadata documents as additional metadata within a DataONE data package alongside the other metadata such as EML or ISO that might be present for documenting data. This is useful when the package contains software such as R or python scripts in addition to the data in the package.

CodeMeta is a profile of schema.org, and is being harmonized to be completely congruent with schema.org. So this discussion really revolves around how to integrate schema.org into DataONE packages, which would be natural given that we already provide schema.org in our dataset landing pages. The https://schema.org/Dataset structure in our landing pages is in many ways conceptually aligned with our ORE data package model, where the schema:Dataset plays the same role as ore:Aggregation. Our ORE files carry other similar metadata as well, such as dc:identifier, along with our PROV statements such as prov:wasDerivedFrom or prov:used. Consequently, we could easily consider our landing-page serialization of schema.org metadata to be a JSON-LD version of our current ORE package format. In fact, if we were to support JSON-LD as a serialization format for our packages, then the schema.org, CodeMeta, ORE, and PROV vocabularies could all be present and used in the same document, and the package description would be serializable in JSON-LD as both an ORE document and a schema.org file. My proposal, therefore, is that we:

This would allow us to integrate package metadata, schema.org metadata, PROV metadata, and CodeMeta metadata all in a coherent model. We've been discussing this with @atn38, @srearl, @twhiteaker, and other people from EDI and LTER for some specific guidelines there, and I will include that conversation in the next comment for reference.
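To make the idea concrete, here is a hypothetical sketch of what a combined JSON-LD package description could look like, built as a plain Python dict. All identifiers, the context prefixes, and the overall shape are illustrative assumptions for this discussion, not a settled DataONE format:

```python
import json

# A hypothetical JSON-LD package description mixing the ORE, schema.org,
# and PROV vocabularies in a single document. Identifiers and structure
# are illustrative only -- not an agreed-upon DataONE format.
package = {
    "@context": {
        "ore": "http://www.openarchives.org/ore/terms/",
        "schema": "https://schema.org/",
        "prov": "http://www.w3.org/ns/prov#",
        "dc": "http://purl.org/dc/elements/1.1/",
    },
    "@id": "urn:uuid:example-package",  # hypothetical package PID
    # The aggregation and the dataset are the same node, typed with both.
    "@type": ["ore:Aggregation", "schema:Dataset"],
    "dc:identifier": "urn:uuid:example-package",
    "schema:name": "Example dataset with code",
    # The aggregation lists both data and the script that produced it;
    # schema:SoftwareSourceCode is the type CodeMeta builds on.
    "ore:aggregates": [
        {"@id": "urn:uuid:data-1", "@type": "schema:DataDownload"},
        {
            "@id": "urn:uuid:script-1",
            "@type": "schema:SoftwareSourceCode",
            "schema:programmingLanguage": "R",
        },
    ],
    # PROV statements live in the same graph.
    "prov:wasDerivedFrom": {"@id": "urn:uuid:raw-data-0"},
}

print(json.dumps(package, indent=2))
```

The same document could then be read as an ORE resource map (via the ore: terms) or as a schema.org Dataset (via the schema: terms), which is the convergence proposed above.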

mbjones commented 4 years ago

Thread on integrating CodeMeta for reference and further details:

Hello Matt,

> For EDI, I believe the file is generated by Pasta, with some input from info in the EML file.

PASTA does not support ProvONE, even if they do generate ORE pkg files, correct?

> how to handle multiple metadata files in a package (e.g., both an EML and a CodeMeta)

IMHO that's a great question, and it seems that convergence would be the dream, esp. using schema.org as an alternative to, and possibly eventually replacing (?), ORE pkg files. On a smaller scale, though, we are writing a best practices document for software submissions to EDI, and in it we are tentatively recommending including both the code and the codemeta.json files as data entities and documenting them in EML. That makes the most sense atm and in the LTER sites context; however, it does feel strange, as CodeMeta is itself a metadata document, as you say, so we're documenting one form of metadata (CodeMeta) in another (EML), which in turn is documented by yet another metadata level (ORE). Doing it this way might also require additional parsing just to detect a codemeta.json in a package. More pondering...
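That detection concern can be sketched in a few lines. This assumes a consumer only has (pid, formatId, fileName) tuples for each package member, as one might get from DataONE system metadata; the formatId string matched here is a guess for illustration, not a registered DataONE format ID:

```python
# Sketch: detecting a CodeMeta document among package members, given
# (pid, formatId, fileName) tuples. The non-generic formatId string below
# is a hypothetical placeholder -- check the DataONE object-format
# registry for whatever ID is actually agreed upon.
def find_codemeta(members):
    hits = []
    for pid, format_id, file_name in members:
        if (format_id == "science-on-schema.org/codemeta"
                or file_name.lower() == "codemeta.json"):
            hits.append(pid)
    return hits

members = [
    ("urn:uuid:eml-1", "eml://ecoinformatics.org/eml-2.2.0", "metadata.xml"),
    ("urn:uuid:code-1", "application/json", "codemeta.json"),
    ("urn:uuid:data-1", "text/csv", "temps.csv"),
]
print(find_codemeta(members))  # -> ['urn:uuid:code-1']
```

Matching on file name alone is fragile, which is exactly why a dedicated format ID (or first-class metadata typing, as discussed below in this thread) would make CodeMeta entities reliably discoverable.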

Here's the document https://docs.google.com/document/d/1BcDgTtrcC6bt814xnJT_aAHf_6booU2nMI83PWc7ots/edit#heading=h.whxk7rpmonse. We'd appreciate your feedback, esp. re: CodeMeta handling!

On Thu, Feb 6, 2020 at 11:00 AM Matthew Jones via RT support@arcticdata.io wrote:

Hi Tim,

In general in DataONE, the ORE package file is submitted along with the rest of the data package. For EDI, I believe the file is generated by Pasta, with some input from info in the EML file. I don't think Pasta currently allows you to modify the package file yourself, but many of the other DataONE members expose that functionality to clients. It is part of our client tools, and the R tools in "datapack" generate the files for you and give an API for working with them.

The CodeMeta file would be great in the recs.

A bit of a rambling tangent for you to ponder....I have been struggling with how to handle multiple metadata files in a package (e.g., both an EML and a CodeMeta). We have support for that in theory in DataONE, but in practice all packages really have a single metadata document. Except that we also have the ORE package file, which is a form of metadata. And it turns out that there is a strong correspondence between the role of the ORE package description and the schema.org entry on a dataset landing page. The ORE and the schema.org generally have the same info, and sometimes share vocabularies (like DCTerms). So, I've been contemplating making those compatible, and converting all of the metadata in the ORE (like PROV statements) into the schema.org representation. As it turns out, CodeMeta also follows the schema.org vocabulary, so I could totally see having the CodeMeta document be represented in the schema.org landing page (that's really what it is), and then also be in the ORE file in the DataONE package. Or possibly allow a schema.org file as an alternative package manifest to ORE in DataONE. There is a bunch to think through here, and I think great opportunities for convergence. I've been thinking the ESIP schema.org space is the best place to come together on some of these ideas that span vocabularies, and I will probably make a proposal to this effect sometime soonish. Would love to hear your and others' thoughts.

Matt

On Wed, Feb 5, 2020 at 2:21 PM Whiteaker, Timothy L via RT < support@arcticdata.io> wrote:

<URL: https://support.nceas.ucsb.edu/rt/Ticket/Display.html?id=19691 >

Hi Matt,

> The easiest approach right now is to just add the information to the ORE package file for a dataset

Who performs that task? Does the data submitter do that on their own, or do you guide them through it, or do you do it yourself with guidance from the data submitter? For the parts that datapack doesn't cover, do you manually edit the ORE package file?

Does ProvONE handle relationships between entities residing in different packages?

We're working on best practices for LTER sites archiving software/code, simulation model inputs and outputs, and model parameter files. Among our recommendations, we're suggesting that folks archiving software/code include a CodeMeta file with their package. When a draft is ready for review, may we solicit your feedback on the suggested practices?

Tim Whiteaker, Associate Director, Center for Water and the Environment <https://cwe.engr.utexas.edu/>, The University of Texas at Austin

---------- Forwarded message --------- From: Matthew Jones via RT <support@arcticdata.io> Date: Wed, Feb 5, 2020 at 12:12 PM Subject: Re: [arcticdata #19691] question: how ADC adds derived/source relationships to data entities To: enthusiast@utexas.edu

Hi An,

We've decided to put provenance information following the ProvONE model into the data package description for DataONE because it affords more flexibility across packages supporting different metadata standards (e.g., EML, ISO, FGDC, etc.). We've also been discussing how providing that provenance information in the schema.org entry on a dataset landing page would be helpful. Ultimately, I think the same information could be added to EML in additionalMetadata, but we would need to decide how that would be serialized and then how to extract it into the DataONE index for display.

It would be great to get more groups producing interoperable provenance traces. The easiest approach right now is to just add the information to the ORE package file for a dataset (we have R functions in the datapack package to help do that). Alternatively, we could work as a community to define how to include that information in EML. Let's chat!

Matt

On Wed, Feb 5, 2020 at 8:54 AM An T Nguyen via RT <support@arcticdata.io> wrote:

<URL: https://support.nceas.ucsb.edu/rt/Ticket/Display.html?id=19691 >

Thanks for the information Jeanette! We probably can't do anything with this info right now, but it's always good to know.

On Tue, Feb 4, 2020 at 5:18 PM Jeanette Clark via RT <support@arcticdata.io> wrote:

Hi An!

We add provenance relationships to the resource map of our data packages following the ProvONE model (http://jenkins-1.dataone.org/jenkins/view/Documentation%20Projects/job/ProvONE-Documentation-trunk/ws/provenance/ProvONE/v1/provone.html).

The interface displaying the icons (in addition to an interface to add the information) is built into MetacatUI, which is the software we use to edit and display our data packages. You can also add prov from R using the datapack package. Here is a tutorial from a training workshop we ran recently: http://training.arcticdata.io/materials/arctic-data-center-training/programming-metadata-and-data-publishing.html#publish-data-to-the-arctic-data-center-test-site
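Conceptually, the ProvONE statements added to a resource map are just RDF triples connecting object PIDs. A minimal stdlib-Python sketch of that idea (PIDs are hypothetical; in practice the datapack R package or MetacatUI writes these statements into the ORE for you):

```python
# Sketch: provenance relationships as triples in a resource map -- each
# asserts that a derived product was generated from a source object.
# PIDs below are hypothetical examples.
PROV = "http://www.w3.org/ns/prov#"

triples = set()

def add_derivation(derived_pid, source_pid):
    """Record a prov:wasDerivedFrom statement between two package members."""
    triples.add((derived_pid, PROV + "wasDerivedFrom", source_pid))

add_derivation("urn:uuid:figure-1", "urn:uuid:temps-csv")
add_derivation("urn:uuid:temps-csv", "urn:uuid:raw-sensor-dump")

for s, p, o in sorted(triples):
    print(s, "--", p.rsplit("#", 1)[1], "-->", o)
```

Because the statements are plain triples over PIDs, the same derivation chain can be displayed as the lineage icons MetacatUI shows, regardless of which tool wrote them.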

Hope that helps!

Jeanette

mbjones commented 4 years ago

@cboettig @datadavev @ashepherd @amoeba @csjx Thoughts on this ORE / schema.org / CodeMeta integration discussion with respect to DataONE packages? We should probably also throw the DCAT version 2 vocabulary into the mix, as it covers the same space and just came out this week.

cboettig commented 4 years ago

It would be great to see this happen; sounds like a hard problem.

:+1: to DCAT2, looks very promising.

Would you throw PAV into the mix? (https://github.com/pav-ontology/pav/wiki) I believe it's a PROV extension/translation that uses a more schema.org-style semantics (e.g., "author" can be a predicate, as in "doi:xxx was authored by bob", instead of "doi:xxx is associated with an entity bob who has a role of author").

I think schema.org is on to a good thing in (re-)defining all these terms in its own namespace as a mechanism for harmonizing things. However, I also find the current schema.org namespace really lacking in provenance and in the things the DCAT2 dataset has for documenting formats, compression, checksums, and the like. I know RDF is quite happy mixing vocabularies, but as a consumer I really struggle with consuming such data: (a) I don't want to run an OWL interpreter on the JSON-LD, and (b) most OWL files steer away from declaring sameAs equivalences anyway, for good reason, which makes it very hard to square the two different representations of "author" noted above, for instance.
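The two representations of "author" being contrasted can be laid out side by side as JSON-LD fragments. Both shapes below are illustrative sketches (and `ex:` is a made-up prefix), but they follow the general patterns of schema.org and PROV-O respectively:

```python
import json

# schema.org style: authorship is a direct predicate on the resource.
schema_style = {
    "@id": "doi:xxx",
    "https://schema.org/author": {"https://schema.org/name": "bob"},
}

# PROV-O style: the agent hangs off a qualified attribution node that
# carries the role, rather than off a dedicated "author" predicate.
prov_style = {
    "@id": "doi:xxx",
    "prov:qualifiedAttribution": {
        "@type": "prov:Attribution",
        "prov:agent": {"@id": "_:bob", "foaf:name": "bob"},
        "prov:hadRole": {"@id": "ex:author"},  # "ex:" is a made-up prefix
    },
}

print(json.dumps({"schema": schema_style, "prov": prov_style}, indent=2))
```

A consumer wanting "who authored doi:xxx" has to handle both shapes, which is exactly the squaring problem described above: no sameAs or property chain in the published ontologies connects them automatically.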

amoeba commented 4 years ago

This sounds like a nice unification.

You got me thinking about how this might affect DataONE packaging and I see a few possible scenarios implied:

  1. JSON-LD as supplementary metadata in a traditional DataONE Data Package (i.e., don't replace EML, ISO, etc). Typed as METADATA in DataONE.
  2. JSON-LD as the primary metadata in a traditional DataONE Data Package (i.e., do replace EML, ISO, etc). Typed as METADATA in DataONE.
  3. JSON-LD as alternative to ORE, typed as RESOURCE in DataONE
  4. ORE becomes optional, JSON-LD plays a hybrid role between traditional DataONE Data Package ORE and the primary metadata record. This is a new concept for DataONE. This can also be read as METADATA becomes optional, e.g., you can include your authors, title, etc. in the ORE or your metadata and the result is the same.

Getting good support for (1) and (2) is important, valuable, and also lower-hanging fruit. (3) sounds like a small improvement. (4) is a really cool idea and also a big shift from our current architecture. It looks a lot like an RDF-first approach versus our current hybrid RDF/XML + XML approach. It would mean that information connecting DataONE Objects could come from a variety of sources, which is a very RDF / graph-oriented way of thinking, and I wonder if we wouldn't then need to power all of this with an RDF triplestore to get full use.
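Scenario (4) implies that a consumer merges dataset-level fields from wherever they happen to live. A minimal sketch of that merge, assuming hypothetical field names and a precedence rule (dedicated metadata record wins, package description is the fallback) chosen purely for illustration:

```python
# Sketch of scenario (4): dataset-level fields (title, creator) may live
# either in a separate metadata record or directly in the JSON-LD package
# description, and a consumer resolves across both. Field names and the
# precedence rule are assumptions for illustration.
def resolve_field(field, metadata_doc, package_doc):
    """Prefer the dedicated metadata record; fall back to the package."""
    if metadata_doc and field in metadata_doc:
        return metadata_doc[field]
    return package_doc.get(field)

package_doc = {"schema:name": "Lake temps 2019", "schema:creator": "An N."}
metadata_doc = {"schema:name": "Lake temperatures, 2019"}

print(resolve_field("schema:name", metadata_doc, package_doc))
print(resolve_field("schema:creator", metadata_doc, package_doc))
```

Even this toy version shows why a triplestore becomes attractive: once statements about the same object can come from several documents, per-document lookups like this turn into graph queries.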