Intention of adding provenance of DOIs with isBasedOn?

alko-k commented 4 years ago

Hi again @ashepherd, is there any intention of adding the isBasedOn schema.org property to refer to older DOIs in the full dataset JSON?

Thanks, Alexandra

ashepherd commented 4 years ago

Great idea, @alko-k! Could you write up a proposal here with an example we could use in the documentation?

alko-k commented 4 years ago

Thanks Adam, I will write up an example. For now, this is what Google suggests:

- Use the isBasedOn property in cases where the republished dataset (including its metadata) has been changed significantly.
- When a dataset derives from or aggregates several originals, use the isBasedOn property.

I'm also adding the schema.org GitHub issue for isBasedOn: https://github.com/schemaorg/schemaorg/issues/1993
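As a rough illustration of the two cases Google describes, here is a minimal Python sketch that builds a Dataset record with isBasedOn pointing at one or several source DOIs. The helper name `dataset_with_sources` and the placeholder DOIs are assumptions made for this example only; they are not part of any proposal.

```python
import json


def dataset_with_sources(doi, name, source_dois):
    """Build a minimal schema.org Dataset that points at its source datasets.

    Hypothetical helper for illustration; DOIs are placeholders.
    """
    doc = {
        "@context": {"@vocab": "https://schema.org/"},
        "@id": doi,
        "@type": "Dataset",
        "name": name,
    }
    # One source is written as a plain URL string, several sources as a list.
    if len(source_dois) == 1:
        doc["isBasedOn"] = source_dois[0]
    elif source_dois:
        doc["isBasedOn"] = list(source_dois)
    return doc


# Case 1: a republished dataset that changed significantly.
revised = dataset_with_sources(
    "https://doi.org/10.xxxx/Dataset-2",
    "Example revised dataset",
    ["https://doi.org/10.xxxx/Dataset-1"],
)

# Case 2: a dataset that aggregates several originals.
aggregate = dataset_with_sources(
    "https://doi.org/10.xxxx/Dataset-3",
    "Example aggregated dataset",
    ["https://doi.org/10.xxxx/Dataset-1", "https://doi.org/10.xxxx/Dataset-2"],
)

print(json.dumps(aggregate, indent=2))
```

Whether several sources should be listed together like this or described individually is a choice the guidance would still need to make.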

mbjones commented 4 years ago

schema:isBasedOn is a reasonable, although lightweight, provenance statement. In our other work, we use PROV-O predicates like prov:wasDerivedFrom, prov:used, and prov:wasGeneratedBy to express a more nuanced set of relationships among source and derived objects and the processes that were run. It seems like schema:isBasedOn is equivalent to prov:wasDerivedFrom, but lacks the ability to link to the processes that generated the derived data from the source data.

Here's an example data package in which we've embedded the PROV-O properties in our ORE manifest for the data package. You can look at the RDF triples we're using with a tool like rapper:

$ rapper -o turtle https://cn.dataone.org/cn/v2/object/resource_map_urn:uuid:c2e7831c-3e38-4ac1-a0b5-dff3a00ad9f1

So, I'd like to see our guidance recommend PROV-O vocabularies for provenance, with a recommendation that schema:isBasedOn could also be used and should be considered equivalent to prov:wasDerivedFrom.

alko-k commented 4 years ago

Thanks @mbjones for all your insight and examples. There is a small issue, though: the Structured Data Testing Tool that Google provides ('structured-data/testing-tool') reports errors when a document includes the additional prov properties...

mbjones commented 4 years ago

Yeah, we have run into the Google SDTT throwing an error when it encounters types outside of schema.org. It is annoying for sure. We have discussed it with Google, and they say their tools still import documents that contain other types and simply ignore those types. We asked them to downgrade the errors to warnings, but they indicated that the SDTT is focused on Google's import, so they want to keep them as errors. For our recommendations, we've decided to 1) mostly recommend schema.org types, but 2) go ahead and recommend other types when there isn't something suitable in schema.org. Our recommendations on external vocabularies in the @type field are being discussed in issue #74, our proposed language is about to be merged in PR #95, and our take is written up in the decision record on schema.org/additionalType.

ThomasThelen commented 4 years ago

Other projects have had similar goals of using schema.org to describe science artifacts, and they all seem to trickle external vocabularies in as they need to describe specifics. One example of a specification that mixes schema.org and W3C PROV is RO-Crate. You can see that they used mostly schema.org in this example but also brought in prov (and used it side by side with schema.org).

Normal w3c prov


As JSON-LD,

{
   "@context":[
      {
         "prov":"http://www.w3.org/ns/prov#"
      }
   ],
   "@graph":[
      {
         "@id":"plot.py_execution",
         "@type":"prov:Activity",
         "prov:used":"daily-total-female-births.csv"
      },
      {
         "@id":"daily-total-female-births.csv",
         "@type":"prov:Entity"
      },
      {
         "@id":"female-daily-births.png",
         "@type":"prov:Entity",
         "prov:wasGeneratedBy":"plot.py_execution"
      }
   ]
}

Minimal Extension to ProvONE

Note the provone namespace.

{
   "@context":[
      {
         "prov":"http://www.w3.org/ns/prov#"
      },
      {
         "provone":"http://purl.dataone.org/provone/2015/01/15/ontology#"
      }
   ],
   "@graph":[
      {
         "@id":"plot.py_execution",
         "@type":"provone:Execution",
         "prov:used":"daily-total-female-births.csv"
      },
      {
         "@id":"daily-total-female-births.csv",
         "@type":"provone:Data"
      },
      {
         "@id":"female-daily-births.png",
         "@type":"prov:Entity",
         "prov:wasGeneratedBy":"plot.py_execution"
      }
   ]
}

mbjones commented 4 years ago

Began a new branch https://github.com/ESIPFed/science-on-schema.org/tree/feature_72_provenance for editing the Guide and a new proposed provenance ADR for how we recommend handling provenance information. Editing is not complete, still working on:

Explanatory text
Figures showing the examples
Examples for each of the types of provenance relationships

mbjones commented 4 years ago

@ashepherd @datadavev @fils @alko-k @smrgeoinfo Completed first draft of the provenance proposal. Please review the:

@amoeba @csjx @gothub @mpsaloha Given your familiarity with our use of PROV and ProvONE in DataONE, I would appreciate if you could give this a look over as well. You'll note that I omitted the use of prov:qualifiedAssociation, and so I'd like to discuss the reasoning implications of omitting this. I'm not happy with it, but at the same time doing it correctly complicated the graph sufficiently to question whether it is sensible in the schema.org context. We may want to use another predicate than prov:hadPlan that we can equate via punning to the more complicated PROV model with Association.

datadavev commented 4 years ago

mbjones commented 4 years ago

@davev thanks for the pointer on schema:Action, I was unaware of that. I think it could be successfully used in place of provone:Execution, albeit with less semantic specificity. Maybe schema:CreateAction would be better. We could, I suppose, propose creating a schema:ExecuteAction term in schema.org as a subclass of schema:Action, which would be an equivalent class to provone:Execution. There doesn't seem to be an equivalent property to prov:hadPlan, but we might be able to make schema:instrument work to indicate the role of software if you don't worry over semantics too much. Here's my original proposed structure:

{
  "@context": {
    "@vocab": "https://schema.org/",
    "prov": "http://www.w3.org/ns/prov#",
    "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#"
  },
  "@id": "https://doi.org/10.xxxx/Dataset-2",
  "@type": "Dataset",
  "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016",
  "prov:wasDerivedFrom": "https://doi.org/10.xxxx/Dataset-1",
  "isBasedOn": "https://doi.org/10.xxxx/Dataset-1",
  "prov:wasGeneratedBy": {
    "@id": "https://example.org/executions/execution-42",
    "@type": "provone:Execution",
    "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R",
    "prov:used": "https://doi.org/10.xxxx/Dataset-1"
  }
}

And here's the same structure rewritten with only schema.org using CreateAction in place of Execution:

{
  "@context": {
    "@vocab": "https://schema.org/",
    "schema": "https://schema.org/"
  },
  "@graph": [
    {
      "@id": "https://doi.org/10.xxxx/Dataset-2",
      "@type": "https://schema.org/Dataset",
      "https://schema.org/name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016",
      "schema:isBasedOn": "https://doi.org/10.xxxx/Dataset-1"
    },
    {
      "@id": "https://example.org/executions/execution-42",
      "@type": "schema:CreateAction",
      "schema:instrument": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R",
      "schema:object": "https://doi.org/10.xxxx/Dataset-1",
      "schema:result": "https://doi.org/10.xxxx/Dataset-2"
    }
  ]
}

I think schema:result is the inverse of prov:wasGeneratedBy, so that works. It's a little more convoluted to have to create the implicit graph because there is no equivalent property to prov:wasGeneratedBy, or at least schema:result is the inverse, making the attachment to schema:Dataset different. However, I'm really unclear if this is the proper use of schema:object, where I used it in place of prov:used. The definition of schema:object is:

The object upon which the action is carried out, whose state is kept intact or changed. Also known as the semantic roles patient, affected or undergoer (which change their state) or theme (which doesn't). e.g. John read a book.

Which is confusingly similar to schema:result to me. They use the book as an example for both result and object, so I'm not sure really what they represent. Thoughts?

If we did all of this with schema.org, I'd want to be explicit in the guide as to the intended mapping to PROV so that equivalence could be had with people using the more precise PROV and ProvONE vocabularies. I think by comparing them to more explicit vocabularies we can make our intended interpretation clear. Feedback appreciated.

datadavev commented 4 years ago

Your example seems ok to me. I read schema:object as a target (input) of some Action, and schema:result as an outcome (output) of an Action.

That said, what is the practical benefit of using only schema.org terms when there is an established practice using the prov and provone semantics? Would it be better (less ambiguous) to promote the use of prov and provone and provide a mapping of terms from those to schema for consumers not familiar with prov?
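To make the idea of a term mapping concrete, here is a small Python sketch of the correspondences floated so far in this thread. The pairings are tentative and taken from the discussion above, not from any published alignment; the names `PROPERTY_MAP`, `CLASS_MAP`, and `translate`, and the `ex:` identifiers, are hypothetical and used only for illustration.

```python
# Tentative PROV/ProvONE -> schema.org correspondences discussed in this thread.
# The boolean marks properties whose schema.org counterpart points the other
# way, so subject and object must be swapped when translating.
PROPERTY_MAP = {
    "prov:wasDerivedFrom": ("schema:isBasedOn", False),
    "prov:hadPlan": ("schema:instrument", False),
    "prov:used": ("schema:object", False),
    "prov:wasGeneratedBy": ("schema:result", True),  # inverse direction
}
CLASS_MAP = {"provone:Execution": "schema:CreateAction"}


def translate(triple):
    """Rewrite one (subject, predicate, object) triple into schema.org terms."""
    s, p, o = triple
    if p == "rdf:type":
        return (s, p, CLASS_MAP.get(o, o))
    if p in PROPERTY_MAP:
        schema_p, inverse = PROPERTY_MAP[p]
        return (o, schema_p, s) if inverse else (s, schema_p, o)
    return triple  # no known counterpart; leave the triple untouched


prov_triples = [
    ("ex:Dataset-2", "prov:wasDerivedFrom", "ex:Dataset-1"),
    ("ex:Dataset-2", "prov:wasGeneratedBy", "ex:execution-42"),
    ("ex:execution-42", "rdf:type", "provone:Execution"),
    ("ex:execution-42", "prov:used", "ex:Dataset-1"),
]
for t in prov_triples:
    print(translate(t))
```

The swap for prov:wasGeneratedBy reflects the observation that schema:result attaches to the action rather than to the dataset. The translation loses precision in the other direction, which is part of the argument for treating PROV as the primary encoding.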

datadavev commented 4 years ago

btw, this is another way of writing your second example above to be slightly more Dataset centric:

{
  "@context": {
    "@vocab": "https://schema.org/",
    "resultOf": {
      "@reverse": "result"
    }
  },
  "@id": "https://doi.org/10.xxxx/Dataset-2",
  "@type": "Dataset",
  "isBasedOn": "https://doi.org/10.xxxx/Dataset-1",
  "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016",
  "resultOf": {
    "@id": "https://example.org/executions/execution-42",
    "@type": "CreateAction",
    "instrument": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R",
    "object": "https://doi.org/10.xxxx/Dataset-1"
  }
}

amoeba commented 4 years ago

This looks really good and the edits to the Dataset guide look and read great.

Something that stands out to me is the shape of the prov:wasGeneratedBy example, specifically the "@id": "https://example.org/executions/execution-42", triple. If I ran into that in a guide I wouldn't know what to do because of the made-up URI. I don't know if non-resolving URIs really fit into the Schema.org pattern or SOSO for that matter.

I might flatten it, like:

{
  "@context": {
    "@vocab": "https://schema.org/",
    "prov": "http://www.w3.org/ns/prov#",
    "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#"
  },
  "@id": "https://doi.org/10.xxxx/Dataset-2",
  "@type": "Dataset",
  "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016",
  "prov:wasDerivedFrom": "https://doi.org/10.xxxx/Dataset-1",
  "prov:wasGeneratedBy": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R"
}

(The prov:wasGeneratedBy triple now isn't really valid as the object isn't really a prov:Activity.)

I can see you're trying to find a way to capture an execution explicitly and my example makes the execution implicit and vague. Another property, like foo:wasDerivedBy might make this a little more clear that prov:wasDerivedFrom and prov:wasGeneratedBy are connected but ultimately my example is less rich.

While Schema.org tends to be pretty flat, SOSO doesn't really shy away from it, so an alternative to my super flat example might look like:

{
  "@context": {
    "@vocab": "https://schema.org/",
    "prov": "http://www.w3.org/ns/prov#",
    "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#"
  },
  "@id": "https://doi.org/10.xxxx/Dataset-2",
  "@type": "Dataset",
  "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016",
  "prov:wasDerivedFrom": "https://doi.org/10.xxxx/Dataset-1",
  "prov:wasGeneratedBy": {
    "@id": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R",
    "@type": "Foo",
    "foo:used": "https://doi.org/10.xxxx/Dataset-1"
  }
}

But I don't think we have the terms to do this right now.

mbjones commented 4 years ago

@amoeba Thanks, Bryce. I agree about the @id for the execution instance. I seriously considered making it a blank node by omitting the @id because people often don't track executions. They do, however, track execution times and other properties, so it would be nice to have something to hang those properties on, and to differentiate multiple executions of the same script (especially for model runs, etc.). But there's been a lot of discussion in this group about avoiding blank nodes, so I thought it prudent to put in some stand-in for the execution identifier. I would prefer to leave it out, though.

mbjones commented 4 years ago

@PaoloMissier @ludaesch do you have any thoughts on this issue discussing provenance representation in schema.org and PROV/ProvONE? See in particular: https://github.com/ESIPFed/science-on-schema.org/issues/72#issuecomment-664717609 and the comments that follow.

mpsaloha commented 4 years ago

I think this looks good, Matt!

I second avoiding blank-nodes except in the case where we could rarely if ever imagine wanting to reference that blank-node's graph in some other context.

I was curious why "schema:isBasedOn" is not also recommended for the case where "prov:wasRevisionOf" is used, given that the definition of "schema:isBasedOn" is so broad:

A resource from which this work is derived or from which it is a modification or adaption.

and as "prov:wasRevisionOf" is a subproperty (rdfs:subPropertyOf) of "prov:wasDerivedFrom", it seems that "schema:isBasedOn" is also appropriate for describing this type of "modification that retains substantial content from the original entity" (sensu prov:wasRevisionOf).

Similarly, in the diagram "Indicating a software workflow or processing activity: prov:used and prov:wasGeneratedBy", the "prov:wasRevisionOf" would also seem to fit this template and might be added to the diagram (shown as a subproperty?), where its potential representation by the "schema:isBasedOn" predicate could be depicted?

So it might be useful at least to clarify in the text that "prov:wasRevisionOf" is a subproperty of "prov:wasDerivedFrom", and possibly as well that the former could also be represented by "schema:isBasedOn"?

I thought about the statement that "schema:isBasedOn" is an owl:equivalentProperty with "prov:wasDerivedFrom". I think this might be a bit of an overstep, as I'm not sure their extensions would be identical. I feel that "schema:isBasedOn" is a bit broader. For example, I would be comfortable saying that the movie "West Side Story" schema:isBasedOn the book "Romeo & Juliet", but would be less comfortable asserting that the movie "West Side Story" prov:wasDerivedFrom the book "Romeo & Juliet". I would be comfortable saying the movie "West Side Story" prov:wasInfluencedBy (i.e. the superproperty of "prov:wasDerivedFrom") the book "Romeo & Juliet".

Anyhow, just some thoughts, and hoping I'm not dancing on the head of a pin.

thanks, Mark


mbjones commented 4 years ago

Thanks for the comments @mpsaloha. I was equating schema:isBasedOn with prov:wasDerivedFrom, whereas we have interpreted the subproperty prov:wasRevisionOf to specialize the property to the narrower case where the new entity is both derived from the original AND it represents a new version of the same entity. So, all revisions are derivations, but not all derivations are revisions. In DataONE, we interpret prov:wasRevisionOf to mean that the new object is meant to explicitly replace the original version, and is wholly substitutable for the original. We use that to hide older versions of Datasets in search results. And there are definitely broader uses of prov:wasDerivedFrom, such as when two data sources are combined into an integrated whole, but the new Dataset is not meant to replace the original per se. So I'd like us to be able to express both the derived from and replaces semantics from PROV.
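The "all revisions are derivations, but not all derivations are revisions" point can be sketched in a few lines of Python. This is only an illustration of subproperty entailment and of the "hide replaced versions" interpretation described above; the function names `infer` and `superseded` and the `ex:` identifiers are hypothetical, and this is not how DataONE search is actually implemented.

```python
# Subproperty chain as defined in PROV-O:
# wasRevisionOf is a subproperty of wasDerivedFrom, which is a subproperty
# of wasInfluencedBy.
SUBPROPERTY_OF = {
    "prov:wasRevisionOf": "prov:wasDerivedFrom",
    "prov:wasDerivedFrom": "prov:wasInfluencedBy",
}


def infer(triples):
    """Return the input triples plus those entailed by rdfs:subPropertyOf."""
    inferred = set(triples)
    for s, p, o in triples:
        # Walk up the chain so a revision also counts as a derivation.
        while p in SUBPROPERTY_OF:
            p = SUBPROPERTY_OF[p]
            inferred.add((s, p, o))
    return inferred


def superseded(triples):
    """Datasets that another dataset explicitly revises (and so replaces)."""
    return {o for s, p, o in triples if p == "prov:wasRevisionOf"}


triples = [
    ("ex:Dataset-2", "prov:wasRevisionOf", "ex:Dataset-1"),   # new version
    ("ex:Dataset-3", "prov:wasDerivedFrom", "ex:Dataset-2"),  # integration
]

print(sorted(infer(triples)))
print(superseded(triples))
```

Only ex:Dataset-1 is flagged for hiding. ex:Dataset-2 stays visible even though ex:Dataset-3 was derived from it, because a plain derivation carries no replacement semantics.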

I looked for a subproperty in SO that was equivalent to prov:wasRevisionOf, and didn't find a match. There could be one though. The closest thing I could find is that there is schema:UpdateAction which is meant to explicitly be an action in which the schema:result replaces the schema:object, but because these same properties are used in all schema:Action classes, such as schema:CreateAction, the interpretation of the schema:result as a "replacement" only applies within the context of schema:UpdateAction. So, I couldn't find a dedicated subproperty indicating replacement semantics in SO, and I left the prov:wasRevisionOf for the time being. Maybe there's another approach.

mpsaloha commented 4 years ago

Hi Matt,

My comments interlaced below...

> Thanks for the comments @mpsaloha https://github.com/mpsaloha. I was equating schema:isBasedOn with prov:wasDerivedFrom, whereas we have interpreted the subproperty prov:wasRevisionOf to specialize the property to the narrower case where the new entity is both derived from the original AND it represents a new version of the same entity. So, all revisions are derivations, but not all derivations are revisions.

Agreed. Since "prov:wasRevisionOf" is a subproperty of "prov:wasDerivedFrom", its extension is included in, but narrower than, the extension of its superproperty. Hope it was clear that I agree with this interpretation.

> In DataONE, we interpret prov:wasRevisionOf to mean that the new object is meant to explicitly replace the original version, and is wholly substitutable for the original.

Ah, this specializes a bit on the definition of "prov:wasRevisionOf" as described in the PROV specs, where it is recommended for when some entity is a modification of, but "contains substantial content" of, its precursor (and doesn't specify "replaces").

> We use that to hide older versions of Datasets in search results. And there are definitely broader uses of prov:wasDerivedFrom, such as when two data sources are combined into an integrated whole, but the new Dataset is not meant to replace the original per se. So I'd like us to be able to express both the derived from and replaces semantics from PROV.

I see, thanks for clarifying! I think the "prov:qualifiedRevision" could be used to indicate this "replaces" function, resembling example (44) in https://www.w3.org/TR/prov-o/

> I looked for a subproperty in SO that was equivalent to prov:wasRevisionOf, and didn't find a match.

Yes, I looked too and couldn't find one.

> There could be one though. The closest thing I could find is that there is schema:UpdateAction https://schema.org/UpdateAction which is meant to explicitly be an action in which the schema:result replaces the schema:object, but because these same properties are used in all schema:Action classes, such as schema:CreateAction https://schema.org/CreateAction, the interpretation of the schema:result as a "replacement" only applies within the context of schema:UpdateAction. So, I couldn't find a dedicated subproperty indicating replacement semantics in SO, and I left the prov:wasRevisionOf for the time being. Maybe there's another approach.

Thanks for pointing this out, and I agree it doesn't seem to fit the bill as is-- although it might work as a triple of "schema:replaceAction" within a "prov:qualifiedRevision" pattern (v. examples 44 and 62 in prov-o). But this complicates things a bit. I guess my main concerns were 1) formally stating that schema:isBasedOn is an owl:equivalentProperty of prov:wasDerivedFrom due to potentially non-congruent extensions; and 2) that as prov:wasRevisionOf is a subproperty of prov:wasDerivedFrom, the schema:isBasedOn is also suitable for describing it. But you've described how you also want prov:wasRevisionOf to strongly indicate "replaces earlier version". Nevertheless, schema:isBasedOn would remain true even in this case-- just less constraining?

cheers, Mark


datadavev commented 4 years ago

Perhaps SO:ReplaceAction (The act of editing a recipient by replacing an old object with a new object) with its replacee and replacer corresponds with prov:wasRevisionOf? Though the description doesn't necessarily mean the replacer is a revision, could be just a new instance.

mpsaloha commented 4 years ago

Dave-- I think we'd then lose the notion of the replacer being a revision_of rather than simply substitute_for the replacee. The example they give of changing movies is very different from the notion that the derived entity contains significant components of the original entity.

Mark


datadavev commented 4 years ago

Yep, agreed. The act of substitution is clear, but the notion that the replacement is a revision of the replacee is not.


rduerr commented 4 years ago

Of all of the options above the one that is most human understandable is:

{
  "@context": {
    "@vocab": "https://schema.org/",
    "prov": "http://www.w3.org/ns/prov#",
    "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#"
  },
  "@id": "https://doi.org/10.xxxx/Dataset-2",
  "@type": "Dataset",
  "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016",
  "prov:wasDerivedFrom": "https://doi.org/10.xxxx/Dataset-1",
  "isBasedOn": "https://doi.org/10.xxxx/Dataset-1",
  "prov:wasGeneratedBy": {
    "@id": "https://example.org/executions/execution-42",
    "@type": "provone:Execution",
    "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R",
    "prov:used": "https://doi.org/10.xxxx/Dataset-1"
  }
}

And I think an @id is necessary because I can see people needing to query for every dataset/product that used a particular commonly used script as part of a processing chain, when that script is found to have a bug in it that requires reprocessing everything that used it.
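The reprocessing scenario could look roughly like the following Python sketch, which scans records shaped like the example above and reports every dataset whose generating execution had a given script as its plan. It assumes the records are already parsed into dictionaries with the nested prov:wasGeneratedBy form; the function name `datasets_using_script` and the second sample record are hypothetical.

```python
def datasets_using_script(records, script_uri):
    """Return (dataset @id, execution @id) pairs for executions run from script_uri."""
    hits = []
    for record in records:
        execution = record.get("prov:wasGeneratedBy")
        # Only nested execution nodes carry a plan; bare URI strings are skipped.
        if isinstance(execution, dict) and execution.get("prov:hadPlan") == script_uri:
            # The execution @id is None when the node was left blank.
            hits.append((record["@id"], execution.get("@id")))
    return hits


SCRIPT = "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R"

records = [
    {
        "@id": "https://doi.org/10.xxxx/Dataset-2",
        "@type": "Dataset",
        "prov:wasGeneratedBy": {
            "@id": "https://example.org/executions/execution-42",
            "@type": "provone:Execution",
            "prov:hadPlan": SCRIPT,
            "prov:used": "https://doi.org/10.xxxx/Dataset-1",
        },
    },
    {
        # Hypothetical second dataset produced by a different script.
        "@id": "https://doi.org/10.xxxx/Dataset-9",
        "@type": "Dataset",
        "prov:wasGeneratedBy": {
            "@id": "https://example.org/executions/execution-43",
            "@type": "provone:Execution",
            "prov:hadPlan": "https://example.org/scripts/other-script.R",
        },
    },
]

print(datasets_using_script(records, SCRIPT))
```

With identified executions, the result also says which run produced each affected dataset, which is what makes targeted reprocessing possible. With blank nodes, only the dataset identifiers would be recoverable.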

smrgeoinfo commented 4 years ago

I edited @rduerr 's example to preserve the formatting in the JSON, no content change.

+1 for that encoding approach

mbjones commented 4 years ago

Discussed the proposal and ADR during the SOSO call on Aug 3. General consensus that the use of PROV-O and ProvONE predicates was preferred because of their increased semantic precision. We agreed to move towards approving the ADR, but will give people another week or so to comment. @mbjones will prepare a PR with minor revisions shortly thereafter.

rduerr commented 4 years ago

I looked at the ADR and updated text - looks good to me.

mbjones commented 4 years ago

Uploaded the current ProvONE OWL file to COR for better community visibility and navigation. See: http://cor.esipfed.org/ont?iri=http://purl.dataone.org/provone/2015/01/15/ontology%23

mbjones commented 3 years ago

PR #134 merged the accepted provenance features into develop, and they will now be included in the release, so I'm closing this issue.

ESIPFed / science-on-schema.org

Intention of adding provenance of DOIs with isBasedOn? #72
