Verify schema.org indexing compatibility with SOSO v1.2.0

gothub commented 3 years ago

Verify that the parsing performed by the JsonLdSubprocessor handles the guidance outlined in the v1.2.0 release:

One potential item:

verify geospatial coverages are parsed correctly

gothub commented 3 years ago

When reviewing the v1.2.0 release, I see that funding and PROV info have yet to be included in the indexing.

@mbjones @datadavev should these two items be included now in the indexing?

Items to consider regarding inclusion of funding info:

- is the representation of funding ready or still undergoing revisions?
- the example in the v1.2.0 guide uses SO:fundedItem to link the SO:Dataset id to funding info
- the v1.2.0 full example uses a different method, using SO:funder to link SO:Dataset id to funding info
- a live example from BCODMO uses a similar method to the full example, using SO:funder to link SO:Dataset id to funding info
- should just the method used in the full example be used?

Items to consider regarding inclusion of PROV info:
these Solr fields could be indexed directly now, if I'm interpreting the guide correctly:
- prov_wasDerivedFrom
- prov_generatedByExecution
- prov_generatedByProgram
these Solr fields aren't mentioned directly in SOSO v1.2.0 guide:
- prov_generated
- prov_used
- prov_hasDerivations
- prov_hasSources
- prov_usedByProgram
- prov_usedByExecution
- prov_wasInformedBy
- prov_generatedByUser
- prov_wasExecutedByUser
- prov_usedByUser
- prov_instanceOfClass
these Solr fields don't currently exist, but are in the SOSO guide:
- prov_wasRevisionOf

So, should the items that could be added now (the first section of the list) be added to the index?

mbjones commented 3 years ago

Thanks @gothub for the writeup. Overall, yes, I think those things should be included. As to the specific issues you raised:

SO:fundedItem versus SO:funder: that sounds like a bug in the guidelines that we will need to clarify, and once we do, we should follow that approach
For provenance:
- Note that SO:isBasedOn is a synonym of prov:wasDerivedFrom so we should be parsing for both
- The SOSO guidelines definitely discuss prov:used, prov:wasGeneratedBy, and prov:hadPlan, and these should be able to be used to infer their inverse properties like prov:generated and prov:hasDerivations.
- In SOSO we made the explicit decision to simplify the relationship between Execution and Program, but it should be able to provide info for the fields prov_usedByProgram and prov_usedByExecution
- The informedBy and *ByUser fields probably don't have corresponding recommendations, but in theory someone could add additional triples including the user info for executions if they wanted to, even though it isn't listed in SOSO
- instanceOfClass should be inferred from its usage (e.g., the subject of prov:hadPlan is an instance of provone:Execution, and the object is an instance of provone:Program, etc.)
- in SOLR, we store prov_wasRevisionOf in the obsoletes SOLR field, and its inverse is in obsoletedBy
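
As a rough illustration of two of the points above (treating SO:isBasedOn as a synonym of prov:wasDerivedFrom, and inferring classes from the usage of prov:hadPlan), here is a minimal Python sketch. It is not the JsonLdSubprocessor's actual code: the function names are hypothetical, and it assumes the description is available as a compacted JSON-LD dictionary whose keys look like the ones used in this thread. A real indexer would more likely work on the expanded RDF graph.

```python
def as_ids(value):
    """Normalize a JSON-LD value (string, {"@id": ...}, or a list of either) to a list of ids."""
    if value is None:
        return []
    items = value if isinstance(value, list) else [value]
    return [item["@id"] if isinstance(item, dict) else item for item in items]


def derived_from(dataset):
    """Collect prov_wasDerivedFrom values from either synonym, without duplicates.

    The key spellings checked here are assumptions about how the compacted
    document names the two properties.
    """
    found = []
    for key in ("prov:wasDerivedFrom", "isBasedOn", "schema:isBasedOn"):
        for identifier in as_ids(dataset.get(key)):
            if identifier not in found:
                found.append(identifier)
    return found


def infer_classes(execution):
    """Infer class membership from usage: the subject of prov:hadPlan is an
    Execution and each object is a Program."""
    classes = {}
    programs = as_ids(execution.get("prov:hadPlan"))
    if programs and execution.get("@id") is not None:
        classes[execution["@id"]] = "provone:Execution"
    for program in programs:
        classes[program] = "provone:Program"
    return classes


dataset = {
    "@id": "https://doi.org/10.xxxx/Dataset-2",
    "prov:wasDerivedFrom": {"@id": "https://doi.org/10.xxxx/Dataset-1"},
    "isBasedOn": {"@id": "https://doi.org/10.xxxx/Dataset-1"},
}
print(derived_from(dataset))
```

Both properties point at the same source dataset here, so the sketch reports it once rather than twice.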

gothub commented 3 years ago

@mbjones populating the fields prov_wasDerivedFrom, prov_generatedByExecution, prov_generatedByProgram, and prov_instanceOfClass (for provone:Data only) has been added to indexing (PR will be submitted soon). These fields make sense to me, as they involve the SO:Dataset being indexed and have the Dataset id as the subject of these relationships.

It's not clear to me how to include prov:used relationships in a SO:Dataset description as there isn't an inverse relationship to include the SO:Dataset id as the subject. Here is a hypothetical example that would use this unavailable relationship:

{
...
  "@id": "http://lod.example-data-repository.org/id/dataset/3300",
  "url": "https://www.example-data-repository.org/dataset/3300",
  "@type": "Dataset",
...
  "prov:wasUsedBy": {
   "@id": "https://example.org/executions/execution-42",
   "@type": "provone:Execution",
   "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R"
  }
}

Also, would we ever have a prov_instanceOfClass other than provone:Data? Could a SO:Dataset be anything else?

mbjones commented 3 years ago

I'm not following what the issue is, @gothub. Is it that we don't index the 'used' relationships at all? prov:used is the inverse property of prov:wasUsedBy, and so, either way, we can extract the relationship between an input dataset, the program that processed it, and the output dataset. Here's the full example we wrote for the SOSO guidelines:

{
  "@context": {
    "@vocab": "https://schema.org/",
    "prov": "http://www.w3.org/ns/prov#",
    "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#"
  },
  "@id": "https://doi.org/10.xxxx/Dataset-2",
  "@type": "Dataset",
  "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016",
  "prov:wasDerivedFrom": { "@id": "https://doi.org/10.xxxx/Dataset-1" },
  "isBasedOn": { "@id": "https://doi.org/10.xxxx/Dataset-1" },
  "prov:wasGeneratedBy": 
      {
        "@id": "https://example.org/executions/execution-42",
        "@type": "provone:Execution",
        "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R",
        "prov:used": { "@id": "https://doi.org/10.xxxx/Dataset-1" }
      }
}

You should be able to extract that Dataset-2 was derived from Dataset-1, and that it was generated by execution-42, which used Dataset-1 and followed the program process-script.R. If there aren't fields for each of those relations in SOLR, maybe we should consider adding them. Can you clarify where the problem lies?
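
To make that extraction concrete, here is a small Python sketch that walks a trimmed copy of the example and pulls out the relations just listed. It is only an illustration under stated assumptions: `provenance_fields` and `node_id` are hypothetical names, the `execution_used` key is invented for this sketch (the thread does not name a Solr field for that relation), and the code reads the compacted JSON-LD directly instead of querying an RDF graph.

```python
def node_id(value):
    """Return the identifier of a JSON-LD value given as a string or {"@id": ...}."""
    return value["@id"] if isinstance(value, dict) else value


def provenance_fields(dataset):
    """Map one Dataset node to the Solr-style fields discussed in this thread."""
    fields = {}
    if "prov:wasDerivedFrom" in dataset:
        fields["prov_wasDerivedFrom"] = node_id(dataset["prov:wasDerivedFrom"])
    execution = dataset.get("prov:wasGeneratedBy")
    if isinstance(execution, dict):
        fields["prov_generatedByExecution"] = execution["@id"]
        if "prov:hadPlan" in execution:
            fields["prov_generatedByProgram"] = node_id(execution["prov:hadPlan"])
        if "prov:used" in execution:
            # Hypothetical key: no Solr field for this relation is named in the thread.
            fields["execution_used"] = node_id(execution["prov:used"])
    return fields


# Trimmed copy of the SOSO example (name and @context omitted for brevity).
example = {
    "@id": "https://doi.org/10.xxxx/Dataset-2",
    "@type": "Dataset",
    "prov:wasDerivedFrom": {"@id": "https://doi.org/10.xxxx/Dataset-1"},
    "prov:wasGeneratedBy": {
        "@id": "https://example.org/executions/execution-42",
        "@type": "provone:Execution",
        "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R",
        "prov:used": {"@id": "https://doi.org/10.xxxx/Dataset-1"},
    },
}

for name, value in provenance_fields(example).items():
    print(name, value)
```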

gothub commented 3 years ago

@mbjones I just want to verify that this is the correct way to specify derived Datasets, for example, if the following were added to the example you gave above:

"prov:wasUsedBy": {
    "@id": "https://example.org/executions/execution-101",
    "@type": "provone:Execution",
    "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-101/process-script.R",
    "prov:generated": "https://somerepository.org/datasets/10.xxxx/Dataset-101"
  }

From this snippet, the following Solr fields can be derived:

prov_hasDerivations (https://somerepository.org/datasets/10.xxxx/Dataset-101)
prov_usedByProgram (https://somerepository.org/datasets/10.xxxx/Dataset-101/process-script.R)
prov_usedByExecution (https://example.org/executions/execution-101)

mbjones commented 3 years ago

OK, yeah, I follow where you are going. And yes, I think what you concluded is fine, except for the derivation part. Let's discuss what is meant by derivation. The workflow we have described is basically:

           usedBy               generated            usedBy                 generated
Dataset-1 -------> execution-42 ---------> Dataset-2 -------> execution-101 ---------> Dataset-101

That execution-42 used Dataset-1 does not necessarily imply that Dataset-2 prov:wasDerivedFrom Dataset-1. This is why we had that explicit assertion in the dataset (a computation can use an input for other things than generating a particular output). So, in your case, I think that in order to index Dataset-101 in the prov_hasDerivations field, an explicit statement must be included (in this case, expressed using the inverse property):

"prov:hadDerivation": "https://somerepository.org/datasets/10.xxxx/Dataset-101"

Once that is in place, you can include Dataset-101 in the prov_hasDerivations field, but not if you only have the prov:generated statement.
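
A hedged Python sketch of that rule, for illustration only: prov_usedByExecution and prov_usedByProgram are read from the prov:wasUsedBy node, while prov_hasDerivations is filled only from an explicit prov:hadDerivation statement and never from prov:generated. The function names are hypothetical, and the sketch assumes a compacted JSON-LD dictionary, not whatever representation the index processor really uses.

```python
def as_ids(value):
    """Normalize a JSON-LD value (string, {"@id": ...}, or a list of either) to a list of ids."""
    if value is None:
        return []
    items = value if isinstance(value, list) else [value]
    return [item["@id"] if isinstance(item, dict) else item for item in items]


def downstream_fields(dataset):
    """Derive the 'used by' fields and the explicitly asserted derivations."""
    fields = {
        "prov_usedByExecution": [],
        "prov_usedByProgram": [],
        "prov_hasDerivations": [],
    }
    executions = dataset.get("prov:wasUsedBy", [])
    if not isinstance(executions, list):
        executions = [executions]
    for execution in executions:
        if not isinstance(execution, dict):
            # A bare identifier carries no plan information.
            fields["prov_usedByExecution"].append(execution)
            continue
        fields["prov_usedByExecution"].append(execution["@id"])
        fields["prov_usedByProgram"].extend(as_ids(execution.get("prov:hadPlan")))
        # prov:generated is deliberately ignored: an execution that used this
        # dataset and generated another does not by itself assert a derivation.
    fields["prov_hasDerivations"] = as_ids(dataset.get("prov:hadDerivation"))
    return fields


used_only = {
    "@id": "https://doi.org/10.xxxx/Dataset-2",
    "prov:wasUsedBy": {
        "@id": "https://example.org/executions/execution-101",
        "@type": "provone:Execution",
        "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-101/process-script.R",
        "prov:generated": "https://somerepository.org/datasets/10.xxxx/Dataset-101",
    },
}
with_assertion = dict(
    used_only,
    **{"prov:hadDerivation": "https://somerepository.org/datasets/10.xxxx/Dataset-101"},
)

print(downstream_fields(used_only)["prov_hasDerivations"])
print(downstream_fields(with_assertion)["prov_hasDerivations"])
```

The first call leaves prov_hasDerivations empty because only prov:generated is present; the second picks up Dataset-101 from the explicit statement.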

gothub commented 3 years ago

@mbjones thanks for the clarification, I'll update the indexing for prov_hasDerivations.

gothub commented 3 years ago

The remaining PROV relationships have been added in commit 1e8749ce8908f152589d904ca1f76d3086750083

DataONEorg / d1_cn_index_processor

Verify schema.org indexing compatibility with SOSO v1.2.0 #10