DataONEorg / d1_cn_index_processor

The CN index processor component

Verify schema.org indexing compatibility with SOSO v1.2.0 #10

Closed gothub closed 3 years ago

gothub commented 3 years ago

Verify that the parsing performed by the JsonLdSubprocessor handles the guidance outlined in the v1.2.0 release:

One potential item:

gothub commented 3 years ago

When reviewing the v1.2.0 release, I see that funding and PROV info have yet to be included in the indexing.

@mbjones @datadavev should these two items be included now in the indexing?

Items to consider regarding inclusion of funding info:

So, should the items that could be added now (the first section of the list) be added to the index?

mbjones commented 3 years ago

Thanks @gothub for the writeup. Overall, yes, I think those things should be included. As to the specific issues you raised:

gothub commented 3 years ago

@mbjones the fields prov_wasDerivedFrom, prov_generatedByExecution, prov_generatedByProgram, and prov_instanceOfClass (for provone:Data only) are now populated during indexing (a PR will be submitted soon). These fields make sense to me because they involve the SO:Dataset being indexed and have the Dataset id as the subject of the relationship.

It's not clear to me how to include prov:used relationships in a SO:Dataset description, as there isn't an inverse relationship that would have the SO:Dataset id as the subject. Here is a hypothetical example that would use this unavailable relationship:

{
...
  "@id": "http://lod.example-data-repository.org/id/dataset/3300",
  "url": "https://www.example-data-repository.org/dataset/3300",
  "@type": "Dataset",
...
  "prov:wasUsedBy": {
    "@id": "https://example.org/executions/execution-42",
    "@type": "provone:Execution",
    "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R"
  }
}

Also, would we ever have a prov_instanceOfClass other than provone:Data? Could a SO:Dataset be anything else?

mbjones commented 3 years ago

I'm not following what the issue is, @gothub. Is it that we don't index the 'used' relationships at all? prov:used is the inverse property of prov:wasUsedBy, and so, either way, we can extract the relationship between an input dataset, the program that processed it, and the output dataset. Here's the full example we wrote for the SOSO guidelines:

{
  "@context": {
    "@vocab": "https://schema.org/",
    "prov": "http://www.w3.org/ns/prov#",
    "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#"
  },
  "@id": "https://doi.org/10.xxxx/Dataset-2",
  "@type": "Dataset",
  "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016",
  "prov:wasDerivedFrom": { "@id": "https://doi.org/10.xxxx/Dataset-1" },
  "schema:isBasedOn": { "@id": "https://doi.org/10.xxxx/Dataset-1" },
  "prov:wasGeneratedBy": 
      {
        "@id": "https://example.org/executions/execution-42",
        "@type": "provone:Execution",
        "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R",
        "prov:used": { "@id": "https://doi.org/10.xxxx/Dataset-1" }
      }
}

You should be able to extract that Dataset-2 was derived from Dataset-1, and that it was generated by execution-42, which used Dataset-1 and followed the program process-script.R. If there aren't fields for each of those relations in SOLR, maybe we should consider adding them. Can you clarify where the problem lies?
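
As a rough illustration of the extraction described above, here is a minimal Python sketch that walks the compact JSON-LD directly. It is not how the JsonLdSubprocessor works internally; it assumes the `prov:` prefix appears exactly as in the example, that `prov:hadPlan` maps to prov_generatedByProgram, and the helper names and the `inputs_of_generating_execution` key are hypothetical (the thread does not name a Solr field for that relation).

```python
import json

# Trimmed copy of the example above ("name" and "schema:isBasedOn" omitted).
DATASET_2 = json.loads("""
{
  "@id": "https://doi.org/10.xxxx/Dataset-2",
  "@type": "Dataset",
  "prov:wasDerivedFrom": { "@id": "https://doi.org/10.xxxx/Dataset-1" },
  "prov:wasGeneratedBy": {
    "@id": "https://example.org/executions/execution-42",
    "@type": "provone:Execution",
    "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R",
    "prov:used": { "@id": "https://doi.org/10.xxxx/Dataset-1" }
  }
}
""")


def as_list(value):
    """Normalize a JSON-LD value (missing, single, or array) to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def node_id(value):
    """Return the @id of a node object, or the value itself if it is a plain string."""
    return value.get("@id") if isinstance(value, dict) else value


def extract_prov(dataset):
    """Collect the relations that have the dataset being indexed as their subject."""
    fields = {
        "prov_wasDerivedFrom": [
            node_id(v) for v in as_list(dataset.get("prov:wasDerivedFrom"))
        ],
        "prov_generatedByExecution": [],
        "prov_generatedByProgram": [],
        # Hypothetical key: inputs used by the execution that generated this dataset.
        "inputs_of_generating_execution": [],
    }
    for execution in as_list(dataset.get("prov:wasGeneratedBy")):
        fields["prov_generatedByExecution"].append(node_id(execution))
        if isinstance(execution, dict):
            fields["prov_generatedByProgram"] += [
                node_id(p) for p in as_list(execution.get("prov:hadPlan"))
            ]
            fields["inputs_of_generating_execution"] += [
                node_id(u) for u in as_list(execution.get("prov:used"))
            ]
    return fields


fields = extract_prov(DATASET_2)
```

A real indexer would more likely expand the JSON-LD to RDF first so that prefix choices in the document do not matter; the dictionary walk here only shows which statements feed which field.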

gothub commented 3 years ago

@mbjones I just want to verify that this is the correct way to specify derived Datasets, for example, if the following were added to the example you gave above:

"prov:wasUsedBy": {
    "@id": "https://example.org/executions/execution-101",
    "@type": "provone:Execution",
    "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-101/process-script.R",
    "prov:generated": "https://somerepository.org/datasets/10.xxxx/Dataset-101"
  }

From this snippet, the following Solr fields can be derived:

mbjones commented 3 years ago

OK, yeah, I follow where you are going. And yes, I think what you concluded is fine, except for the derivation part. Let's discuss what is meant by derivation. The workflow we have described is basically:

           usedBy               generated            usedBy                 generated
Dataset-1 -------> execution-42 ---------> Dataset-2 -------> execution-101 ---------> Dataset-101

That execution-42 used Dataset-1 does not necessarily imply that Dataset-2 prov:wasDerivedFrom Dataset-1. This is why we had that explicit assertion in the dataset (a computation can use an input for other things than generating a particular output). So, in your case, I think that in order to index Dataset-101 in the prov_hasDerivations field, an explicit statement must be included (in this case, expressed using the inverse property):

"prov:hadDerivation": "https://somerepository.org/datasets/10.xxxx/Dataset-101"

Once that is in place, you can include Dataset-101 in the prov_hasDerivations field, but not if you only have the prov:generated statement.
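
The rule can be sketched in a few lines of Python. This is only an illustration under assumptions: it reads compact JSON-LD with the `prov:` prefix as written in this thread, and the function and variable names are hypothetical rather than taken from the index processor.

```python
def as_list(value):
    """Normalize a JSON-LD value (missing, single, or array) to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def node_id(value):
    """Return the @id of a node object, or the value itself if it is a plain string."""
    return value.get("@id") if isinstance(value, dict) else value


def has_derivations(dataset):
    """Values for prov_hasDerivations: explicit prov:hadDerivation statements only.

    Outputs reachable through prov:wasUsedBy -> prov:generated are deliberately
    ignored, because an execution using this dataset does not imply derivation.
    """
    return [node_id(v) for v in as_list(dataset.get("prov:hadDerivation"))]


DATASET_101 = "https://somerepository.org/datasets/10.xxxx/Dataset-101"

# Only the execution chain is described: nothing is indexed as a derivation.
implicit_only = {
    "@id": "https://doi.org/10.xxxx/Dataset-2",
    "prov:wasUsedBy": {
        "@id": "https://example.org/executions/execution-101",
        "prov:generated": DATASET_101,
    },
}

# Same description plus the explicit inverse-property statement.
with_explicit = dict(implicit_only, **{"prov:hadDerivation": DATASET_101})

implicit_result = has_derivations(implicit_only)
explicit_result = has_derivations(with_explicit)
```

With these inputs, `implicit_result` is empty and `explicit_result` contains only the Dataset-101 identifier.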

gothub commented 3 years ago

@mbjones thanks for the clarification, I'll update the indexing for prov_hasDerivations.

gothub commented 3 years ago

The remaining PROV relationships have been added in commit 1e8749ce8908f152589d904ca1f76d3086750083