Closed gothub closed 3 years ago
When reviewing the v1.2.0 release, I see that funding and PROV info have yet to be included in the indexing.
@mbjones @datadavev should these two items be included now in the indexing?
Items to consider regarding inclusion of funding info:
- is the representation of funding ready or still undergoing revisions?
- the example in the v1.2.0 guide uses SO:fundedItem to link the SO:Dataset id to funding info
- the v1.2.0 full example uses a different method, using SO:funder to link the SO:Dataset id to funding info
- a live example from BCODMO uses a similar method to the full example, using SO:funder to link the SO:Dataset id to funding info
- should just the method used in the full example be used?
Items to consider regarding inclusion of PROV info:
these Solr fields could be indexed directly now, if I'm interpreting the guide correctly:
these Solr fields aren't mentioned directly in the SOSO v1.2.0 guide:
So, should the items that could be added now (the first section of the list) be added to the index?
Thanks @gothub for the writeup. Overall, yes, I think those things should be included. As to the specific issues you raised:
- SO:fundedItem versus SO:funder: that sounds like a bug in the guidelines that we will need to clarify, and once we do, we should follow that approach
- SO:isBasedOn is a synonym of prov:wasDerivedFrom, so we should be parsing for both
- prov:used, prov:wasGeneratedBy, and prov:hadPlan, and these should be able to be used to infer their inverse properties like prov:generated and prov:hasDerivations.
- prov_usedByProgram and prov_usedByExecution
- prov:hadPlan is an instance of provone:Execution, and the object is an instance of provone:Program, etc.
- prov_wasRevisionOf in the obsoletes SOLR field, and its inverse is in obsoletedBy
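As a rough illustration of the "parse for both" point above, here is a minimal Python sketch that treats isBasedOn and prov:wasDerivedFrom as synonyms when collecting derivation sources. The function names, the accepted key spellings, and the sample document are assumptions made for illustration; this is not the indexer's actual code, and it reads the raw JSON rather than expanded JSON-LD.

```python
# Hypothetical helper names; a sketch only, not the real indexing code.

# Key spellings that are treated as equivalent here (an assumption):
# the prefixed PROV term plus schema.org isBasedOn with and without a prefix.
DERIVATION_KEYS = ("prov:wasDerivedFrom", "isBasedOn", "schema:isBasedOn")


def as_ids(value):
    """Normalize a property value to a list of IRI strings."""
    if value is None:
        return []
    items = value if isinstance(value, list) else [value]
    ids = []
    for item in items:
        if isinstance(item, dict) and "@id" in item:
            ids.append(item["@id"])
        elif isinstance(item, str):
            ids.append(item)
    return ids


def derived_from(dataset):
    """Collect derivation sources from every synonym, without duplicates."""
    sources = []
    for key in DERIVATION_KEYS:
        for iri in as_ids(dataset.get(key)):
            if iri not in sources:
                sources.append(iri)
    return sources


# Made-up sample document that states the same source under both properties.
sample = {
    "@id": "https://doi.org/10.xxxx/Dataset-2",
    "@type": "Dataset",
    "prov:wasDerivedFrom": {"@id": "https://doi.org/10.xxxx/Dataset-1"},
    "schema:isBasedOn": {"@id": "https://doi.org/10.xxxx/Dataset-1"},
}

print(derived_from(sample))
```

Because both properties name the same source, the list holds a single entry; a document that uses only one of the two spellings yields the same result.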
@mbjones populating the fields prov_wasDerivedFrom, prov_generatedByExecution, prov_generatedByProgram, and prov_instanceOfClass (for provone:Data only) has been added to indexing (PR will be submitted soon). These fields make sense to me as they involve the SO:Dataset being indexed and have the Dataset id as the subject of these relationships.
It's not clear to me how to include prov:used relationships in a SO:Dataset description, as there isn't an inverse relationship that would put the SO:Dataset id in the subject position. Here is a hypothetical example that would use this unavailable relationship:
{
  ...
  "@id": "http://lod.example-data-repository.org/id/dataset/3300",
  "url": "https://www.example-data-repository.org/dataset/3300",
  "@type": "Dataset",
  ...
  "prov:wasUsedBy": {
    "@id": "https://example.org/executions/execution-42",
    "@type": "provone:Execution",
    "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R"
  }
}
Also, would we ever have a prov_instanceOfClass other than provone:Data? Could a SO:Dataset be anything else?
I'm not following what the issue is, @gothub. Is it that we don't index the 'used' relationships at all? prov:used is the inverse property of prov:wasUsedBy, and so, either way, we can extract the relationship between an input dataset, the program that processed it, and the output dataset. Here's the full example we wrote for the SOSO guidelines:
{
  "@context": {
    "@vocab": "https://schema.org/",
    "prov": "http://www.w3.org/ns/prov#",
    "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#"
  },
  "@id": "https://doi.org/10.xxxx/Dataset-2",
  "@type": "Dataset",
  "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016",
  "prov:wasDerivedFrom": { "@id": "https://doi.org/10.xxxx/Dataset-1" },
  "schema:isBasedOn": { "@id": "https://doi.org/10.xxxx/Dataset-1" },
  "prov:wasGeneratedBy": {
    "@id": "https://example.org/executions/execution-42",
    "@type": "provone:Execution",
    "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R",
    "prov:used": { "@id": "https://doi.org/10.xxxx/Dataset-1" }
  }
}
You should be able to extract that Dataset-2 was derived from Dataset-1, and that it was generated by execution-42, which used Dataset-1 and followed the program process-script.R. If there aren't fields for each of those relations in SOLR, maybe we should consider adding them. Can you clarify where the problem lies?
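The extraction described above could be sketched roughly as follows in Python, using only the standard library. The mapping of prov:wasGeneratedBy and prov:hadPlan to prov_generatedByExecution and prov_generatedByProgram is my reading of the field names in this thread, the key execution_inputs is a hypothetical placeholder rather than a known Solr field, and the function names are invented for illustration. It reads the raw JSON keys and does not do real JSON-LD expansion.

```python
import json

# The full example from this thread, embedded so the sketch is self-contained.
FULL_EXAMPLE = """
{
  "@context": {
    "@vocab": "https://schema.org/",
    "prov": "http://www.w3.org/ns/prov#",
    "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#"
  },
  "@id": "https://doi.org/10.xxxx/Dataset-2",
  "@type": "Dataset",
  "prov:wasDerivedFrom": { "@id": "https://doi.org/10.xxxx/Dataset-1" },
  "schema:isBasedOn": { "@id": "https://doi.org/10.xxxx/Dataset-1" },
  "prov:wasGeneratedBy": {
    "@id": "https://example.org/executions/execution-42",
    "@type": "provone:Execution",
    "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R",
    "prov:used": { "@id": "https://doi.org/10.xxxx/Dataset-1" }
  }
}
"""


def node_id(value):
    """Return the IRI of a node reference given as {"@id": ...} or a string."""
    if isinstance(value, dict):
        return value.get("@id")
    return value


def prov_fields(dataset):
    """Map the PROV statements on a Dataset to index field values (a sketch)."""
    fields = {"prov_wasDerivedFrom": node_id(dataset.get("prov:wasDerivedFrom"))}
    execution = dataset.get("prov:wasGeneratedBy")
    if isinstance(execution, dict):
        fields["prov_generatedByExecution"] = execution.get("@id")
        fields["prov_generatedByProgram"] = node_id(execution.get("prov:hadPlan"))
        # Hypothetical key, not a field name taken from this thread.
        fields["execution_inputs"] = node_id(execution.get("prov:used"))
    # Drop relations that the document does not state.
    return {name: value for name, value in fields.items() if value}


for name, value in prov_fields(json.loads(FULL_EXAMPLE)).items():
    print(name, "=", value)
```

Reading the nested execution node directly keeps the sketch short; a real indexer would more likely expand the document first so that prefixed and unprefixed property names resolve to the same IRIs.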
@mbjones I just want to verify that this is the correct way to specify derived Datasets, for example, if the following were added to the example you gave above:
"prov:wasUsedBy": {
  "@id": "https://example.org/executions/execution-101",
  "@type": "provone:Execution",
  "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-101/process-script.R",
  "prov:generated": "https://somerepository.org/datasets/10.xxxx/Dataset-101"
}
From this snippet, the following Solr fields can be derived:
OK, yeah, I follow where you are going. And yes, I think what you concluded is fine, except for the derivation part. Let's discuss what is meant by derivation. The workflow we have described is basically:
           usedBy               generated             usedBy                generated
Dataset-1 -------> execution-42 ---------> Dataset-2 -------> execution-101 ---------> Dataset-101
That execution-42 used Dataset-1 does not necessarily imply that Dataset-2 prov:wasDerivedFrom Dataset-1. This is why we had that explicit assertion in the dataset (a computation can use an input for other things than generating a particular output). So, in your case, I think that in order to index Dataset-101 in the prov_hasDerivations field, an explicit statement must be included (in this case, expressed using the inverse property):
"prov:hadDerivation": "https://somerepository.org/datasets/10.xxxx/Dataset-101"
Once that is in place, you can include Dataset-101 in the prov_hasDerivations field, but not if you only have the prov:generated statement.
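To make the rule concrete, here is a small Python sketch under the assumptions stated in this thread: prov_hasDerivations is filled only from an explicit prov:hadDerivation statement, while the prov:wasUsedBy node contributes the execution and program fields. The pairing of prov:wasUsedBy and prov:hadPlan with prov_usedByExecution and prov_usedByProgram is my reading of the field names, and the function names and sample documents are made up for illustration.

```python
def as_ids(value):
    """Normalize a property value to a list of IRI strings."""
    if value is None:
        return []
    items = value if isinstance(value, list) else [value]
    ids = []
    for item in items:
        if isinstance(item, dict) and "@id" in item:
            ids.append(item["@id"])
        elif isinstance(item, str):
            ids.append(item)
    return ids


def used_by_fields(dataset):
    """Derive the 'used by' fields and the explicit derivations (a sketch)."""
    fields = {
        "prov_usedByExecution": [],
        "prov_usedByProgram": [],
        # Only the explicit assertion counts; prov:generated is ignored here.
        "prov_hasDerivations": as_ids(dataset.get("prov:hadDerivation")),
    }
    executions = dataset.get("prov:wasUsedBy")
    if isinstance(executions, dict):
        executions = [executions]
    for execution in executions or []:
        if isinstance(execution, dict):
            fields["prov_usedByExecution"] += as_ids(execution.get("@id"))
            fields["prov_usedByProgram"] += as_ids(execution.get("prov:hadPlan"))
    return fields


DATASET_101 = "https://somerepository.org/datasets/10.xxxx/Dataset-101"

# Dataset-2 with only the execution that used it and what that execution generated.
without_assertion = {
    "@id": "https://doi.org/10.xxxx/Dataset-2",
    "prov:wasUsedBy": {
        "@id": "https://example.org/executions/execution-101",
        "@type": "provone:Execution",
        "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-101/process-script.R",
        "prov:generated": DATASET_101,
    },
}

# The same description with the explicit derivation statement added.
with_assertion = dict(without_assertion, **{"prov:hadDerivation": DATASET_101})

print(used_by_fields(without_assertion)["prov_hasDerivations"])
print(used_by_fields(with_assertion)["prov_hasDerivations"])
```

The first call yields an empty list because prov:generated alone does not assert a derivation; the second yields the Dataset-101 IRI.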
@mbjones thanks for the clarification, I'll update the indexing for prov_hasDerivations.
The remaining PROV relationships have been added in commit 1e8749ce8908f152589d904ca1f76d3086750083
Verify that the parsing performed by the JsonLdSubprocessor handles the guidance outlined in the v1.2.0 release:
One potential item: