void:inDataset

AlasdairGray commented 9 years ago

How should the void:inDataset property be used?

Currently we have it in the table as a SHOULD property for distribution-level descriptions, but this is misleading: the description should be pointed to from the data. How should we show this in the table?

The example in the guidance notes is wrong as it should be a specific resource in the dataset that is linked back to the description. The explanation should be extended to adequately explain the correct usage of the property.

This is not something that the validator can check.

micheldumontier commented 9 years ago

I see your point. I think it's OK to have it in the table, even though it is an inverse relation from a data item to a distribution URI. The guidance note should spell that out. Yes, the example needs to be fixed to point to an actual distribution URI.

AlasdairGray commented 9 years ago

Suggested actions

Move void:inDataset to its own table in Section 5 (It is really a usage of a dataset description)
Move text from Section 6.4.3 to Section 7.1 and revise text with specific example
Renumber Section 7.1 to Section 7.2

micheldumontier commented 9 years ago

Another related issue is that void:inDataset points from an RDF document URI to a void:Dataset URI. I had always assumed that it was between a resource URI and a dataset URI. So we may have to find another relation between a data item and a dataset.

agbeltran commented 9 years ago

+1 to Michel. void:inDataset is used to indicate that the triples serialised in an RDF document belong to the dataset (see http://www.w3.org/TR/void/). So, for the provenance of a data item, maybe we can use dct:isPartOf rather than void:inDataset?

AlasdairGray commented 9 years ago

The void:inDataset example is a bit of a gotcha in the VoID documentation. I'd completely missed the fact that it uses the DBpedia data namespace rather than the resource namespace. This does alleviate one of the concerns that @JervenBolleman had about the number of additional triples he'd need to add to the UniProt data. However, it does not meet the use-case requirement of linking a triple back to its dataset.

@agbeltran unfortunately dct:isPartOf is used for a different interpretation, particularly in VoID datasets (see note in Section 6.3 of the VoID Note).

I'm not aware of a similar property to void:inDataset.
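
To make the document-versus-resource distinction concrete, here is a minimal Python sketch of a backlink in which the subject is the RDF document URI rather than the resource URI. The `/resource/` to `/data/` rewrite, the helper names, and the `http://example.org/void#DBpedia` dataset URI are assumptions made for illustration; they are not taken from the VoID note or the HCLS guidance.

```python
# Sketch only: build a void:inDataset backlink from the RDF *document*
# that describes a resource, not from the resource itself.
VOID_IN_DATASET = "http://rdfs.org/ns/void#inDataset"


def document_uri(resource_uri: str) -> str:
    """Map a resource URI to the URI of the document describing it.

    Assumption: a DBpedia-style layout in which '/resource/' becomes '/data/'.
    """
    return resource_uri.replace("/resource/", "/data/", 1)


def backlink_triple(resource_uri: str, dataset_uri: str) -> tuple:
    """Return (document URI, void:inDataset, dataset URI)."""
    return (document_uri(resource_uri), VOID_IN_DATASET, dataset_uri)


def to_ntriples(triple: tuple) -> str:
    """Serialise a triple of three IRIs as one N-Triples line."""
    s, p, o = triple
    return f"<{s}> <{p}> <{o}> ."


# Hypothetical dataset URI, used purely as a placeholder.
triple = backlink_triple(
    "http://dbpedia.org/resource/Berlin",
    "http://example.org/void#DBpedia",
)
print(to_ntriples(triple))
```

Under these assumptions the subject of the backlink is the document, so nothing is asserted about which individual triples or resources belong to the dataset.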

micheldumontier commented 9 years ago

Alasdair and I believe that while using dc:isPartOf is very natural for a relation between an assertion (reified triple/RDF Statement or assertion graph in nanopublication/ovopub) and a dataset, it is a bit more challenging for the relation between a subject, predicate, or object of a triple and a dataset. Here are some options we thought of:

dc:isPartOf : http://purl.org/dc/terms/isPartOf
sio:refers-to : http://semanticscience.org/resource/refers-to

or some new relation

utilizes / is-utilized-in
is-data-item-in / has-data-item

which could easily be added to SIO, or some other vocabulary.

Thoughts?

JervenBolleman commented 9 years ago

If you want to relate a triple to a dataset, you need to reify the triple (or have another way of identifying it, e.g. single-member named graphs). Saying that a triple is in a dataset is properly within the scope of PROV-O. Making this part of the data explodes the dataset or description size for no real-world gain.

e.g.

# the original triple
uniprot:P05067 a up:Protein .
# its reification, tagged with every release that contains it
[] rdf:subject uniprot:P05067 ;
   rdf:predicate rdf:type ;
   rdf:object up:Protein ;
   void:inDataset uniprotkb:release2014_11, uniprotkb:release2014_10, uniprotkb:release2014_09, ... swiss-prot:10.2 .

Saying that a resource is described or talked about in a dataset is within the scope of VoID/HCLS dataset descriptions. But it should be optional, since done correctly it amounts to a listing of all unique IRIs in a dataset. For UniProt that is about 2 billion values, and similar for ChEMBL etc. The listing of all unique IRIs in a dataset is interesting but not of high value to our users.

In the end, I am doubtful that there is a solid, well-thought-out use case for void:inDataset or similar constructs that is not much better solved by using PROV-O in the original dataset (or by having a dataset consisting of nanopublications).
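
As a rough illustration of the size concern, the following Python sketch counts the extra triples implied by the reification pattern in the example above: three triples (rdf:subject, rdf:predicate, rdf:object) per original triple, plus one void:inDataset triple per release that contains it. The function name and the input numbers are hypothetical placeholders, not UniProt figures.

```python
def reification_overhead(n_triples: int, n_releases: int) -> int:
    """Extra triples needed to tag every triple with its releases.

    Per original triple: rdf:subject + rdf:predicate + rdf:object (3),
    plus one void:inDataset triple for each release containing it.
    """
    per_triple = 3 + n_releases
    return n_triples * per_triple


# Hypothetical inputs, chosen only to show the arithmetic.
n_triples = 1_000_000
n_releases = 12

extra = reification_overhead(n_triples, n_releases)
print(f"extra triples: {extra}")
print(f"overhead per original triple: {extra // n_triples}")
```

With these placeholder values, each original triple carries 15 additional triples, so the overhead grows with every new release even when the data itself does not change.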

micheldumontier commented 9 years ago

@JervenBolleman what is the PROV-O relation to say a triple is in a dataset?

This relation between a component of a triple and a dataset is already optional. You need not see it as a vital use case, although others, including myself, do. What we are doing is determining which relation should be used to express it.

void:inDataset has foaf:Document as its domain, so you would be inferring that the RDF statement is a document.

JervenBolleman commented 9 years ago

@micheldumontier I am too tired and was reading SHOULD as MUST. SHOULD is OK, although I would like OPTIONAL/MAY better.

PROV-O example, to my understanding:

# the original triple
uniprot:P05067 a up:Protein .
# the release has the reified statement as a member
uniprotkb:release2014_07 prov:hadMember [ rdf:subject uniprot:P05067 ;
     rdf:predicate rdf:type ;
     rdf:object up:Protein ;
     a prov:Entity, rdf:Statement ] .

But I would not be surprised if I am wrong in this interpretation.

micheldumontier commented 9 years ago

I don't think we want to add so many triples. I'm going to propose the addition of a new set of relations (object properties) to SIO:

has-data-item / is-data-item-in

'has data item' is a relation between a dataset and any entity described or referenced in it. 'is data item in' is a relation between an entity that is described or referenced in a dataset and that dataset.
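
A minimal Python sketch of how the two proposed relations could mirror each other, assuming they are treated as inverses. The short names stand in for the SIO terms (no SIO IRIs are asserted here), and the `invert` helper is hypothetical.

```python
# Sketch only: short names are placeholders for the proposed SIO relations.
INVERSE = {
    "has-data-item": "is-data-item-in",
    "is-data-item-in": "has-data-item",
}


def invert(triple: tuple) -> tuple:
    """Swap subject and object and replace the predicate with its inverse."""
    s, p, o = triple
    return (o, INVERSE[p], s)


# Dataset -> entity, using identifiers from the examples in this thread.
forward = ("uniprotkb:release2014_11", "has-data-item", "uniprot:P05067")

# Entity -> dataset.
backward = invert(forward)
print(backward)
```

Only one direction needs to be stored; under the inverse assumption the other can be derived, which adds a single triple per entity rather than a reified statement.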

micheldumontier commented 9 years ago

Added new entries to SIO.

ariutta commented 9 years ago

Would it be possible to add usage guidelines to the descriptions of the new terms, e.g., when to use SIO:is-data-item-in and when to use void:inDataset?

W3C-HCLSIG / HCLSDatasetDescriptions

void:inDataset #90