freme-project / technical-discussion

This repository is used for technical discussions.

[NIF prov] dataset/model used to generate the annotation #108

Closed m1ci closed 7 years ago

neradis commented 8 years ago

@m1ci In your mail you detailed the question of this issue as:

e.g. we have the string "Diego Maradona" and the link taIdentRef:dbpedia:Diego_Maradona; we also need to encode the information that this link came from DBpedia

To me there are two separate aspects/variants of interpretation for 'used to generate':

details for interpretation of link targets

Sticking with the football example:

ex:fb1_offset_16_31
  nif:anchorOf "Diego Maradona"^^xsd:string ;
  itsrdf:taIdentRef dbpedia:Diego_Maradona ;
  nif:taIdentProv ex:spotlight_service_dbp2014 .

Although we know from the shape of URI dbpedia:Diego_Maradona that it's a DBpedia resource, we cannot tell whether the "target namespace" for the linking service was the most recent release of DBpedia, a previous release or even a customized compilation of DBpedia dataset downloads. To make this clear we would need a concept like a "scoped resource identifier" (just a tentative idea off the top of my head):

ex:fb1_offset_16_31
  itsrdf:taIdentRef [
    a idea:ScopedIdentifier ;
    idea:identifier dbpedia:Diego_Maradona ;
    idea:dataset <http://dbpedia.org/release/2014/04/core>
  ] .

training/parameterisation dataset information

This information seems to fit most naturally as an additional piece of information for the prov:SoftwareAgent resource used for provenance:

<http://api.freme-project.eu/example/description/e-entity/dbpedia-spotlight/documents/v0.5/dbp_2014_04_core>
  a prov:SoftwareAgent, doap:Version ;
  doap:shortdesc "NIF REST API for entity recognition and linking using DBPedia Spotlight engine" ;
  doap:revision "0.5" ;
  dcterms:isPartOf <http://freme-project.eu/example/description/e-entity> ;
  prov:wasGeneratedBy [
    a prov:Activity ; 
    rdfs:comment "this describes the preparation process for the linking service (parametrisation, training, index creation...)"@en;
    prov:used <http://dbpedia.org/release/2014/04/core>, <http://dataid.dbpedia.org/lod/vu-wordnet/dataid#dataset>
  ] .

ex:fb1_offset_16_31
  itsrdf:taIdentRef dbpedia:Diego_Maradona ;
  nif:taIdentProv <http://api.freme-project.eu/example/description/e-entity/dbpedia-spotlight/documents/v0.5/dbp_2014_04_core> .

If one wanted to simplify, one could consider freme:trainingDataset and freme:targetDataset properties that could be attached directly to the prov:SoftwareAgent description. However, in my opinion these new vocab items would be out of scope for NIF and would need to become part of a new ontology 'missing bits and pieces for FREME' ;-)
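A rough sketch of what that simplified form might look like. The freme: namespace URI is a placeholder and neither property exists in a published ontology. Which dataset counts as "training" and which as "target" is also only an assumption made for illustration, since the prov:used statement above does not distinguish them:

```turtle
# Sketch only: freme: is a hypothetical namespace, and both properties are
# the proposed (not yet defined) vocabulary items mentioned above.
@prefix prov:  <http://www.w3.org/ns/prov#> .
@prefix freme: <http://example.org/freme-ontology#> .

<http://api.freme-project.eu/example/description/e-entity/dbpedia-spotlight/documents/v0.5/dbp_2014_04_core>
  a prov:SoftwareAgent ;
  # assumed split: link targets come from the DBpedia core release
  freme:targetDataset <http://dbpedia.org/release/2014/04/core> ;
  # assumed split: both datasets were used to prepare the service
  freme:trainingDataset <http://dbpedia.org/release/2014/04/core> ,
                        <http://dataid.dbpedia.org/lod/vu-wordnet/dataid#dataset> .
```

Compared with the prov:wasGeneratedBy variant, this drops the intermediate prov:Activity node but depends on vocabulary that would still have to be defined somewhere.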

jnehring commented 8 years ago

One idea to make this simpler: we could model the information about which dataset or model was used to produce an annotation via the tool that produced the annotation. The tool is specified in #104. For example, if a service exposes named entity recognition with two different models, this can be considered as two different tools. Someone can decide to host additional information about these tools (e.g. that they are exposed by the same service) in a triple store if this information is really needed.
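A minimal sketch of the "two models, two tools" idea. All ex: resources are made up for illustration, and nif:taIdentProv is simply reused from the example earlier in this thread rather than taken from whatever #104 settles on:

```turtle
# Sketch only: every ex: resource below is hypothetical.
@prefix ex:      <http://example.org/> .
@prefix prov:    <http://www.w3.org/ns/prov#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix itsrdf:  <http://www.w3.org/2005/11/its/rdf#> .
@prefix nif:     <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .

# One service exposing two models, described as two separate tools.
ex:ner_tool_model_a a prov:SoftwareAgent ;
  dcterms:isPartOf ex:ner_service .   # optional extra information

ex:ner_tool_model_b a prov:SoftwareAgent ;
  dcterms:isPartOf ex:ner_service .   # optional extra information

# The annotation only points at the tool that produced it.
ex:fb1_offset_16_31
  itsrdf:taIdentRef dbpedia:Diego_Maradona ;
  nif:taIdentProv ex:ner_tool_model_a .
```

The annotation itself stays small. Anything about datasets or models hangs off the tool resource and can be left out entirely by services that do not want to publish it.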

I further suggest making this optional, to lower the entry barrier to NIF. When a developer of a NIF service wants to give detailed provenance information, it is a good idea to have a standard for this. Forcing developers to support all provenance information might scare off programmers because it is quite complicated.

Although we know from the shape of URI dbpedia:Diego_Maradona that it's a DBpedia resource

It is also possible that the link was produced from another dataset that uses DBpedia identifiers.

jnehring commented 7 years ago

provenance discussion is over