diachron / quality

Dataset Quality Assessment (part of WP5 of the Diachron EU FP7 project)
MIT License

Declarative language for quality metric definitions #41

Open clange opened 10 years ago

clange commented 10 years ago

Let's think about a declarative language for quality metrics.

That is, large parts of a new metric's implementation would be expressed as a dataset that is an instance of the daQ vocabulary.

In pseudocode, a declarative representation of the UndefinedClassesOrProperties metric could, for example, look like this:

IF TRIPLE MATCHES ?s rdf:type|rdfs:subClassOf|rdfs:domain|rdfs:range ?c 
                # ^^^ This would be a SPARQL graph pattern
THEN CHECK
  # Here we could use a SPARQL FILTER expression:
  (dqf:DereferenceableAsLOD(?c)
   || dqf:ExistsLocallyInThisDataset(?c)
   || dqf:OtherwiseKnownToUs(?c))
  && dqf:QuerySucceeds(?c a owl:Class)
                     # ^^^ once more a SPARQL graph pattern
    # Actually this check is more complex
    # but I'll leave it like this for now for the example
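
To make the intended evaluation order concrete, here is a minimal Python sketch of how an engine might interpret the rule above over an in-memory list of triples. It is only an illustration under stated assumptions: the `Triple` type, the `KNOWN_TERMS` set and all function names are hypothetical, terms are plain prefixed strings rather than real IRIs, the dereferenceability check is stubbed out, and the class check is the simplified one from the pseudocode.

```python
from typing import Iterable, List, NamedTuple, Set


class Triple(NamedTuple):
    s: str
    p: str
    o: str


# Predicates from the "IF TRIPLE MATCHES" graph pattern.
MATCH_PREDICATES = {"rdf:type", "rdfs:subClassOf", "rdfs:domain", "rdfs:range"}

# Toy stand-in for dqf:OtherwiseKnownToUs (hypothetical content).
# Terms listed here are also treated as already verified classes.
KNOWN_TERMS = {"owl:Class"}


def dereferenceable_as_lod(term: str) -> bool:
    # Stub: a real implementation would issue an HTTP request.
    return False


def exists_locally(dataset: Set[Triple], term: str) -> bool:
    # The term counts as locally defined if it occurs as a subject.
    return any(t.s == term for t in dataset)


def query_succeeds_is_class(dataset: Set[Triple], term: str) -> bool:
    # Simplified version of QuerySucceeds(?c a owl:Class).
    return term in KNOWN_TERMS or Triple(term, "rdf:type", "owl:Class") in dataset


def undefined_terms(triples: Iterable[Triple]) -> List[Triple]:
    """Return every matching triple whose object ?c fails the CHECK."""
    ordered = list(triples)
    dataset = set(ordered)
    failures = []
    for t in ordered:
        if t.p not in MATCH_PREDICATES:
            continue  # the IF part does not match this triple
        c = t.o
        located = (dereferenceable_as_lod(c)
                   or exists_locally(dataset, c)
                   or c in KNOWN_TERMS)
        if not (located and query_succeeds_is_class(dataset, c)):
            failures.append(t)
    return failures


example = [
    Triple("ex:Person", "rdf:type", "owl:Class"),
    Triple("ex:alice", "rdf:type", "ex:Person"),
    Triple("ex:bob", "rdf:type", "ex:Ghost"),    # ex:Ghost is never defined
    Triple("ex:alice", "ex:name", "Alice"),      # predicate is not matched
]
violations = undefined_terms(example)
```

With this toy data, only the `ex:bob` triple ends up in `violations`, because `ex:Ghost` is neither defined locally nor in the known set.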

Complex operators like DereferenceableAsLOD or ExistsLocallyInThisDataset or QuerySucceeds would be realised as custom SPARQL functions with a Java implementation, reusing code from methods we already have. (I used dqf for our custom namespace of “data quality functions”.)

Compare page 7 of http://svn.aksw.org/papers/2013/ISWC_LODStats/public.pdf. They get by without complex operators, but their task is simpler than ours.

This language could include elements for generating problem reports, which we need for cleaning. (@jerdeb @nfriesen please edit this into "quality report" if that's the correct term)