SHACL/RDFUnit for mappings

kurzum commented 7 years ago

@VladimirAlexiev Hey we have an initial version of transforming the mappings to RML, see here https://github.com/dbpedia/extraction-framework/tree/rml/mappings/rml/en

Would you be so kind to go through your issues and see how we can use SHACL and/or RDFUnit to find quality issues? Maybe comment on the respective issues directly and then post the numbers in the comment in this issue. You don't have to write them all by yourself, but it would be good for the community to have a blueprint. Thanks, Sebastian

andimou commented 7 years ago

@kurzum quality assessment of mapping rules expressed in RML and performed by RDFUnit was covered in the past: https://link.springer.com/chapter/10.1007/978-3-319-25010-6_8

What additionally is expected from your side with SHACL?

VladimirAlexiev commented 7 years ago

Hm, that would be an interesting task but I'm not sure I can do it since I don't fully understand the RML mappings.

I compared http://mappings.dbpedia.org/index.php?title=Mapping_en:DavisCup_player&oldid=28857 to https://github.com/dbpedia/extraction-framework/blob/rml/mappings/rml/en/Mapping_en:DavisCup_player.ttl

The mapping is very simple:

{{TemplateMapping | mapToClass = TennisPlayer
| mappings = 
    {{ PropertyMapping | templateProperty = name | ontologyProperty = foaf:name }}
}}

The result has 45 triples, of which about 30 have to do with functions. @andimou, @wmaroy: Why are functions needed to map a single field without any transformation? Why are function invocations emitted as RDF? Where are these triples stored? How are result triples collected and isolated from these temporary execution triples?

(In addition to these questions, I posted two problems: https://github.com/dbpedia/extraction-framework/issues/507, https://github.com/dbpedia/extraction-framework/issues/508).

I'll try to dig up an appropriate problem from my old presentation http://vladimiralexiev.github.io/pres/20150209-dbpedia/dbpedia-problems-long.html

VladimirAlexiev commented 7 years ago

@andimou refreshed myself on the paper (Sec 4)

"Consistency validation of the mapping definitions": but if RML generates valid R2RML, that's not needed. Hopefully one can't write a template mapping that would cause invalid R2RML
"Consistency validation and quality assessment of the dataset as projected by its mapping definitions": gives an example of checking a mapped property and its target rdf:type against subclass axioms. But if you look at http://vladimiralexiev.github.io/pres/20150209-dbpedia/dbpedia-problems-long.html#sec-7-3, most such errors are in the data, eg United_Kingdom, Switzerland, Kajang, Prehistory, 18_май are not Persons (DQA not MQA).

I think @kurzum's idea is to try some specific validations. I went through my presentation and a lot are hard to formalize, eg:

how to capture that cyrilliqueName is an idiotic property, and instead one should use name with lang tag bg or ru or sr-Cyrl or whatever is appropriate?
how to capture that event means the same as sportsDiscipline and thus remove this prop?

But here are a couple of examples

http://vladimiralexiev.github.io/pres/20150209-dbpedia/dbpedia-problems-long.html#sec-1-3 To avoid nonsense mappings like {{ PropertyMapping | templateProperty = 1 | ontologyProperty = number }} we shouldn't use numbered props but only named props. I.e. rml:reference shouldn't start with a number
Check that each GeocoordinatesMapping complies with the model at http://mappings.dbpedia.org/index.php/Template:GeocoordinatesMapping, i.e. uses 1, 2 or 8 source props. A similar task is described in https://github.com/dbpedia/extraction-framework/issues/308 but I couldn't find the respective BG mapping because of https://github.com/dbpedia/extraction-framework/issues/511. A valid example is https://github.com/dbpedia/extraction-framework/blob/rml/mappings/rml/en/Mapping_en:Infobox_airport.ttl. The task is to ensure that each dbf:latFunction (eg <Function/LatitudeFunction>) has a correct complement of 1, 2 or 8 related params, eg dbf:latDirectionParameter is part of the 8
Check that no invalid namespaces are used. Eg https://github.com/dbpedia/extraction-framework/issues/512: uses http://en.dbpedia.org/resource/, should be http://dbpedia.org/resource/. But if that's the only case, there's little point in writing a shape for it, just fix it globally. (The same example uses dbr-en:Conrwall, which is misspelt. But one can't check this easily in the mapping)
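
A minimal sketch of the three checks above in Python, assuming the mapping has already been parsed into plain (subject, predicate, object) string tuples. The function names, the tuple layout, the `BAD_NAMESPACES` list, the sample triples and the way geo parameters are counted are illustrative assumptions, not part of RML, SHACL or RDFUnit. In SHACL, the first check would roughly correspond to an `sh:pattern` constraint on `rml:reference`.

```python
import re

# Hypothetical list of namespaces that should never appear in a mapping.
BAD_NAMESPACES = ("http://en.dbpedia.org/resource/",)

# Allowed numbers of source parameters for a geo mapping (1, 2 or 8).
ALLOWED_GEO_PARAM_COUNTS = {1, 2, 8}


def numbered_references(triples):
    """Return subjects whose rml:reference value starts with a digit."""
    return [s for s, p, o in triples
            if p == "rml:reference" and re.match(r"\d", o)]


def bad_namespace_uses(triples):
    """Return triples in which any term starts with a known-bad namespace."""
    return [t for t in triples
            if any(term.startswith(BAD_NAMESPACES) for term in t)]


def geo_param_count_ok(param_names):
    """True if a geo function has 1, 2 or 8 distinct parameters."""
    return len(set(param_names)) in ALLOWED_GEO_PARAM_COUNTS


# Made-up sample data, not taken from a real mapping file.
triples = [
    ("<M/1/ObjectMap>", "rml:reference", "name"),
    ("<M/2/ObjectMap>", "rml:reference", "1"),
    ("<M/3>", "rr:constant", "http://en.dbpedia.org/resource/Cornwall"),
]
print(numbered_references(triples))
print(len(bad_namespace_uses(triples)))
print(geo_param_count_ok(["latitude", "longitude"]))
```

The cyrilliqueName and event/sportsDiscipline cases mentioned earlier are not covered: they need human judgement about property semantics rather than a pattern over the mapping.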

wmaroy commented 7 years ago

@VladimirAlexiev

Why are functions needed to map a single field without any transformation?

Every property extraction undergoes a transformation since all values of a property in an Infobox are in wikitext and are not clean. The SimplePropertyFunction extracts these values based on the different parameters that can be given to it (it is based on the PropertyMapping from the wiki mappings). Additionally, the predicate that this value is mapped to also influences the transformation (dates are parsed differently, for example). This extra parameter for the function (the ontologyPropertyParameter) is added afterwards in a reasoning step (derived from rr:predicate) to limit the size of the RML mappings. These different parsers can also be used on their own, but the SimplePropertyFunction decides the parser based on what predicate is used.
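
The predicate-driven choice of parser described above can be pictured roughly as follows. This is a sketch only: the parser functions, the `PARSERS` table and the cleaning rules are hypothetical stand-ins, not the actual code of the Extraction Framework or the RML Processor.

```python
import re


def parse_text(wikitext):
    """Hypothetical text parser: unwrap [[link|label]] and drop quote markup."""
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", wikitext)
    return text.replace("'''", "").replace("''", "").strip()


def parse_year(wikitext):
    """Hypothetical year parser: first 4-digit number, or None."""
    match = re.search(r"\b(\d{4})\b", wikitext)
    return int(match.group(1)) if match else None


# Assumed lookup table: ontology predicate -> parser.
PARSERS = {"dbo:birthYear": parse_year}


def simple_property_function(wikitext, ontology_property):
    """Pick a parser from the predicate; fall back to plain text cleaning."""
    parser = PARSERS.get(ontology_property, parse_text)
    return parser(wikitext)


print(simple_property_function("[[Roger Federer|Roger]]", "dbo:name"))
print(simple_property_function("born in 1981", "dbo:birthYear"))
```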

Where are these triples stored? How are result triples collected and isolated from these temporary execution triples?

Currently, these are not stored and are only present in memory when executing.

Why are function invocations emitted as RDF?

This allowed the aligning of RML and FnO, introduced in the following paper: https://biblio.ugent.be/publication/8525863

VladimirAlexiev commented 7 years ago

How are result triples collected and isolated from these temporary execution triples??

wmaroy commented 7 years ago

@VladimirAlexiev Only the result triples are collected from the RML Processor by the Extraction Framework. The temporary execution triples are not stored in the current implementation. The RML Processor knows how to isolate these through a FunctionTermMap. All the triples that are generated through a FunctionTermMap are used to execute a transformation that generates result triples. These are the final triples that will be returned by this TermMap.

VladimirAlexiev commented 7 years ago

@wmaroy all values of a property in an Infobox are in wikitext and are not clean

The extraction framework deals with that. The mapping mechanism maps from dbp to dbo, it doesn't clean up values (with a few exceptions like combining geo parameters). Is your RML implementation also intended to replace the extraction framework?

Having rr:TriplesMaps for FunctionTermMaps in the RML looks worse than writing assembly. I don't think a 30x increase in the number of lines will be welcomed by any mapping editor.

wmaroy commented 7 years ago

@VladimirAlexiev The extraction framework indeed deals with that, but it cleans wikitext based on what type of mapping is used for that infobox property. A property mapping that maps to dbo:name will trigger a different cleaning process than a property mapping that maps to dbo:birthYear (in the current implementation of the EF). For the property mappings there are also additional parameters possible, such as prefix, suffix, select, etc. These are not commonly used, though.

RML won't replace the EF but it contains the same cleaning functions. An option we're working on for improving readability is a shorthand version of the mappings. This omits the FunctionTermMaps for simple property mappings that only contain an infobox property parameter and an ontology parameter. The FunctionTermMap can be inferred when processing the mapping files.

So it would look like:

<SimplePropertyMapping/1>
        a             rr:PredicateObjectMap ;
        rr:objectMap  <SimplePropertyMapping/1/ObjectMap> ;
        rr:predicate  dbo:name .

<SimplePropertyMapping/1/ObjectMap>
        a              rr:ObjectMap;
        rml:reference  "name" .

This cleans the wikitext based on <SimplePropertyMapping/1> rr:predicate dbo:name as a standard behaviour in the DBpedia context. The RML Processor would still receive the full RML mapping through inferencing. Both versions are stored in the repo, the shorthand and the full.
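
The inferencing step could be pictured like this. It is a sketch assuming mappings are held as plain Python dicts; the key names `function`, `parameters` and `propertyParameter` are illustrative assumptions (only SimplePropertyFunction and ontologyPropertyParameter are named in this thread), and the real full RML form uses FunctionTermMaps and is considerably longer.

```python
def expand_shorthand(pom):
    """Expand a shorthand predicate-object map into a function-based form.

    A map counts as shorthand here when its object map carries only an
    rml:reference; anything else is returned unchanged.
    """
    object_map = pom["rr:objectMap"]
    if set(object_map) != {"rml:reference"}:
        return pom  # already in full form (or not a simple property mapping)
    return {
        "rr:predicate": pom["rr:predicate"],
        "rr:objectMap": {
            "function": "SimplePropertyFunction",
            "parameters": {
                # Hypothetical name for the infobox property parameter.
                "propertyParameter": object_map["rml:reference"],
                # Derived from rr:predicate, as explained earlier in the thread.
                "ontologyPropertyParameter": pom["rr:predicate"],
            },
        },
    }


shorthand = {
    "rr:predicate": "dbo:name",
    "rr:objectMap": {"rml:reference": "name"},
}
full = expand_shorthand(shorthand)
print(full["rr:objectMap"]["parameters"])
```

Applying the function to an already expanded map returns it untouched, so the step can be run over a mix of shorthand and full mappings.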

Additionally, the original DBpedia mapping templates are currently being put into a UI (which automatically generates RML). So basic mappings templates (including intermediate and conditional mappings) do not need to be edited manually in any case. A clear representation of the mapping will be given as well.

VladimirAlexiev commented 7 years ago

@wmaroy

The extraction framework cleans wikitext based on what type of mapping is used for that infobox property.

This would be excellent but I'm afraid it's not true, see http://vladimiralexiev.github.io/pres/20150209-dbpedia/dbpedia-problems-long.html#sec-7-4.

See https://github.com/dbpedia/extraction-framework/issues/286: object property extractor should check rdfs:range.
See https://github.com/dbpedia/extraction-framework/issues/458: ISSN wrongly treated as integer and cut prematurely.

This is hard to fix:

Mapped props have a range, raw props don't
So the extractor would need to propagate ranges backward: raw<-mapped
Whereas data flows forward: raw->mapped
The extraction to dbp: and the subsequent mapping to dbo: happen in completely separate phases
the mapping framework doesn't map Properties but Templates, so conceivably two people could map a raw prop (eg dbp:issn) to two different dbo: props having different nature (object vs data) and datatype
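
The last point lends itself to a mechanical check. A rough sketch, assuming the mappings are available as (template, raw property, ontology property) tuples and that each ontology property's range is known; the sample data, the `RANGES` table and the function name are made up for illustration.

```python
from collections import defaultdict

# Hypothetical ranges of ontology properties.
RANGES = {
    "dbo:issn": "xsd:string",
    "dbo:number": "xsd:integer",
}


def conflicting_raw_props(mappings, ranges):
    """Return raw props that are mapped to ontology props with differing ranges."""
    seen = defaultdict(set)
    for _template, raw_prop, dbo_prop in mappings:
        seen[raw_prop].add(ranges.get(dbo_prop, "unknown"))
    return {raw: sorted(rs) for raw, rs in seen.items() if len(rs) > 1}


# Made-up mappings: two templates map the raw prop "issn" differently.
mappings = [
    ("Infobox journal", "issn", "dbo:issn"),
    ("Infobox magazine", "issn", "dbo:number"),
    ("Infobox journal", "name", "foaf:name"),
]
print(conflicting_raw_props(mappings, RANGES))
```

Such a report only flags the conflict; deciding which range the raw extraction should honour is still the hard part described above.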

If your framework can extract raw props taking into account prop ranges, that will be a great improvement. Can it?

RML won't replace the EF but it contains the same cleaning functions.

I don't understand. Will these functions be used (thus replacing that part of EF) or not?

shorthand version of the mappings

That would be nice. If you need to infer full mappings from them, fine, but don't store those anywhere and don't show them to people.

currently being put into a UI (which automatically generates RML)

Excellent, provided that UI is as easy to use as the original mapping wiki.

Looking forward to your progress!

dbpedia / mappings-tracker

SHACL/RDFUnit for mappings #93