diachron / quality

Dataset Quality Assessment (part of WP5 of the Diachron EU FP7 project)
MIT License

Fix semantics of UndefinedClassesOrProperties metric #30

Open clange opened 10 years ago

clange commented 10 years ago

The predicate of a quad can be an undefined property. The object of a quad can be an undefined class or an undefined property, but only when the quad's predicate is one of those listed below.

The subject of a quad never references classes or properties in external vocabularies, so we don't have to analyse the subject for this metric.

This is the list of predicates that indicate that the object must be a defined class:

This is the list of predicates that indicate that the object must be a defined property:

In all of the cases above, "being defined" does not only mean "defined in some external ontology"; it may also mean "defined in the current LOD dataset". For the latter case we can assume that a class/property is defined at an earlier position in the current dataset, i.e. at a position that we have already processed.

FYI there are some more predicates for which we don't know whether the object is expected to be a class or property, but we'll ignore these predicates for now.

BTW, the current implementation for predicate and object looks a bit redundant to me; maybe we can shorten it by factoring out some of the common source code lines into a shared method.
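To make the intended semantics concrete, here is a minimal sketch, not the project's actual code. Everything in it is an assumption for illustration: the class name UndefinedTermsSketch, the plain-string quad representation, and the two one-element predicate sets, which merely stand in for the lists in the description above (rdf:type for "object must be a class", rdfs:subPropertyOf for "object must be a property"). It also assumes that a term counts as locally defined once an earlier quad types it as rdfs:Class or rdf:Property. The predicate and object checks share one helper method, along the lines of the refactoring suggested above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; NOT the project's UndefinedClassesOrProperties class.
class UndefinedTermsSketch {
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String RDFS = "http://www.w3.org/2000/01/rdf-schema#";

    // Assumption: illustrative stand-ins for the predicate lists in the issue.
    static final Set<String> CLASS_OBJECT_PREDICATES =
            new HashSet<>(Arrays.asList(RDF + "type"));
    static final Set<String> PROPERTY_OBJECT_PREDICATES =
            new HashSet<>(Arrays.asList(RDFS + "subPropertyOf"));

    // Terms known to be defined, externally or earlier in this dataset.
    private final Set<String> defined = new HashSet<>();
    private int undefinedCount = 0;

    UndefinedTermsSketch(Set<String> externallyDefined) {
        defined.addAll(externallyDefined);
    }

    // Quads must be passed in dataset order; the subject is never checked.
    void process(String subject, String predicate, String object) {
        checkDefined(predicate);
        if (CLASS_OBJECT_PREDICATES.contains(predicate)
                || PROPERTY_OBJECT_PREDICATES.contains(predicate)) {
            checkDefined(object);
        }
        // Assumption: typing a subject as rdfs:Class or rdf:Property
        // defines it for all later quads of the same dataset.
        if ((RDF + "type").equals(predicate)
                && ((RDFS + "Class").equals(object)
                    || (RDF + "Property").equals(object))) {
            defined.add(subject);
        }
    }

    // Shared by the predicate check and the object check.
    private void checkDefined(String term) {
        if (!defined.contains(term)) {
            undefinedCount++;
        }
    }

    int undefinedCount() {
        return undefinedCount;
    }
}
```

Note that this sketch counts every occurrence of an undefined term, not every distinct term; which of the two the metric should report is a separate decision.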

clange commented 10 years ago

I have improved the documentation of the compute method in https://github.com/diachron/quality/commit/1c1aaa592df13758fbc1af9cd6c1afe9e8ff1828, but identified some more issues, so I'll keep this issue open for now, and will rewrite the description.

clange commented 10 years ago

@muhammadaliqasmi I saw you have already made some progress, which is great.

However, let me remind you that when doing checks of the type “the predicate of the triple is ”, one always has to compare the full URI of the property, and the comparison is case-sensitive. Considering https://github.com/diachron/quality/blob/master/src/main/java/de/unibonn/iai/eis/diachron/qualitymetrics/intrinsic/consistency/UndefinedClassesOrProperties.java#L111 it would be correct to do something like "http://www.w3.org/1999/02/22-rdf-syntax-ns#type".equals(tmpURI). This is similar to the way you implemented LabelsUsingCapitals, except that this time we can really hard-code the names of the properties instead of using a configuration file. When I use the notation rdf:type above, it's really just because I'm too lazy to type out the full URI.
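As a small self-contained illustration of the comparison suggested above (the class and method names are hypothetical, and the predicate is assumed to be available as a plain string): the full URI is held in a constant and compared with equals, with the constant on the left so that a null argument simply yields false.

```java
// Hypothetical helper; names are for illustration only.
class PredicateCheckSketch {
    // Full URI, hard-coded; the prefixed form rdf:type is only shorthand.
    static final String RDF_TYPE =
            "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    // Exact comparison of the whole URI string.
    static boolean isRdfType(String predicateUri) {
        // Constant-first equals never throws on a null argument.
        return RDF_TYPE.equals(predicateUri);
    }
}
```

With this, neither the prefixed name "rdf:type" nor the local name "type" matches; only the full URI does.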

In the next few minutes I will add a list of OWL properties above.

clange commented 10 years ago

To the description above I have now added the OWL properties that indicate that the expected object should be a defined class or property.

clange commented 10 years ago

I'm sorry I didn't post this earlier, but when I inspected this class once more (in this version) I noticed that property names are not handled in a case-sensitive way (e.g. here). However, in the RDF data model all URIs are case-sensitive. When fixing these, please also think of other metrics where the same thing may need fixing – thanks!
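To show what goes wrong, here is a minimal sketch with hypothetical names. It is not taken from the class in question; it only contrasts the two java.lang.String comparisons on a URI that differs from rdfs:label in the case of a single letter.

```java
// Hypothetical sketch contrasting exact and case-insensitive URI matching.
class CaseSensitivitySketch {
    static final String RDFS_LABEL =
            "http://www.w3.org/2000/01/rdf-schema#label";

    // Correct for RDF: two URIs are the same only if the strings are equal.
    static boolean matchesExactly(String uri) {
        return RDFS_LABEL.equals(uri);
    }

    // Problematic: treats ...#Label and ...#label as the same property.
    static boolean matchesIgnoringCase(String uri) {
        return RDFS_LABEL.equalsIgnoreCase(uri);
    }
}
```

For the input http://www.w3.org/2000/01/rdf-schema#Label the first method returns false and the second returns true, so a case-insensitive check would silently accept a property that is not rdfs:label.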

nfriesen commented 10 years ago

I created two different metrics: one for undefined classes and one for undefined properties. Important to note: the undefined classes metric, as it is defined below, computes the ratio of undefined classes in object position.
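One possible reading of such a ratio, as a rough sketch only: the denominator is not spelled out in this comment, so the code below assumes it is the number of object positions where a class is expected, and it uses rdf:type as the only class-indicating predicate. The class name, the String[] {subject, predicate, object} representation and the pre-computed set of defined classes are all hypothetical.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch; the denominator is an assumption, see above.
class UndefinedClassRatioSketch {
    static final String RDF_TYPE =
            "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    // Each quad is given as {subject, predicate, object}.
    static double undefinedClassRatio(List<String[]> quads,
                                      Set<String> definedClasses) {
        int classPositions = 0;
        int undefinedClasses = 0;
        for (String[] quad : quads) {
            String predicate = quad[1];
            String object = quad[2];
            if (RDF_TYPE.equals(predicate)) {
                classPositions++;
                if (!definedClasses.contains(object)) {
                    undefinedClasses++;
                }
            }
        }
        // No class-expecting positions: report 0.0 instead of dividing by zero.
        return classPositions == 0
                ? 0.0
                : (double) undefinedClasses / classPositions;
    }
}
```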