diachron / quality

Dataset Quality Assessment (part of WP5 of the Diachron EU FP7 project)
MIT License

LabelsUsingCapitals metric #34

Closed (clange closed 10 years ago)

clange commented 10 years ago

Implement a metric LabelsUsingCapitals that identifies triples whose property is from a pre-configured list of label properties (a subset of the annotation properties from #32), and whose object uses a bad style of capitalisation.

We consider the following widely used label properties:

For now, this list of properties can be hard-coded (maybe somehow shared with #32); we might think about a more extensible implementation later.

For now we define "bad" capitalisation as "camel case", for which we should design a regular expression to match such strings. Consider, e.g., a label "InterestingThing": this is a suitable name for a class/resource, but the label should rather be "interesting thing" or "Interesting Thing".

E.g. a triple like the following should be matched:

<http://...> <http://www.w3.org/2000/01/rdf-schema#label> "InterestingThing" .
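One possible reading of "camel case" is sketched below in Python; this is only an illustration, not the regular expression used by the actual implementation. It assumes a label counts as camel case when it is a single token (no whitespace) in which a lowercase ASCII letter is directly followed by an uppercase one. The names `CAMEL_CASE` and `is_camel_case` are hypothetical.

```python
import re

# Assumption: "camel case" = one whitespace-free token containing a
# lowercase ASCII letter immediately followed by an uppercase ASCII letter.
CAMEL_CASE = re.compile(r"\S*[a-z][A-Z]\S*")


def is_camel_case(label: str) -> bool:
    """Return True if the whole label is a single camel-case token."""
    return CAMEL_CASE.fullmatch(label) is not None


if __name__ == "__main__":
    for label in ("InterestingThing", "Interesting Thing", "interesting thing"):
        print(label, is_camel_case(label))
```

Under this definition, labels with non-ASCII letters or acronym-style names are not covered; whether those should count as "bad" would need to be decided separately.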

The metric value is defined as the ratio of labels with "bad capitalisation" to all labels (i.e. all triples having such properties).
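As a rough sketch of that ratio (in Python, and not the project's actual code), the metric could be computed over a stream of triples as below. The assumptions: triples arrive as `(subject, predicate, object)` string tuples, only `rdfs:label` is listed as a label property (the full list is not reproduced here), a dataset without labels scores 0, and the camel-case test is the simple one described in the comment. All names are hypothetical.

```python
import re

# Assumption: only rdfs:label is listed; the real list would be longer.
LABEL_PROPERTIES = {"http://www.w3.org/2000/01/rdf-schema#label"}

# Assumption: camel case = whitespace-free token with a lowercase ASCII
# letter directly followed by an uppercase ASCII letter.
CAMEL_CASE = re.compile(r"\S*[a-z][A-Z]\S*")


def labels_using_capitals(triples, label_properties=LABEL_PROPERTIES):
    """Ratio of camel-case labels to all labels, in the range [0, 1]."""
    total = 0
    bad = 0
    for _subject, predicate, obj in triples:
        if predicate not in label_properties:
            continue  # not a label triple, so it does not count at all
        total += 1
        if CAMEL_CASE.fullmatch(obj):
            bad += 1
    # Assumption: with no labels there is nothing to complain about.
    return bad / total if total else 0.0


if __name__ == "__main__":
    rdfs_label = "http://www.w3.org/2000/01/rdf-schema#label"
    sample = [
        ("http://example.org/a", rdfs_label, "InterestingThing"),
        ("http://example.org/b", rdfs_label, "Interesting Thing"),
        ("http://example.org/c", "http://example.org/other", "NotALabel"),
    ]
    print(labels_using_capitals(sample))
```

Triples whose property is not a label property are ignored entirely, so they affect neither the numerator nor the denominator.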

Note: in the cleaning UI, triples that match this metric should be reported as non-critical errors.

(Background: D3.1 Table 20 on page 91)

muhammadaliqasmi commented 10 years ago

LabelsUsingCapitals identifies triples whose property is from a pre-configured list of label properties, and whose object uses a bad style of capitalization. The list of widely used annotation properties is stored in ..src/main/resources/LabelPropertiesList.txt
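The format of that file is not described in this thread. Purely as an illustration, the Python sketch below assumes one property URI per line, optionally wrapped in angle brackets, with blank lines and lines starting with `#` ignored. The function name `parse_label_properties` is hypothetical and does not come from the repository.

```python
def parse_label_properties(lines):
    """Return the set of property URIs found in an iterable of text lines.

    Assumed format (not confirmed by the issue): one URI per line.
    """
    properties = set()
    for line in lines:
        uri = line.strip()
        # Skip blank lines and '#' comment lines (assumed conventions).
        if not uri or uri.startswith("#"):
            continue
        # Tolerate N-Triples style angle brackets around the URI.
        if uri.startswith("<") and uri.endswith(">"):
            uri = uri[1:-1]
        properties.add(uri)
    return properties


if __name__ == "__main__":
    example = [
        "# label properties\n",
        "http://www.w3.org/2000/01/rdf-schema#label\n",
        "\n",
    ]
    print(parse_label_properties(example))
```

An open file object can be passed directly, since iterating over it yields lines.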

Metric value = number of label literals with bad capitalization / total number of label literals

Metric value range = [0, 1]; best case = 0; worst case = 1

-- implemented in issue#34 branch
-- issue#34 branch merged with master branch