W3C-HCLSIG / HCLSDatasetDescriptions

6.6.2 Enhanced Statistics uses LinkSet wrongly #81

Closed VladimirAlexiev closed 9 years ago

VladimirAlexiev commented 10 years ago

Sec 6.6.2 uses LinkSet to provide

This is totally wrong: void:LinkSet and void:linkPredicate are used to describe links between datasets, not counts within one dataset. You should use void:propertyPartition (and maybe void:classPartition within it) and void:distinctSubjects.

KimJBaran commented 10 years ago

I cannot comment on propertyPartition, classPartition and distinctSubjects, but the use of LinkSet and linkPredicate appears to be wrong.

From http://www.w3.org/TR/void/#linkset:

VoID also allows the description of RDF links between datasets. An RDF link is an RDF triple whose subject and object are described in different datasets.


The property void:linkPredicate can be used to specify the type of links that connect two datasets. In other words, it names the RDF property in the predicate position of the link triples.

The following example uses void:linkPredicate to state that the DBpedia and Geonames datasets are linked by triples that have the owl:sameAs predicate:

micheldumontier commented 10 years ago

disagree. a void:Dataset is a set of RDF triples. a void:Linkset is a collection of RDF triples between two datasets. Therefore, we can create Linksets between any arbitrary datasets.

VladimirAlexiev commented 10 years ago

Yes, but the section I'm quoting doesn't talk about 2 datasets. It appears to want to provide some stats of 1 dataset, and uses wrong class and property. See http://www.w3.org/TR/void/#class-property-partitions (as opposed to http://www.w3.org/TR/void/#describing-linksets)

micheldumontier commented 10 years ago

A dataset is any set of triples. in the formulation for the enhanced statistics, we describe a set of relations (i.e. linkset) between arbitrary partitions of a dataset. each partition is a dataset in its own right (see void:subset). i think this approach is justifiable, and falls within the scope of VoID constructs provided. You seem not to agree - could you provide an alternative formulation?

VladimirAlexiev commented 10 years ago

we describe a set of relations (i.e. linkset) between arbitrary partitions of a dataset.

Not true. E.g. the section "properties and the number of unique objects linked to the property" shows this query:

SELECT  ?p (COUNT(DISTINCT ?o ) AS ?count ) { ?s ?p ?o } GROUP BY ?p

Where do you see 2 arbitrary (i.e. independent) partitions here?

The right way to express this is (see http://www.w3.org/TR/void/#statistics):

    void:propertyPartition [
        void:property <property-uri> ;
        void:distinctObjects "###"^^xsd:integer] .

This counts any objects (URIs, blank nodes, literals), as per the above query and the VOID spec. If you want to count only resources, see http://www.w3.org/TR/void/#class-property-partitions and use rdfs:Resource (not rdfs:Class):

    void:propertyPartition [void:property <property-uri> ;
      void:classPartition [void:class rdfs:Resource;
        void:distinctObjects "###"^^xsd:integer]].

The key to understanding the above is that both void:propertyPartition and void:classPartition create sub-datasets, which are sets of triples. So it's legitimate to speak of the void:distinctObjects of those triples.
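As a minimal sketch of how the "only resources" count could be obtained, the query below assumes that "resources" means any object that is not a literal (IRIs and blank nodes). That reading is an assumption, and the query is not taken from the HCLS note.

```sparql
# Distinct non-literal objects per property.
# Assumption: "resource" = IRI or blank node, so literals are filtered out.
SELECT ?p (COUNT(DISTINCT ?o) AS ?count)
WHERE {
  ?s ?p ?o .
  FILTER (!isLiteral(?o))
}
GROUP BY ?p
```

Each result row would supply the `<property-uri>` and the void:distinctObjects value of one partition.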

micheldumontier commented 10 years ago

We need to specify 1 - the property 2 - the subject class partition 3 - the object class partition

so the reason we started using the linkset was because of "void:subjectsTarget" and "void:objectsTarget" to specify both the subject and target class partitions. Can you elaborate on how we can get this kind of functionality using a void:propertyPartition?

VladimirAlexiev commented 10 years ago

Dear Michel,

I cannot see any query in the quoted section that reports on property and two classes. The closest query that I see is: unique subject types that are linked through a property to unique object types:

SELECT (COUNT(DISTINCT ?s ) AS ?scount ) ?p (COUNT(DISTINCT ?o ) AS ?ocount ) { ?s ?p ?o } GROUP BY ?p

It counts distinct subjects and objects per property. This can be reported as follows:

    void:propertyPartition [
        void:property <property-uri> ;
        void:distinctSubjects "###"^^xsd:integer ;
        void:distinctObjects "###"^^xsd:integer] .
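
One way such a description could be generated directly from the data is a CONSTRUCT query wrapped around the counting query. This is only a sketch: the dataset IRI `<http://example.org/dataset>` is a hypothetical placeholder, and a real description would use the IRI of the dataset being described.

```sparql
PREFIX void: <http://rdfs.org/ns/void#>

# Emit one void:propertyPartition per property, carrying both counts.
# <http://example.org/dataset> is a placeholder, not a real dataset IRI.
CONSTRUCT {
  <http://example.org/dataset> void:propertyPartition [
    void:property ?p ;
    void:distinctSubjects ?scount ;
    void:distinctObjects ?ocount ] .
}
WHERE {
  # Aggregates are not allowed directly in a CONSTRUCT, hence the subquery.
  { SELECT ?p (COUNT(DISTINCT ?s) AS ?scount) (COUNT(DISTINCT ?o) AS ?ocount)
    WHERE { ?s ?p ?o }
    GROUP BY ?p }
}
```

The blank node in the template is instantiated afresh for every solution, so each property gets its own partition node.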

However, the same query seems to want to (incorrectly) report on property and two classes:

    void:subset [
        a void:LinkSet ; 
        void:linkPredicate <property-uri> ;
        void:subjectsTarget [
            void:class <subject-type-uri> ;
            void:entities "###"^^xsd:integer ;
            void:objectsTarget [
                void:class <object-type-uri> ;
                void:entities "###"^^xsd:integer]]].

To make such a report, you need to use the http://ldf.fi/void-ext ontology (see here for a tool implementing such counts: http://jiemakel.github.io/aether/, and a paper explaining it), e.g. like this:

    void:propertyPartition [void:property <property-uri> ;
        void:classPartition [void:class <subject-class-uri>;
            void-ext:objectClassPartition [void:class <object-class-uri>;
                void:triples "###"^^xsd:integer]]].

Above we use:

Note that if you have some subclass or subproperty inference in the repository, those partitions won't be exclusive...
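
The numbers for such a nested partition could be computed along the following lines. This is a sketch that assumes subject and object classes are given by explicit rdf:type triples; it is not a query from the HCLS note or from the Aether tool.

```sparql
# Triples per (property, subject class, object class).
# Assumption: classes are asserted (or inferred) as rdf:type triples.
SELECT ?p ?sc ?oc (COUNT(*) AS ?triples)
WHERE {
  ?s ?p ?o .
  ?s a ?sc .
  ?o a ?oc .
}
GROUP BY ?p ?sc ?oc
```

A resource with several types contributes the same triple to several groups, which is why the partitions are not exclusive when inference is switched on. Triples with literal objects never match, because a literal cannot be the subject of `?o a ?oc`.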

micheldumontier commented 10 years ago

so the objectClassPartition is a property of the classPartition? and the void:triples are associated with the objectClassPartition? strange.

VladimirAlexiev commented 10 years ago

void-ext:objectClassPartition is analogous to void:classPartition: they make a subset (both are subprops of void:subset). The difference is that objectClassPartition restricts the Objects of triples in the subset, whereas classPartition restricts the Subjects.

This needs to be qualified: http://www.w3.org/TR/void/#class-property-partitions says "The (classPartition) contains all triples that describe entities that have this class as their rdf:type". Is it true that the word "describe" means "have as subject"? SPARQL deliberately leaves freedom about how a "DESCRIBE ?s" query is implemented. Most repos return Concise Bounded Description (CBD), which includes all "?s ?p ?o" triples, but also all triples "?s ?p1 ?blank. ?blank ?p2 ?o" where ?blank is a blank node (recursively); and "?statement rdf:subject ?s. ?statement ?p ?o" (i.e. all reified statements about ?s). Others even return Symmetric CBD, which includes statements where ?s is Object.

objectClassPartition is a property of the classPartition?

No: objectClassPartition can be applied against any void:Dataset, no matter whether it's the result of a partition or not. The subsets being void:Dataset, you can subdivide them further. You can swap the order/nesting of the propertyPartition, classPartition, objectClassPartition and still get almost the same results. At each level, you need to describe the parameter of partition: void:property and void:class (twice).

By "almost" I refer to the ambiguity of "describe" above. You also need to be careful about literals: if your repo does not automagically declare all literals to be of class rdfs:Literal, then objectClassPartition will skip all data triples (having a literal as their object). And "declare literals as rdfs:Literal" means e.g. "123 a rdfs:Literal", which is weird, because in RDF 1.0 literals cannot be the subject of a statement (and RDF 1.1 does not allow that either).
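
To see how many triples this affects in a given repository, a query like the following could be run. It is a sketch that simply counts the triples with a literal object per property, i.e. the ones an object class partition would skip unless literals are typed.

```sparql
# Triples with a literal object, per property.
SELECT ?p (COUNT(*) AS ?literalTriples)
WHERE {
  ?s ?p ?o .
  FILTER (isLiteral(?o))
}
GROUP BY ?p
```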

micheldumontier commented 9 years ago

Hi, ok, i modified the relevant structures - see the diff here : https://github.com/joejimbo/HCLSDatasetDescriptions/compare/statistics

how does that look?

micheldumontier commented 9 years ago

@VladimirAlexiev can you have a look at the diff?

VladimirAlexiev commented 9 years ago

  1. Instead of void:entities, I think you should use void:distinctObjects or void:distinctSubjects respectively. Although void:entities is left a bit vague in the spec (number of "main entities" in a dataset), in this case it would mean "all nodes". But you want to report only the distinct nodes in object resp subject position.
  2. Cosmetic: I'd collapse all closing ] on the same line (and you don't need punctuation). So instead of this:

    void:entities "###"^^xsd:integer ;

Use that:

    void:entities "###"^^xsd:integer ]].


micheldumontier commented 9 years ago

@VladimirAlexiev ok, i have made the edits. can you verify the correctness for each statistic?

micheldumontier commented 9 years ago


VladimirAlexiev commented 9 years ago

Thanks for adding me to the contributors! Could you please change it to this:

<dd>Vladimir Alexiev, Ontotext Corp, Bulgaria &lt;<a href="mailto:vladimir.alexiev@ontotext.com">vladimir.alexiev@ontotext.com</a>&gt;</dd>

micheldumontier commented 9 years ago


AlasdairGray commented 9 years ago

Please ensure that the examples both within the document and hcls.ttl are updated. (Relates to issue #89)

egombocz commented 9 years ago

I'll look at the IO Informatics use case and will harmonize it in accordance with the guidelines

AlasdairGray commented 9 years ago

@egombocz I think your comment relates to issue #74

mscottm commented 9 years ago

I sent a note to Vladimir asking him to verify what Michel did (followup to https://github.com/joejimbo/HCLSDatasetDescriptions/issues/81#issuecomment-61188841).

micheldumontier commented 9 years ago

@VladimirAlexiev can you have another look at the latest?

micheldumontier commented 9 years ago

refactored statistics have now been merged as per commit e85578a9da34c2022b971141dcdb386437d3d7a4