W3C-HCLSIG / HCLSDatasetDescriptions


Concerning the statistics 6.6.2.6 #121

Open yayamamo opened 9 years ago

yayamamo commented 9 years ago

Hi. Section 6.6.2.6 of the spec defines the number of unique subjects and unique objects with respect to a predicate. This shows one aspect of the triples connecting two classes, but it leaves out another: the number of unique triples that connect typed subjects to typed objects, where the subjects and the objects each belong to a particular class.

As an extreme example, suppose 100 different subjects all have the same property value. The former (6.6.2.6) reports 100 distinctSubjects and 1 distinctObjects, while the latter reports 100 triples. As another example, suppose each of 10 different subjects has the same set of 10 property values. The former reports 10 distinctSubjects and 10 distinctObjects, while the latter reports 100 triples.
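To make the arithmetic in these two examples concrete, here is a minimal Python sketch. It is an illustration only, not part of the spec: the function name `partition_stats` and the resource names (`s0`, `o0`, `p`) are made up, and triples are modelled as plain tuples rather than real RDF terms.

```python
# Toy triples as (subject, predicate, object) tuples; all names are hypothetical.
def partition_stats(triples):
    """Return (distinct subjects, distinct objects, triple count)."""
    unique = set(triples)  # an RDF graph is a set, so duplicate triples collapse
    subjects = {s for s, _, _ in unique}
    objects = {o for _, _, o in unique}
    return len(subjects), len(objects), len(unique)

# Example 1: 100 subjects all point at the same object.
example1 = [(f"s{i}", "p", "o0") for i in range(100)]
# Example 2: each of 10 subjects points at the same 10 objects.
example2 = [(f"s{i}", "p", f"o{j}") for i in range(10) for j in range(10)]

print(partition_stats(example1))  # (100, 1, 100)
print(partition_stats(example2))  # (10, 10, 100)
```

Both examples contain 100 triples, yet their subject and object counts differ, which is why the triple count cannot be derived from the two distinct counts alone.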

I think the latter statistic is also useful for understanding the characteristics of the target dataset, and I feel it was in the document before. Wasn't it?

micheldumontier commented 9 years ago

The intent of 6.6.2.6 is to capture the total number of triples between subjects and objects of a specified type. For example, 100 distinct subjects may be connected to 10 distinct objects via 100 triples. One way of dealing with the total number of triples between subjects and objects of a certain type would be to simply declare a property partition on "rdfs:property".

yayamamo commented 9 years ago

I don't fully understand what it means to declare a property partition on "rdfs:property". My previous comment may have been vague. The statistic I'd like to know is the number of triples with a given predicate that connect specific classes (i.e., :c1 and :c2 in the example below). If the predicate connects only these classes, that number is identical to the total triple count for the predicate.

SELECT ?p (COUNT(?p) AS ?rc)
WHERE {
  GRAPH :graph {
    ?s ?p ?o .
    ?s a :c1 .
    ?o a :c2 .
  }
}
GROUP BY ?p

micheldumontier commented 9 years ago

6.6.2.6 does just this, does it not?

http://htmlpreview.github.io/?https://github.com/indiedotkim/HCLSDatasetDescriptions/blob/master/Overview.html#s6_6

yayamamo commented 9 years ago

I don't think so. The difference is what I wrote at the top of this issue: "the former" is 6.6.2.6, and "the latter" is the query I wrote just above.

count(distinct ?s) = 100, count(distinct ?o) = 1, count(?p) = 100

As an extreme example, suppose 100 different subjects all have the same property value. The former reports 100 distinctSubjects and 1 distinctObjects, while the latter reports 100 triples.

count(distinct ?s) = 10, count(distinct ?o) = 10, count(?p) = 100

As another example, suppose each of 10 different subjects has the same set of 10 property values. The former reports 10 distinctSubjects and 10 distinctObjects, while the latter reports 100 triples.

micheldumontier commented 9 years ago

So 6.6.2.2 covers properties and their number of triples. This query is not, however, limited to the subject and object being of some arbitrary type; we imagine that this is necessarily true.

SELECT ?p (COUNT(?p) AS ?triples) { ?s ?p ?o } GROUP BY ?p

yayamamo commented 9 years ago

That is to say, would 6.6.2.6 be as follows?

:rdfdataset
    void:propertyPartition [
        void:property <property-uri> ;
        void:triples "###"^^xsd:integer ;
        void:classPartition [
            void:class <subject-class-uri> ;
            void:distinctSubjects "###"^^xsd:integer ;
        ];
        void-ext:objectClassPartition [
            void:class <object-class-uri> ;
            void:distinctObjects "###"^^xsd:integer ;
        ];
    ] .

SELECT (COUNT(DISTINCT ?s) AS ?scount) ?stype ?p (COUNT(?p) AS ?pcount) ?otype (COUNT(DISTINCT ?o) AS ?ocount)
{
  ?s ?p ?o .
  ?s a ?stype .
  ?o a ?otype .
}
GROUP BY ?p ?stype ?otype

micheldumontier commented 9 years ago

Yes, that's right.
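For readers who want to check the numbers without a SPARQL endpoint, the grouped query above can be approximated in plain Python. This is only a sketch under stated assumptions: `class_pair_stats`, the `types` mapping, and the toy resource names are hypothetical, and resources without any type are skipped, as they would be by the `?s a ?stype` and `?o a ?otype` patterns.

```python
from collections import defaultdict

def class_pair_stats(triples, types):
    """Group triples by (predicate, subject class, object class).

    `types` maps a resource to the set of classes it is typed with.
    Returns {(p, stype, otype): (distinct subjects, triples, distinct objects)}.
    """
    groups = defaultdict(set)
    for s, p, o in set(triples):  # de-duplicate, since an RDF graph is a set
        for stype in types.get(s, ()):
            for otype in types.get(o, ()):
                groups[(p, stype, otype)].add((s, o))
    return {
        key: (len({s for s, _ in pairs}), len(pairs), len({o for _, o in pairs}))
        for key, pairs in groups.items()
    }

# Toy data: 10 subjects of class c1, each linked to the same 10 objects of class c2.
types = {f"s{i}": {"c1"} for i in range(10)}
types.update({f"o{j}": {"c2"} for j in range(10)})
triples = [(f"s{i}", "p", f"o{j}") for i in range(10) for j in range(10)]

stats = class_pair_stats(triples, types)
print(stats[("p", "c1", "c2")])  # (10, 100, 10)
```

The three numbers in each result correspond to void:distinctSubjects, void:triples, and void:distinctObjects in the proposed description for one property and one pair of classes.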