Specify in detail how a statistic was calculated (statistics provenance)

GoogleCodeExporter commented 9 years ago

Is it true, for example, that

{ ?x void:triples ?n } <=> { ?x void:statItem [ scovo:dimension 
void:numOfTriples; rdf:value ?n ] }

(modulo the unbound bnode on the RHS, which would generate an infinite closure)?

How do we ensure that ?n is a meaningful and stable value? Is it an estimate? 
Does one pull the entire dataset into a store and run a SPARQL SELECT COUNT(*)? 
What kind of inferencing, if any, is supported/enabled in the store? How does one 
express this necessary information about the provenance of the statistic?

Original issue reported on code.google.com by wwai...@gmail.com on 31 Oct 2010 at 10:10

GoogleCodeExporter commented 9 years ago

Another example, void:distinctSubjects. Consider,

    :foo a :bar.
    _:b1 a :bar.
    _:b2 a :bar.

Are there one, two or three distinct subjects here? Any
answer is correct depending on the presuppositions.
Either these presuppositions need to be nailed
down in the spec, or we need a way to express them in
the description of the statistic. (A reasonable approach
might be to nail down some common sets of presuppositions
in the spec and allow for the expression of others.)
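
To make the ambiguity concrete, here is a minimal sketch of three counting conventions applied to the example above. The function names and the string-based term representation are assumptions made for illustration; none of this comes from voiD or from any RDF library, and the blank node merging rule is deliberately simplistic (it only compares outgoing predicate/object pairs).

```python
# Terms are plain strings; blank nodes are recognised by the "_:" prefix.
TRIPLES = [
    (":foo", "a", ":bar"),
    ("_:b1", "a", ":bar"),
    ("_:b2", "a", ":bar"),
]


def is_bnode(term):
    return term.startswith("_:")


def count_all_labels(triples):
    """Every distinct subject term counts, blank node labels included."""
    return len({s for s, _, _ in triples})


def count_iris_only(triples):
    """Only named subjects count; blank nodes are ignored."""
    return len({s for s, _, _ in triples if not is_bnode(s)})


def count_merging_equivalent_bnodes(triples):
    """Named subjects count once each; blank nodes with identical
    sets of (predicate, object) pairs are merged into one."""
    named = {s for s, _, _ in triples if not is_bnode(s)}
    descriptions = {}
    for s, p, o in triples:
        if is_bnode(s):
            descriptions.setdefault(s, set()).add((p, o))
    merged = {frozenset(d) for d in descriptions.values()}
    return len(named) + len(merged)


COUNTS = {
    "all labels": count_all_labels(TRIPLES),
    "IRIs only": count_iris_only(TRIPLES),
    "equivalent bnodes merged": count_merging_equivalent_bnodes(TRIPLES),
}
print(COUNTS)
```

Each convention yields a different number for the same three triples, which is why the convention would have to travel with the statistic.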

Original comment by wwai...@gmail.com on 31 Oct 2010 at 12:21

GoogleCodeExporter commented 9 years ago

void:statItem and in fact the entire statistics module is deprecated in the new 
version.

See discussion here:
http://code.google.com/p/void-impl/issues/detail?id=18

See here for the replacement:
http://void-impl.googlecode.com/svn/trunk/guide2/index.html#statistics

My take on automated computation of statistics using RDF queries is implemented 
here:
http://github.com/cygri/make-void

We don't say anything about inferencing anywhere in voiD. My take: A dataset is 
a set of triples. So a triple either is contained in a dataset or it is not. 
When you define your void:Dataset resource, you have to decide whether you 
think of it as containing inferred triples or not, and under which entailment 
regime. Ideally, there would be some void:TechnicalFeatures to announce what 
you consider included.
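
As a rough illustration of why the entailment regime matters for a triple count, the sketch below compares a raw count with the count after applying just two RDFS rules (rdfs9 and rdfs11, i.e. subclass type propagation and subclass transitivity). The tiny dataset and the function name are invented for this example, and a real store's RDFS closure would add more triples than this subset does.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

# Hypothetical three-triple dataset, asserted triples only.
RAW = {
    (":Film", SUBCLASS, ":Work"),
    (":Work", SUBCLASS, ":Thing"),
    (":m1", TYPE, ":Film"),
}


def rdfs_subclass_closure(triples):
    """Apply rdfs9 and rdfs11 repeatedly until no new triple appears."""
    closed = set(triples)
    while True:
        sub = {(s, o) for s, p, o in closed if p == SUBCLASS}
        new = set()
        # rdfs11: subClassOf is transitive.
        for a, b in sub:
            for c, d in sub:
                if b == c:
                    new.add((a, SUBCLASS, d))
        # rdfs9: an instance of a class is an instance of its superclasses.
        for s, p, o in closed:
            if p == TYPE:
                for c, d in sub:
                    if o == c:
                        new.add((s, TYPE, d))
        if new <= closed:
            return closed
        closed |= new


print("raw triples:", len(RAW))
print("with subclass inference:", len(rdfs_subclass_closure(RAW)))
```

Both numbers are defensible values for void:triples, depending on whether the publisher considers the inferred triples part of the dataset.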

Original comment by richard....@gmail.com on 31 Oct 2010 at 1:21

GoogleCodeExporter commented 9 years ago

void:statItem straggles in the Tools section (I notice the "remove this 
section?" question), in the XML below the text "After applying the liftSSM 
transformation the resulting RDF/XML document would look like:"

However, my point is that the inference regime is a property of the *statistic*, 
not of the dataset or even the void description (although I could see it being 
inherited from the void description).

Original comment by wwai...@gmail.com on 31 Oct 2010 at 2:07

GoogleCodeExporter commented 9 years ago

A fuller version of my thoughts on this (submitted to SWJ just now),

On the Provenance of Linked Data Statistics
http://river.styx.org/ww/2010/10/ldstat-20101031.pdf

Original comment by wwai...@gmail.com on 31 Oct 2010 at 4:03

GoogleCodeExporter commented 9 years ago

@wwaites: I re-titled the issue, trying to better capture your core complaint.

You said about the fuzziness in how one could compute statistics: “It is 
these presuppositions [that] need to either be nailed down in the spec or we 
need a way to express them in the description of the statistic.”

Nailing them down too far is counter-productive because a publisher's technical 
setup might allow only one way of computing certain stats (e.g., if his system 
automatically computes the RDFS inference closure over the dataset, then he 
won't turn that off just because voiD says that raw triples are counted). 
Overconstraining the statistics is not helpful.

You mention another option: to provide a way of expressing the presuppositions 
in voiD itself.

But why? What's the value in that? At this point in the game, does anyone 
really need to know whether DBpedia contains three million entities or five 
million? What are the use cases where this makes a difference? Where are the 
dataset publishers that currently go out of their way to provide highly accurate 
statistics about their datasets, including the process by which they were calculated?

Ballpark estimates for statistics, on the other hand, are clearly useful for 
decision-making. Should I use the 1000-movie dataset or the 50000-movie dataset? 
Will this probably fit into my in-memory triple store or not? And data 
providers are clearly interested in providing these numbers.

So I'm inclined to leave the deliberate vagueness unchanged, unless someone 
shows the cowpath that we are supposed to pave over here.

Original comment by richard....@gmail.com on 2 Nov 2010 at 9:37

Changed title: Specify in detail how a statistic was calculated (statistics provenance)

GoogleCodeExporter commented 9 years ago

I agree ballpark statistics are useful!

The "audience" for this is not necessarily only dataset authors but people 
calculating statistics about others' datasets (like the work you guys did on 
LOD statistics). This type of analysis goes far beyond ballpark estimates.

The old void:statItem gave a convenient place to hang this information since 
you could annotate it with whatever information is needed to completely 
describe the statistic.

Assuming void:statItem is still there (or replaced with another suitable 
predicate), the first part of the question is: what is the relationship between 
it and void:triples, void:distinctXXX, etc.?

If void:statItem is removed, some guidance on how to express statistics that 
aren't defined in void itself would be useful.

Not saying that the void-spec statistics must necessarily be nailed down, but 
*if* someone calculates things in a specific way, how do they express that? 
This question might well be out of scope for voiD...

Original comment by wwai...@gmail.com on 3 Nov 2010 at 11:51

GoogleCodeExporter commented 9 years ago

void:statItem is gone. void:triples and friends are the replacement.

It is true that void:statItem was more powerful because you could add extra 
annotations to it.

On the topic of defining your own statistics that go beyond what's offered in 
voiD, there are two main options. If your statistics are simple, then you can 
just define new properties for use on datasets in your own namespace. You could 
have ex:percentageOfTypedDatasets or ex:averageLiteralLength or whatever.

If that's too simplistic, then I'd recommend the use of a dedicated statistics 
vocabulary (I'd use Data Cube over SCOVO these days). Instead of void:statItem, 
I'd make the dataset another dimension on the observations. So one observation 
could say: date=2010-11-03, dataset=DBpedia, number of triples=100M, method of 
calculation=educated guess.
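
A minimal sketch of what such an observation might look like, rendered as Turtle from Python. Only `qb:Observation` is taken from the Data Cube vocabulary; every `ex:` term, the example.org IRIs and the function name are hypothetical placeholders, and a complete Data Cube description would also need a `qb:DataSet` and a data structure definition, which are omitted here.

```python
def observation_turtle(obs_id, date, dataset_iri, triples, method):
    """Render one statistics observation as a Turtle string.

    All ex: properties are made up for this sketch. The method string
    is not escaped, so it must not contain double quotes.
    """
    return "\n".join([
        "@prefix qb: <http://purl.org/linked-data/cube#> .",
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
        "@prefix ex: <http://example.org/stats#> .",
        "",
        f"ex:{obs_id} a qb:Observation ;",
        f'    ex:date "{date}"^^xsd:date ;',
        f"    ex:dataset <{dataset_iri}> ;",
        f"    ex:numberOfTriples {int(triples)} ;",
        f'    ex:methodOfCalculation "{method}" .',
    ])


EXAMPLE = observation_turtle(
    "obs1",
    "2010-11-03",
    "http://example.org/dataset/dbpedia",  # placeholder dataset IRI
    100_000_000,
    "educated guess",
)
print(EXAMPLE)
```

Because the dataset is just another dimension, the same structure can hold several observations of one dataset that differ only in date or method of calculation.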

Original comment by richard....@gmail.com on 3 Nov 2010 at 2:06

GoogleCodeExporter commented 9 years ago

I'll close this. I think we agree that expressing the precise method of 
statistics calculation in a voiD description is important for certain use 
cases, but is, at this point, out of scope for the core voiD spec.

Original comment by richard....@gmail.com on 22 Nov 2010 at 10:40

Changed state: WontFix

cygri / void

Specify in detail how a statistic was calculated (statistics provenance) #79