Another example, void:distinctSubjects. Consider,
:foo a :bar.
_:b1 a :bar.
_:b2 a :bar.
Are there one, two or three distinct subjects here? Any
answer is correct depending on the presuppositions.
These presuppositions either need to be nailed
down in the spec, or we need a way to express them in
the description of the statistic. (A reasonable approach
might be to nail down some common sets of presuppositions
in the spec and allow for the expression of others.)
Original comment by wwai...@gmail.com
on 31 Oct 2010 at 12:21
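As a rough illustration of the comment above, the following Python sketch counts the distinct subjects of those three triples under three different presuppositions. The function name, the mode names and the tuple representation are assumptions made for this example; none of them come from voiD.

```python
# The three triples from the example, as (subject, predicate, object) tuples.
# Blank nodes keep their "_:" prefix, as in Turtle.
TRIPLES = [
    (":foo", "rdf:type", ":bar"),
    ("_:b1", "rdf:type", ":bar"),
    ("_:b2", "rdf:type", ":bar"),
]


def is_blank(term):
    """Return True if the term is a blank node label."""
    return term.startswith("_:")


def distinct_subjects(triples, blank_nodes="distinct"):
    """Count distinct subjects under a stated presupposition.

    blank_nodes (hypothetical mode names):
      "distinct" - every blank node label counts as its own subject
      "ignore"   - only named subjects are counted
      "merge"    - all blank nodes collapse into one anonymous subject
    """
    subjects = {s for s, _, _ in triples}
    named = {s for s in subjects if not is_blank(s)}
    blanks = subjects - named
    if blank_nodes == "distinct":
        return len(named) + len(blanks)
    if blank_nodes == "ignore":
        return len(named)
    if blank_nodes == "merge":
        return len(named) + (1 if blanks else 0)
    raise ValueError(f"unknown blank node mode: {blank_nodes}")


counts = {
    mode: distinct_subjects(TRIPLES, mode)
    for mode in ("distinct", "ignore", "merge")
}
print(counts)  # {'distinct': 3, 'ignore': 1, 'merge': 2}
```

The same data yields three, one or two, so a bare number is ambiguous unless the counting mode is fixed by the spec or published alongside the statistic.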
void:statItem, and in fact the entire statistics module, is deprecated in the
new version.
See discussion here:
http://code.google.com/p/void-impl/issues/detail?id=18
See here for the replacement:
http://void-impl.googlecode.com/svn/trunk/guide2/index.html#statistics
My take on automated computation of statistics using RDF queries is implemented
here:
http://github.com/cygri/make-void
We don't say anything about inferencing anywhere in voiD. My take: A dataset is
a set of triples. So a triple either is contained in a dataset or it is not.
When you define your void:Dataset resource, you have to decide whether you
think of it as containing inferred triples or not, and under which entailment
regime. Ideally, there would be some void:TechnicalFeatures to announce what
you consider included.
Original comment by richard....@gmail.com
on 31 Oct 2010 at 1:21
void:statItem straggles in the Tools section (I notice the "remove this
section?" question), in the XML below the text "After applying the liftSSM
transformation the resulting RDF/XML document would look like:"
However, my point is that the inference regime is a property of the *statistic*,
not the dataset or even the void description (although I could see it being
inherited from the void description).
Original comment by wwai...@gmail.com
on 31 Oct 2010 at 2:07
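The two positions above can be made concrete with a small sketch. It computes a triple count over the same asserted data with and without a partial RDFS closure, and attaches the regime to the resulting statistic record. The sample triples, the regime labels "none" and "rdfs-subclass", and the record layout are all assumptions for illustration. Only two RDFS entailment rules (rdfs9 and rdfs11) are applied, so this is not a full RDFS reasoner.

```python
# Hypothetical asserted data: one instance and a two-step class hierarchy.
ASSERTED = {
    (":rex", "rdf:type", ":Dog"),
    (":Dog", "rdfs:subClassOf", ":Mammal"),
    (":Mammal", "rdfs:subClassOf", ":Animal"),
}


def rdfs_subclass_closure(triples):
    """Add triples implied by subclass transitivity (rdfs11) and
    type propagation along subclass links (rdfs9). Nothing else."""
    closed = set(triples)
    while True:
        subclass = {(s, o) for s, p, o in closed if p == "rdfs:subClassOf"}
        inferred = set()
        # rdfs11: A subClassOf B, B subClassOf C => A subClassOf C
        for a, b in subclass:
            for c, d in subclass:
                if b == c:
                    inferred.add((a, "rdfs:subClassOf", d))
        # rdfs9: X type A, A subClassOf B => X type B
        for s, p, o in closed:
            if p == "rdf:type":
                for a, b in subclass:
                    if o == a:
                        inferred.add((s, "rdf:type", b))
        if inferred <= closed:
            return closed  # fixed point reached
        closed |= inferred


def triple_count_statistic(triples, regime):
    """Return a statistic record that carries its own regime label."""
    if regime == "none":
        data = set(triples)
    elif regime == "rdfs-subclass":
        data = rdfs_subclass_closure(triples)
    else:
        raise ValueError(f"unknown regime: {regime}")
    return {"statistic": "triples", "value": len(data), "regime": regime}


raw = triple_count_statistic(ASSERTED, "none")
closed = triple_count_statistic(ASSERTED, "rdfs-subclass")
print(raw["value"], closed["value"])  # 3 6
```

Both numbers are valid triple counts for the same source data, which is why the regime has to be stated somewhere, whether on the dataset or on the individual statistic.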
A fuller version of my thoughts on this (submitted to SWJ just now),
On the Provenance of Linked Data Statistics
http://river.styx.org/ww/2010/10/ldstat-20101031.pdf
Original comment by wwai...@gmail.com
on 31 Oct 2010 at 4:03
@wwaites: I re-titled the issue, trying to better capture your core complaint.
You said about the fuzziness in how one could compute statistics: “It is
these presuppositions need to either be nailed down in the spec or we need a
way to express them in the description of the statistic.”
Nailing them down too far is counter-productive because a publisher's technical
setup might allow only one way of computing certain stats (e.g., if his system
automatically computes the RDFS inference closure over the dataset, then he
won't turn that off just because voiD says that raw triples are counted).
Overconstraining the statistics is not helpful.
You mention another option: to provide a way of expressing the presuppositions
in voiD itself.
But why? What's the value in that? At this point in the game, does anyone
really need to know whether DBpedia contains three million entities or five
million? What are the use cases where this makes a difference? Where are the
dataset publishers that currently go out of their way providing highly accurate
statistics about their datasets, including how they were calculated?
Ballpark estimates for statistics, on the other hand, are clearly useful for
decision-making. Should I use the 1000-movie dataset or the 50000-movie dataset?
Will this probably fit into my in-memory triple store or not? And data
providers are clearly interested in providing these numbers.
So I'm inclined to leave the deliberate vagueness unchanged, unless someone
shows the cowpath that we are supposed to pave over here.
Original comment by richard....@gmail.com
on 2 Nov 2010 at 9:37
I agree ballpark statistics are useful!
The "audience" for this is not necessarily only dataset authors but people
calculating statistics about others' datasets (like the work you guys did on
LOD statistics). This type of analysis goes far beyond ballpark estimates.
The old void:statItem gave a convenient place to hang this information since
you could annotate it with whatever information is needed to completely
describe the statistic.
Assuming void:statItem is still there (or is replaced with another suitable
predicate), the first part of the question is: what is the relationship between
it and void:triples, void:distinctXXX, etc.?
If void:statItem is removed, some guidance on how to express statistics that
aren't defined in void itself would be useful.
Not saying that the void-spec statistics must necessarily be nailed down, but
*if* someone calculates things in a specific way, how do they express that?
This question might well be out of scope for voiD...
Original comment by wwai...@gmail.com
on 3 Nov 2010 at 11:51
void:statItem is gone. void:triples and friends are the replacement.
It is true that void:statItem was more powerful because you could add extra
annotations to it.
On the topic of defining your own statistics that go beyond what's offered in
voiD, there are two main options. If your statistics are simple, then you can
just define new properties for use on datasets in your own namespace. You could
have ex:percentageOfTypedDatasets or ex:averageLiteralLength or whatever.
If that's too simplistic, then I'd recommend the use of a dedicated statistics
vocabulary (I'd use Data Cube over SCOVO these days). Instead of void:statItem,
I'd make the dataset another dimension on the observations. So one observation
could say: date=2010-11-03, dataset=DBpedia, number of triples=100M, method of
calculation=educated guess.
Original comment by richard....@gmail.com
on 3 Nov 2010 at 2:06
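A minimal sketch of the second option in the comment above, with the dataset as one dimension of an observation, might look like the following. The values are the ones given in the comment. The ex: namespace and every ex: property name are hypothetical placeholders, and only qb:Observation is taken from the Data Cube vocabulary. The output is an illustrative Turtle fragment, not a complete or validated Data Cube description.

```python
from dataclasses import dataclass
from datetime import date

# Prefix declarations for the generated Turtle. The ex: namespace is made up.
PREFIXES = "\n".join([
    "@prefix qb: <http://purl.org/linked-data/cube#> .",
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
    "@prefix ex: <http://example.org/ns#> .",
])


@dataclass(frozen=True)
class Observation:
    observed_on: date  # dimension: when the figure was produced
    dataset: str       # dimension: which dataset it describes
    triples: int       # measure: number of triples
    method: str        # annotation: how the figure was obtained


def to_turtle(obs, node="_:obs1"):
    """Render one observation as a Turtle fragment (ex: terms are hypothetical)."""
    return "\n".join([
        f"{node} a qb:Observation ;",
        f'    ex:date "{obs.observed_on.isoformat()}"^^xsd:date ;',
        f"    ex:dataset {obs.dataset} ;",
        f"    ex:numberOfTriples {obs.triples} ;",
        f'    ex:methodOfCalculation "{obs.method}" .',
    ])


obs = Observation(
    observed_on=date(2010, 11, 3),
    dataset="ex:DBpedia",
    triples=100_000_000,
    method="educated guess",
)
turtle = PREFIXES + "\n\n" + to_turtle(obs)
print(turtle)
```

Because the method of calculation travels with each observation, two observations about the same dataset can disagree on the number without contradicting each other.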
I'll close this. I think we agree that expressing the precise method of
statistics calculation in a voiD description is important for certain use
cases, but is at this point out of scope for the core voiD spec.
Original comment by richard....@gmail.com
on 22 Nov 2010 at 10:40
Original issue reported on code.google.com by
wwai...@gmail.com
on 31 Oct 2010 at 10:10