predicate for expressing dataset entry points

GoogleCodeExporter commented 9 years ago

Consider, for example, http://semantic.ckan.net/catalogue
This is not a dump of the entire dataset, nor is it really
an example resource. What it contains are links to all of
the dcat:CatalogRecords in the catalogue. If one wanted
to crawl the catalogue, this would be an appropriate
starting point: by following the object links, coverage of
the entire dataset (which itself spans multiple graphs) is
guaranteed.

Suggest void:entrypoint for this purpose, somewhere
between void:dataDump and void:exampleResource.

Original issue reported on code.google.com by wwai...@gmail.com on 31 Oct 2010 at 9:49

Blocking: #85, #82

GoogleCodeExporter commented 9 years ago

Sounds like a good idea. +1 to address this; need to discuss whether in MR2 or
later. This is related to Issue 63.

Original comment by Michael.Hausenblas on 31 Oct 2010 at 10:03

GoogleCodeExporter commented 9 years ago

What would the range and definition be? As a consumer of voiD data, what can I 
expect to be able to do with the value of this property?

Original comment by K.J.W.Al...@gmail.com on 31 Oct 2010 at 11:45

GoogleCodeExporter commented 9 years ago

I think,

void:entrypoint rdfs:range void:Dataset.

and,

{ ?d void:entrypoint ?s } => { ?d void:subset ?s }.

so maybe,

void:entrypoint rdfs:subPropertyOf void:subset.

A consumer would expect the value of this property to dereference,
and additionally to be able to dereference every object found there
that is in the dataset, so,

 1. dereference the entrypoint/subset, put in store
 2. find links in the entrypoint/subset, ?e, with a query like,

    SELECT ?o WHERE
    {
      ?d void:subset ?e .
      ?d void:uriRegexp ?r .
      GRAPH ?e { ?s ?p ?o } 
      # SPARQL's function is regex(); str() turns the IRI into a string
      FILTER (regex(str(?o), ?r))
    }
 3. repeat until no new graphs are encountered

The difference from void:subset is that applying this algorithm
to void:entrypoint should guarantee that the crawl is complete.
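
The three steps above can be sketched in Python as a worklist crawl. This is only an illustration under stated assumptions: `fetch` is a hypothetical stand-in for dereferencing a URI and returning the object URIs found in the retrieved graph, and `crawl`, `uri_pattern`, `PAGES` and the example.org URIs are made up here; none of them come from voiD or from a real store.

```python
import re
from collections import deque
from typing import Callable, Iterable, Set


def crawl(entrypoint: str,
          uri_pattern: str,
          fetch: Callable[[str], Iterable[str]]) -> Set[str]:
    """Return every graph URI reached from `entrypoint`.

    `fetch` is a hypothetical helper: it dereferences a URI and yields the
    object URIs of the triples in the retrieved graph.
    """
    in_dataset = re.compile(uri_pattern)
    seen = {entrypoint}
    queue = deque([entrypoint])
    while queue:                      # step 3: repeat until nothing new turns up
        graph_uri = queue.popleft()
        for obj in fetch(graph_uri):  # step 1: dereference
            # step 2: keep only links that match the dataset's URI pattern
            if in_dataset.search(obj) and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return seen


# Toy in-memory "web" standing in for HTTP dereferencing (assumed data).
PAGES = {
    "http://example.org/catalogue": ["http://example.org/record/1",
                                     "http://example.org/record/2"],
    "http://example.org/record/1": ["http://example.org/record/2",
                                    "http://other.example.com/x"],
    "http://example.org/record/2": [],
}

reached = crawl("http://example.org/catalogue",
                r"^http://example\.org/",
                lambda uri: PAGES.get(uri, []))
print(sorted(reached))
```

The `seen` set is what makes the loop terminate when records link to each other. The sketch cannot tell whether the crawl is complete: that is exactly the guarantee the publisher would be asserting by using the property.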

Original comment by wwai...@gmail.com on 31 Oct 2010 at 12:03

GoogleCodeExporter commented 9 years ago

The name and description of the property should make clear that the motivation 
here is about crawling the entire dataset.

Perhaps call it void:entryResource? void:rootResource? void:topResource?

And something along these lines:

“Many datasets are structured in a tree-like fashion, with one or a few natural 
“top concepts” or “entry points”, to which all other entities are 
connected through a small number of steps. Using this property implies 1. that 
the entry resource is a central entity of particular importance in the dataset; 
and 2. that the entire dataset can be crawled by resolving the entry resource 
and recursively following links to other URIs in the retrieved RDF response.”

I wouldn't relate it to void:subset, because that would imply that the object 
is a void:Dataset, and would preclude the use of top entities in an entity 
hierarchy, like say a skos:ConceptScheme or foaf:Organization. So I'd just 
leave the range open.

But domain should be void:Dataset obviously, and perhaps make it a subproperty 
of void:exampleResource?

Original comment by richard....@gmail.com on 31 Oct 2010 at 1:12

GoogleCodeExporter commented 9 years ago

cygri wrote:
  > I wouldn't relate it to void:subset, because that would imply that the object is a void:Dataset

Isn't it the case that anything you get by dereferencing any URI in the dataset
(modulo matching uriRegexp) is a void:subset? I read void:subset as almost
equivalent to rdfg:subGraph...

Original comment by wwai...@gmail.com on 31 Oct 2010 at 2:05

GoogleCodeExporter commented 9 years ago

@wwaites: No. A foaf:Person has a dereferenceable URI, but definitely is not a 
void:Dataset.

Also, quoting the definition of void:Dataset:

“A dataset is a set of RDF triples that are published, maintained or 
aggregated by a single provider. Unlike RDF graphs, which are purely 
mathematical constructs [RDF Concepts], the term dataset has a social 
dimension: We think of a dataset as a meaningful collection of triples, that 
deal with a certain topic, originate from a certain source or process, are 
hosted on a certain server, or are aggregated by a certain custodian. Also, 
typically a dataset is accessible on the Web, for example through resolvable 
HTTP URIs or through a SPARQL endpoint, and it contains sufficiently many 
triples that there is benefit in providing a concise summary.”

That should explain the difference between void:subset and rdfg:subGraph. I 
think that Section 1.4 also motivates that distinction quite well.

If you have any issue (no pun intended) with that take on void:subset, please 
create a separate issue for it.

Original comment by richard....@gmail.com on 31 Oct 2010 at 2:34

GoogleCodeExporter commented 9 years ago

After some more thinking, I want this feature. This could be useful in our 
lodcloud work to define the notion of “bulk accessibility”: To be bulk 
accessible, a dataset must have a dump or a crawl entry point that allows 
complete crawling. (Cue discussion about whether a SPARQL endpoint enables bulk 
access.)
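
Read as a predicate, that definition could look like the sketch below. The dict-based description and the key names `dataDump` and `entrypoint` are assumptions for illustration only (the property name was still under discussion at this point). A SPARQL endpoint is deliberately not counted, since that question is left open above.

```python
from typing import Dict, List


def is_bulk_accessible(description: Dict[str, List[str]]) -> bool:
    """True if the dataset description lists a dump or a crawl entry point."""
    has_dump = bool(description.get("dataDump"))
    has_entry_point = bool(description.get("entrypoint"))
    return has_dump or has_entry_point


# Assumed example descriptions, keyed by (hypothetical) property name.
with_entry_point = {"entrypoint": ["http://example.org/catalogue"]}
endpoint_only = {"sparqlEndpoint": ["http://example.org/sparql"]}

print(is_bulk_accessible(with_entry_point))
print(is_bulk_accessible(endpoint_only))
```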

So +1 for doing this, and doing it in Release 2.0.

Original comment by richard....@gmail.com on 2 Nov 2010 at 9:40

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Proposed text for a new section 1.11 is below, as well as proposed RDFS.

1.11 Root resources

Many datasets are structured in a tree-like fashion, with one or a few natural 
“top concepts” or “entry points”, and all other entities reachable from 
these root resources in a small number of steps.

One or more such root resources can be named using the void:rootResource 
property. Naming a resource as a root resource implies 1. that it is a 
central entity of particular importance in the dataset; and 2. that the entire 
dataset can be crawled by resolving the root resource(s) and recursively 
following links to other URIs in the retrieved RDF response.

Root resources make good entry points for crawling an RDF dataset.

This property is similar to void:exampleResource. While void:exampleResource 
names particularly representative or typical resources in the dataset, 
void:rootResource names particularly important or central resources that make 
good entry points for navigating the dataset.

void:rootResource a rdf:Property;
    rdfs:label "Root Resource";
    rdfs:comment "A resource of particular importance in a dataset. All resources in a dataset can be reached by following links from its root resources in a small number of steps.";
    rdfs:domain void:Dataset;
    .
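
The reachability claim in the rdfs:comment can be checked on a small link graph with a breadth-first search. The sketch below is an illustration under assumptions: `steps_from_roots`, the `LINKS` table and the `ex:` names are invented for the example, and links are given as a plain dict rather than read from RDF.

```python
from collections import deque
from typing import Dict, Iterable, List


def steps_from_roots(links: Dict[str, List[str]],
                     roots: Iterable[str]) -> Dict[str, int]:
    """Breadth-first search: minimum number of link hops from any root."""
    steps = {root: 0 for root in roots}
    queue = deque(steps)
    while queue:
        current = queue.popleft()
        for target in links.get(current, []):
            if target not in steps:
                steps[target] = steps[current] + 1
                queue.append(target)
    return steps


# Assumed toy data: a concept scheme as root, with two levels below it.
LINKS = {
    "ex:scheme": ["ex:animals", "ex:plants"],
    "ex:animals": ["ex:cats"],
    "ex:plants": [],
    "ex:cats": [],
    "ex:orphan": ["ex:cats"],   # nothing links to this one
}

steps = steps_from_roots(LINKS, ["ex:scheme"])
unreachable = set(LINKS) - set(steps)
print(steps)
print(unreachable)
```

A non-empty `unreachable` set would mean the chosen roots do not satisfy the second implication, and the largest value in `steps` gives a rough sense of what "a small number of steps" means for a particular dataset.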

Original comment by richard....@gmail.com on 24 Nov 2010 at 8:38

Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 9 years ago

Looks good to me. +1 to implement it in the guide/voc

Original comment by Michael.Hausenblas on 25 Nov 2010 at 8:15

GoogleCodeExporter commented 9 years ago

We decided in today's teleconference to implement the proposal from Comment 8, 
pending acceptance from Keith.

Original comment by richard....@gmail.com on 7 Dec 2010 at 11:50

Added labels: Milestone-Release2.0

GoogleCodeExporter commented 9 years ago

Implemented in r157. Closing.

Original comment by richard....@gmail.com on 7 Dec 2010 at 12:20

Changed state: Fixed

cygri / void

predicate for expressing dataset entry points #78