Describing a state of a dataset

GoogleCodeExporter commented 9 years ago

AFAIK, it is currently not possible to describe the (deployment/publishing)
state of a dataset. This issue was raised in a thread on SemanticOverflow [1].
I think VoID is quite a good place to offer terms for this kind of
description, e.g., starting with a simple property such as
void:status.

[1] http://www.semanticoverflow.com/questions/3983/vocabularity-describing-the-state-of-a-resource-draft-published

Original issue reported on code.google.com by zazi0...@googlemail.com on 2 Apr 2011 at 11:30

GoogleCodeExporter commented 9 years ago

PS: it seems that I cannot change the properties of an issue. So don't blame me 
for that ;)

Original comment by zazi0...@googlemail.com on 2 Apr 2011 at 11:32

GoogleCodeExporter commented 9 years ago

What would possible values of such a property be?

Original comment by richard....@gmail.com on 3 Apr 2011 at 8:41

GoogleCodeExporter commented 9 years ago

A non-answer might be to look at the "state" property on a CKAN dataset, which
can take on the values "active", "deleted" and "pending". This is obviously
underspecified; what is really going on is an attempt to describe the workflow
of dataset (or rather metadata, an important distinction) curation. How deeply
is it practical to model this? And how do we handle the fact that different
sources will have different workflows?

Not entirely unrelated: I've been searching for a suitable predicate to hold
the "revision_id", which is, for CKAN, a UUID such that the tuple ("package_id",
"revision_id") uniquely identifies a particular version of a metadata record,
but I can easily imagine such a construct for talking about the data themselves.
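
To illustrate the tuple idea, here is a minimal Python sketch. The class and function names are hypothetical and are not part of CKAN's API; the sketch only shows how a ("package_id", "revision_id") pair could serve as the key for one version of a metadata record.

```python
from dataclasses import dataclass
from uuid import UUID, uuid4


@dataclass(frozen=True)
class RecordVersion:
    """Hypothetical key for one version of a metadata record."""
    package_id: str
    revision_id: UUID


# Hypothetical in-memory store: one entry per (package_id, revision_id) pair.
versions = {}


def save_revision(package_id, metadata):
    """Store a copy of the metadata under a freshly generated revision id."""
    key = RecordVersion(package_id, uuid4())
    versions[key] = dict(metadata)
    return key


first = save_revision("example-dataset", {"state": "pending"})
second = save_revision("example-dataset", {"state": "active"})
```

Both keys share the same package_id but differ in revision_id, so each version of the record stays individually addressable.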

I wonder whether this is more in the realm of DCat than voiD...

Original comment by wwai...@gmail.com on 3 Apr 2011 at 9:00

GoogleCodeExporter commented 9 years ago

@cygri: going by the question that was raised on SemanticOverflow, something
like "alpha", "beta", ...

Original comment by zazi0...@googlemail.com on 3 Apr 2011 at 12:43

GoogleCodeExporter commented 9 years ago

@zazi: No, alpha/beta were proposed for software, not for datasets.

I'd like to see:

a) an actual proposal for what the values of that property would be
b) a use case or two
c) some examples of real datasets where the publisher actually announces some 
sort of status of the dataset; it doesn't seem to be so typical/widespread.

Original comment by richard....@gmail.com on 3 Apr 2011 at 7:31

GoogleCodeExporter commented 9 years ago

@cygri: yes, you are right, "draft" and "published" were the ones proposed for
datasets. Generally, it might be best to contact ngn
(http://www.semanticoverflow.com/users/1099/ngn) re. this issue, since she
requested it. I just thought that it might be useful to raise this
issue here, too. However, I only delegated it to this place ;)

Original comment by zazi0...@googlemail.com on 3 Apr 2011 at 8:21

GoogleCodeExporter commented 9 years ago

As I asked the question on semanticoverflow.com, I'd like to explain the
context.

We are developing a framework of REST services in a particular domain. Every
REST resource is an RDF resource as well, and has an RDF representation (among
others), according to a certain ontology. One of the types of resources is a
dataset (for the record, a dataset of chemical compounds and associated
experimental or calculated properties).

As datasets and dataset content (the triples describing the dataset) are
dynamically generated (created, uploaded or modified) by clients (human users,
client applications or other services), we would like to have a means to assign
a status to the dataset (and to other resources as well, for example new
predictive models). As an example, the resource in question might be just a
test dataset, uploaded by a user to see how the system works, or one which has
undergone a series of processing steps and on whose quality there is consensus.

Thus, what I am looking for
1) is not a status of a fixed CKAN dataset that changes rarely (e.g. with the
next release).

2) is not a status of an "RDF class or property" as in
http://www.w3.org/2003/06/sw-vocab-status/ (although its simplicity is
appealing); we would like to label an RDF individual instead of a class.

3) is not exactly a status of an ontology. The classes and properties of the
domain ontology are defined and versioned, as part of the definition of the web
services framework, but instances are dynamically generated and modified.

4) Publishing Status Ontology looks like the closest match, but PSO assumes
the published work is a document, not an arbitrary RDF individual (and a set of
associated properties).

I agree this is not a widespread use case, mostly because there are not many
frameworks like ours, which do not follow the mainstream practice of
centralized triple stores with relatively stable content, where the issue
could be handled by labeling the entire release, for example.

RDF being "one big graph" becomes an obstacle if one would like to
version / assign a status to a dynamically generated subset of individuals. I
am really interested in opinions on what the correct solution could be.
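
As a rough illustration of the requirement (not a proposed solution), the Python sketch below attaches exactly one status to an individual resource. The namespace http://example.org/ns# and the "status" property are hypothetical; neither VoID nor PSO defines them. The values "draft" and "published" are the ones from the SemanticOverflow question.

```python
# Hypothetical namespace and property; neither VoID nor PSO defines these.
EX = "http://example.org/ns#"
STATUS = EX + "status"
ALLOWED = {"draft", "published"}  # values from the SemanticOverflow question


def set_status(triples, resource, status):
    """Return a new triple set in which `resource` has exactly one status."""
    if status not in ALLOWED:
        raise ValueError(f"unknown status: {status!r}")
    # Drop any earlier status triple for this resource, keep everything else.
    kept = {t for t in triples if not (t[0] == resource and t[1] == STATUS)}
    kept.add((resource, STATUS, EX + status))
    return kept


def to_ntriples(triples):
    """Serialize (subject, predicate, object) URI triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in sorted(triples))


graph = set()
graph = set_status(graph, "http://example.org/dataset/1", "draft")
graph = set_status(graph, "http://example.org/dataset/1", "published")
print(to_ntriples(graph))
```

Replacing the earlier status triple rather than adding a second one keeps the status single-valued per individual; whether that is the right modelling choice is exactly the open question here.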

Original comment by jeliazko...@gmail.com on 5 Apr 2011 at 8:31

GoogleCodeExporter commented 9 years ago

As far as I understand PSO, it is not restricted to documents; see "PSO,
the Publishing Status Ontology, is an ontology written in OWL 2 DL for
characterizing the publication status of a document or other publication entity
..." [1]. That is, every entity (resource) that is published (in a publishing
process) is a publication entity, right?

[1] http://purl.org/spar/pso

Original comment by zazi0...@googlemail.com on 5 Apr 2011 at 10:15

GoogleCodeExporter commented 9 years ago

I have a use case where I want to say that a previously existing dataset has 
been “decommissioned”. The issue discussed here could offer a solution for 
this use case.

Original comment by richard....@gmail.com on 13 Oct 2012 at 9:10

cygri / void

Describing a state of a dataset #102