dcmi / dctap

DC Tabular Application Profile
https://dcmi.github.io/dctap/

How should the TAP terms be formalized as a vocabulary? #6

Closed · kcoyle closed this 3 years ago

kcoyle commented 3 years ago

We probably should formalize the TAP vocabulary for the terms that we have defined. The use of TAP is not limited to RDF, however. What would be most useful?

kcoyle commented 3 years ago

Formalizing the vocabulary as RDF would allow us to add friendlier labels and also to provide language-specific labels.

tombaker commented 3 years ago

@kcoyle I'm of two minds about this. We are already, in effect, identifying DCTAP elements with what were referred to in the early DC years as "tokens" - strings that look like English words and are intended to be used for machine processing. The Dublin Core of 1998 distinguished the "descriptive name" of an element (eg, "Subject and Keywords") from its "formal single-word label", which was intended for use in "syntactic specifications". By 2000 this had morphed into a distinction between unique tokens ("names") and human-readable "labels".

The current draft of the DCTAP Vocabulary does not make this distinction, so in effect, strings like 'propertyID' serve both as human-readable labels and as tokens (CSV column headers, ie, for machine processing). However, it would be easy enough to consider "propertyID" to be the token and add a label, "Property ID".
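The token/label split described above could be expressed very simply in RDF. The sketch below uses a hypothetical namespace and an assumed definition text purely for illustration:

```turtle
# Sketch only: "propertyID" as the machine token (the local name of the IRI)
# with separate human-readable, language-tagged labels. The dctap: namespace
# and the comment wording are hypothetical.
@prefix dctap: <https://example.org/dctap/terms/> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .

dctap:propertyID
    rdfs:label "Property ID"@en ;
    rdfs:label "Identifiant de propriété"@fr ;
    rdfs:comment "The identifier of a property used in a statement template."@en .
```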

We could do this without necessarily using those tokens to coin URIs or describing those URIs in RDF. To describe them in RDF, we would need to take a stand on which elements are properties and which classes. (If the only requirement were to serve as anchors for multilingual labels then we could, in principle, even define them all as SKOS concepts, or even just as instances of RDF Resource, the default for all resources - though we wouldn't really need to use RDF just to do that.)

In deciding how to define the elements in a formal sense, we could take hints from the RDF vocabularies for ShEx (in Turtle) and SHACL. However, both of those specifications are backed by formal semantic specifications, and their RDF vocabularies are meant to be used for expressing ShEx schemas and SHACL graphs in RDF (which for ShEx is an option, as an alternative to JSON, but for SHACL is, I believe, a requirement).

Doing this for DCTAP, however, would imply that DCTAP has its own formal semantics, which is not the case, nor do I think we should consider going there. Or it would at a minimum imply an invitation for people to express their application profiles in RDF using our URIs, which I also think we should avoid (because ShEx and SHACL already provide RDF vocabularies for doing that). Though it remains to be proven (especially as we flesh out the model), DCTAP defines an information carrier that seems generic and simple enough that it should be mappable both to schema languages that do use URIs (such as ShEx and SHACL) and to schema languages that do not use URIs, such as JSON Schema, which uses "keywords" (which are like "tokens").

At this stage, DCTAP simply defines an information carrier - a data structure for holding information that can become actionable for purposes such as validation only if translated into one of the schema languages that do support such actions. If APs in DCTAP/CSV really were to have well-defined translations into multiple schema languages - a feat that remains to be demonstrated - then we will have achieved something quite useful. If those translations were to reflect a coherent interpretation of details such as the meaning of cardinality or of open/closed, so much the better.
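One way to picture the "information carrier" idea is a single DCTAP row translated into one target schema language. The sketch below renders a hypothetical row in SHACL; the shape IRI, property choice, and column values are all illustrative, not taken from any actual profile:

```turtle
# Sketch, assuming a DCTAP row roughly like:
#   shapeID=:BookShape, propertyID=dct:creator, mandatory=TRUE, valueNodeType=IRI
# A translation into SHACL might look like this (all IRIs illustrative):
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix :    <https://example.org/shapes/> .

:BookShape a sh:NodeShape ;
    sh:property [
        sh:path dct:creator ;
        sh:minCount 1 ;        # mandatory=TRUE
        sh:nodeKind sh:IRI     # valueNodeType=IRI
    ] .
```

A translation to ShEx or JSON Schema would carry the same information in a different notation, which is exactly the point: the CSV holds the information, and the schema language makes it actionable.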

Providing labels and translations in multiple languages, on the other hand, could potentially be done by generating documents from a common source using a template language.

I propose that we hold off, for now, on defining DCTAP as an RDF vocabulary until we better understand what requirements such a vocabulary would meet, and focus on other current issues such as "closing" shapes, multiple values in a cell, and fleshing out a basic set of value constraint types.

kcoyle commented 3 years ago

Ooops, too late. I just added a bare bones vocab to the site. I honestly think that folks will want one, because that is the practice today. If you remember, this was a question asked on the call, and there were +1s from the audience for creating a vocab. I'll send out an email with more info.

kcoyle commented 3 years ago

@tombaker Here are my thoughts on the subject:

You are right that it may not make sense to have an RDF vocabulary that is exactly equivalent to the tabular format elements. However, I disagree that we do not have "formal semantics" - the vocabulary document does provide those, albeit not in a form like EBNF (although what is there could be translated to a formal form). They are "mild" compared to something like ShEx, but ShEx has to cover some pretty complex computation, and this is a pretty straightforward, simple vocabulary.

I do see some reasons to create an IRI-based vocabulary, and I can see some options here:

  1. Have a vocabulary that is identical to the TAP, even the label elements. This gives something concrete for folks to work with outside of the CSV file, and ensures interoperability because each element will have a URI-based identifier. I anticipate that the point at which people are working with each other's profiles will be after the CSV has been converted to something actionable, so identification beyond a string will be key.
  2. Create a vocabulary that is based on the profile elements and concept of TAP but that isn't identical to the tabular profile (e.g. the label elements). This could be the best route to extending the vocabulary to more complex needs.
  3. Do 1, but in usage people could do 2. That is, a basic vocabulary that is equivalent to TAP could be extended, and if desired users could make use of the RDF/OWL labels rather than the TAP label columns. Since those are optional, this would still be valid.

We might need another name for this vocabulary since it isn't limited to "tabular" profiles. However, I think that limiting the vocabulary to a single physical design is not as useful as creating a way (whatever that is) to document simple profiles for humans and machines. I think that "tabular" has helped us because it gave us borders for our work, but I also see it as a jumping-off point, via a vocabulary for application profiles.

kcoyle commented 3 years ago

I had another thought about a vocabulary that I think reinforces our work on the TAP and in general supports application profiles: a vocabulary belongs to the owner of the vocabulary, and only the owner can amend it. Others can reuse the vocabulary, but RDF provides no way for downstream users to add to the terms themselves: labels, notes, domains, ranges. To me this is a rather large gap in our data environment, and definitely calls for a way (aka: profiles) to add on to vocabulary terms that one is reusing.

Ideally (IMO) vocabulary definition would be a stack:

Therefore, the labels and notes in our profile are really the only way that a downstream user can add their own displays to the vocabularies they are reusing.

We don't explicitly include domains and ranges, and there is the potential for conflict there that there isn't with display labels. However, the columns that define values cover that aspect to an extent.

tombaker commented 3 years ago

@kcoyle

Others can reuse the vocabulary, but RDF provides no way for downstream users to add to the terms themselves: labels, notes, domains, ranges.

This is true in the sense that people cannot edit the files where terms are defined. In the RDF environment, however, anyone can make assertions about anything, including term URIs that they do not own. For example, users of Protégé who want Dublin Core to be defined in OWL can publish additional assertions about DCMI metadata terms defining them as OWL annotation properties. Whether this is a good practice in general is unclear, but it has met a need for OWL users without causing any perceptible harm to DC users in general. (See also here.)
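The kind of third-party assertion described above can be sketched in a few triples. The French label is invented for illustration; the annotation-property typing follows the Protégé practice mentioned in the text:

```turtle
# Sketch: a downstream user publishes extra statements about a term
# they do not own - typing a DCMI term as an OWL annotation property
# and adding their own language-specific label.
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .

dcterms:creator a owl:AnnotationProperty ;
    rdfs:label "Auteur"@fr .   # a downstream, hypothetical label
```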

You suggest that vocabulary definition could be a stack, but one might respond that since application profiles already provide a way to annotate vocabularies with labels and notes (as you point out), application profiles already function in effect as an additional layer for defining vocabularies - at any rate from the perspective of users of those vocabularies.

kcoyle commented 3 years ago

You suggest that vocabulary definition could be a stack, but one might respond that since application profiles already provide a way to annotate vocabularies with labels and notes (as you point out), application profiles already function in effect as an additional layer for defining vocabularies - at any rate from the perspective of users of those vocabularies.

Yes, that's what I meant: that APs are an additional layer. I'm not sure that they are the only additional layer needed - that is, if they resolve all of the needs beyond a base vocabulary. I'm also not sure if a single layer can cover all of the needs in a useful way. Adding non-conflicting domains and ranges is one possible need that isn't fully covered in the APs as we have defined so far. This is another area where I think we could enumerate what is needed, even if we don't try to cover it all with TAP.

tombaker commented 3 years ago

@kcoyle

Adding non-conflicting domains and ranges is one possible need that isn't fully covered in the APs as we have defined so far.

But domains and ranges are really just about inferencing, as we wrote a while ago in a Library Hi Tech article and Kevin Ford @kefo has more recently nicely explained.

Instead of formal RDFS/OWL domains and ranges, should we not perhaps point people more to Schema.org (ie, 'rangeIncludes'), which is AFAICT how people actually think about this?
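The contrast between the two approaches fits in four triples. The `ex:publisher` property is hypothetical; `rdfs:range` and `schema:rangeIncludes` are real terms:

```turtle
# rdfs:range licenses inferences about values; schema:rangeIncludes
# merely documents expected value types without entailing anything.
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <https://schema.org/> .
@prefix ex:     <https://example.org/terms/> .

# Formal semantics: every value of ex:publisher is inferred to be an ex:Agent.
ex:publisher rdfs:range ex:Agent .

# Documentation only: values are expected, not inferred, to be of these types.
ex:publisher schema:rangeIncludes schema:Organization, schema:Person .
```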

Would "adding non-conflicting domains and ranges", as you see it, be more about enabling inferencing or about annotation (ie, communicating intention)?

kcoyle commented 3 years ago

RDF vocabulary is accepted in theory; final vocabulary namespace is awaiting DCMI decision (compounded by problems with PURLs).