dcmi / dctap

DC Tabular Application Profile
https://dcmi.github.io/dctap/

Do we really need valueShape ? #64

Closed philbarker closed 2 years ago

philbarker commented 2 years ago

I find myself wondering why we don't just have a valueConstraintType of "Shape" and put the shapeID in the valueConstraint column. Meaning: value(s) of the property in instance data must be nodes that conform to the shape identified as the valueConstraint.
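As a rough illustration of the two layouts being compared, here is a minimal Python sketch. The column names are the DCTAP ones; the `:book` / `:author` shapes, the properties, and the `Shape` constraint type are assumptions made up for the example (the `Shape` type is only the proposal in this comment, not part of DCTAP). The helper `shape_references` is hypothetical.

```python
import csv
import io

# Current layout: the shape reference lives in its own valueShape column.
CURRENT = """shapeID,propertyID,valueNodeType,valueConstraint,valueConstraintType,valueShape
:book,dct:creator,IRI,,,:author
:author,foaf:name,literal,,,
"""

# Proposed layout (hypothetical): the shape reference goes in valueConstraint,
# flagged by a valueConstraintType of "Shape".
PROPOSED = """shapeID,propertyID,valueNodeType,valueConstraint,valueConstraintType
:book,dct:creator,IRI,:author,Shape
:author,foaf:name,literal,,
"""


def shape_references(tap_csv):
    """Return (shapeID, propertyID, referenced shape) for rows that point at a shape."""
    refs = []
    for row in csv.DictReader(io.StringIO(tap_csv)):
        if row.get("valueShape"):
            # Current DCTAP: dedicated column.
            refs.append((row["shapeID"], row["propertyID"], row["valueShape"]))
        elif (row.get("valueConstraintType") or "").lower() == "shape":
            # Proposed alternative: shape ID carried in valueConstraint.
            refs.append((row["shapeID"], row["propertyID"], row["valueConstraint"]))
    return refs


print(shape_references(CURRENT))
print(shape_references(PROPOSED))
```

Both tables yield the same shape reference, which is the sense in which the two layouts carry the same information for a row that has a shape and no other value constraint.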

OTOH, I'm not that keen to redo all the work that would be affected by that change.

kcoyle commented 2 years ago

I'm trying to have a coherent thought about this because it links to my unease with the term "constraint". If we look at this from an RDF standpoint, the shape identifier in valueShape is the actual object of the triple and is, by definition, a node. Thus, the valueShape column holds an RDF node. We could say that if the valueNodeType is IRI or BNODE then what is entered in the valueConstraint is a shape. My concern is that folks not thinking in RDF wouldn't find this obvious. (I think it would be more obvious if the column header were simply "value" meaning "single value option for this property".)

Beyond RDF one could call this the only possible value of the property or the value of a name/value pair. So we have single values (shape or a single string), and we have value rules like pattern or picklist or iriStem, where there is more than a single value.

We have said that if there is a single string valueConstraint, no valueConstraintType is needed (although that seems dangerous to me, because it's always dangerous to have the absence of an element have a specific meaning). But a single value seems different to me than a picklist or a pattern. If you are creating metadata and every instance will say: "company = IBM" it feels intuitive to call that the "value" while the use of an IRI stem does feel like a constraint of sorts.

OK, this was mushy but I can't make it clearer at the moment.

tombaker commented 2 years ago

@kcoyle

> I'm trying to have a coherent thought about this because it links to my unease with the term "constraint". If we look at this from an RDF standpoint, the shape identifier in valueShape is the actual object of the triple and is, by definition, a node. Thus, the valueShape column holds an RDF node.

I'm not sure I follow... As I see it, a shape identifier identifies a "shape" - a set of statement constraints, with their property (or predicate) constraints and value constraints. While the shape identifier could be used as an object node of an RDF triple, that is not the function it has in the DCTAP model. (What would such a triple say?)

Rather, the DCTAP model provides constraints on values (in the context of statement constraints) then says, in addition, that the resource that is the value in the statement being constrained is expected (or recommended) to be itself described by another set of triples, the details of which are characterized in the "shape" of the value resource ("valueShape"). This is why it would make no sense to have a "valueShape" together with a literal value; that literal value could not itself be the subject of another description (set of outgoing arcs).

Or am I misunderstanding...?

tombaker commented 2 years ago

@philbarker

> I find myself wondering why we don't just have a valueConstraintType of "Shape" and put the shapeID in the valueConstraint column. Meaning: value(s) of the property in instance data must be nodes that conform to the shape identified as the valueConstraint.

Perhaps because the value being constrained by a value constraint is a single value - whether it be a literal or resource URI, and whether it be taken from a picklist or required to match a pattern. A shape (the "valueShape") does not describe a single node - the object of the statement being constrained. Rather, it constitutes the set of statement constraints that could be matched or validated against a set of triples which has the object of that triple as its subject.

In other words, I would argue they are fundamentally different things.
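The distinction can be sketched in Python under some loud assumptions: the triples, the `ex:` prefix, the `:author` shape, and both helper functions are invented for illustration, and a "shape" is reduced here to a set of required properties, which is far simpler than what ShEx or SHACL would check.

```python
# Instance data as (subject, predicate, object) triples; all identifiers are made up.
TRIPLES = [
    ("ex:book1", "dct:creator", "ex:person7"),
    ("ex:person7", "foaf:name", "Ada Lovelace"),
]

# A toy shape: the set of properties a conforming node must have.
SHAPES = {":author": {"foaf:name"}}


def matches_iri_stem(value, stem):
    """Value constraint: looks only at the single value node itself."""
    return value.startswith(stem)


def conforms_to_shape(node, shape_id, triples, shapes):
    """Shape check: looks at the triples that have the value node as their subject."""
    outgoing = {p for s, p, o in triples if s == node}
    return shapes[shape_id] <= outgoing


value = "ex:person7"  # object of the dct:creator statement
print(matches_iri_stem(value, "ex:"))                       # inspects the node
print(conforms_to_shape(value, ":author", TRIPLES, SHAPES))  # inspects its description
```

Note that the value node (`ex:person7`) and the shape identifier (`:author`) are different things in this sketch: the first appears in the instance data, the second only names the set of constraints.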

tombaker commented 2 years ago

So a shape can be associated with a value - as long as the value is non-literal, i.e., a URI or BNode.

The value URI (or BNode) is different from the shape identifier (URI):

philbarker commented 2 years ago

I was taking a guess that for a lot of people whether the value was a single node or not wouldn't really enter into their thinking. They would think about describing the important characteristics of a resource. So I was thinking less in terms of RDF and the terminology that we have around nodes, and more about how data for a characteristic should be provided. Thinking of a book, if you want to provide data about the publication date, xsd:date provides the rules for how to structure that; if you want to provide data about the author, a shape provides the rules for how to structure that.

So I agree that they are different, but whether that difference is fundamental depends on what basis you start with.

But, what we have works, and if there is not clear consensus that this would simplify the TAP for users then I won't push for it.

tombaker commented 2 years ago

@philbarker I take your point, especially w.r.t. xsd:date, though the analogy seems a bit of a stretch (structure of a string vs structure of a graph). Also IRIStem, one of our small starter set of value constraint types, refers to the value URI, so one would have to choose between providing a value constraint type of IRIStem or of Shape.
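To make the "one would have to choose" point concrete, here is a small Python sketch. It assumes the hypothetical `Shape` constraint type from the proposal; the function name, the row dictionaries, and the ORCID stem are illustrative only.

```python
def fold_shape_into_constraint(row):
    """Rewrite a current-style row into the proposed single-constraint layout.

    Raises ValueError when the row already uses valueConstraint, because one
    valueConstraint/valueConstraintType pair cannot carry two constraints.
    """
    constraint = row.get("valueConstraint", "")
    shape = row.get("valueShape", "")
    if constraint and shape:
        raise ValueError("row has both a value constraint and a value shape")
    folded = {k: v for k, v in row.items() if k != "valueShape"}
    if shape:
        folded["valueConstraint"] = shape
        folded["valueConstraintType"] = "Shape"  # hypothetical type, not in DCTAP
    return folded


# Expressible in both layouts: only a shape is given.
shape_only = {
    "propertyID": "dct:creator",
    "valueConstraint": "",
    "valueConstraintType": "",
    "valueShape": ":author",
}

# Expressible today, but not after folding: an IRI stem AND a shape.
stem_and_shape = {
    "propertyID": "dct:creator",
    "valueConstraint": "https://orcid.org/",
    "valueConstraintType": "IRIStem",
    "valueShape": ":author",
}

print(fold_shape_into_constraint(shape_only))
try:
    fold_shape_into_constraint(stem_and_shape)
except ValueError as err:
    print("cannot fold:", err)
```

With a separate valueShape column, the second row can state both that the value URI starts with a given stem and that the resource it names is described according to a shape.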

kcoyle commented 2 years ago

I agree with @philbarker that folks may not see the string in valueShape as a graph but instead as an identifier that represents a "thing" in their metadata. It's a "form vs function" kind of thing, IMO, and people don't generally relate to the graph but to what the graph represents in their world. To me this is the difference between how one expresses one's metadata in TAP and what the metadata might be in its serialization. There are no actual graphs in TAP - there are just table columns.

It occurs to me that we call this valueShape and not valueShapeID - although the matching token is shapeID. And that makes me wonder if we need the "ID" on propertyID and shapeID or if we should be more clear in connecting valueShape to shapeID.

kcoyle commented 2 years ago

Another wrench in the works: I don't know that anyone refers to the thing we are calling a shape by the term "shape" (other than the ShEx and SHACL folks, but that's for coders). Definitely not the XML folks. We aren't assuming that the folks creating a TAP have an idea of how it will be rendered in code, not even those whose data may eventually be written in RDF. So the TAP needs to represent how people think about what they want to describe with metadata and, conceptually, what the metadata needs to contain, but in a step (or two) preceding the coding of the metadata. I know we have tried to avoid terms like "entity" but "entity" is the thing and "shape" is how you structure the metadata for the thing. We have defined the TAP not in terms of the "real world object" but in terms of the code for that object. I fear that we will lose people if the TAP is too far abstracted from the thinking about things and how to describe them.

Here's how the RDA folks are thinking about their metadata. And here's how OpenAire describes something that we could consider a shape, but to them it is a property with sub-properties.

I don't know how we move from this thinking to shapes, but if nothing else I think we need to define it better so that people who think in terms of properties and sub-properties or the XML "entity with attributes" can make the connection.

kcoyle commented 2 years ago

Keeping valueShape for now, as per Jan 20, 2022 meeting. Issue will be created regarding XML/JSON/etc. and shapes as values.