Shapes and hierarchical metadata schemes

kcoyle commented 2 years ago

First, a shape as we see it in RDF, and how it is rendered in the TAP: [image: valueshape]

kcoyle commented 2 years ago

A schema that uses a hierarchy with nested elements, like XML, does not have an arc with a property that connects the shapes - it instead uses the nesting. To connect them correctly in the same way they are linked in the metadata schema, we may need an additional column that indicates "sub" or "super" relationships. [image: valueshapexml]

philbarker commented 2 years ago

If your top diagram represents instance data, I think the ovals should be «Book» and «Author» (instances of the Book and Author classes), and the shapeIDs should be BookShape and AuthorShape, and we have no way in TAP to say how we know to apply BookShape to instances of Book or AuthorShape to instances of Author. Though I have thought about trying to draw a TAP as if it were paper with bits cut out where the arrows and ovals of an RDF instance diagram should be. To my mind then, the absence of shapes in XML data is no difficulty, because they are not in RDF data either.

To find the commonality between XML, RDF and TAPs it might be worth thinking of something more like a UML diagram. I think this one mostly works for the same example (not exactly, it's one I made earlier): [image: Simple book application profile (1)]. Roughly speaking, the shapes are the boxes (shapeID not shown, but how to relate them to data classes is shown), and statement constraints are either arrows joining the boxes or attributes listed in the lower section of the box.

So whereas shapes in the TAP define rules for statements in RDF (the property to use as the predicate, and possible values for the objects), for XML, shapes define rules for the content and attributes of elements. Where the TAP represents a relationship between two objects, that is represented in XML as nested elements, i.e. the content of one element is other elements.

More fully:

StatementConstraint: defines rules for a part of the content of an XML element or attribute.
Shape: set of rules for element and attribute content.
propertyID: the element or attribute for which the rules are defined: this might be in the form of an XPath to where you would find it.
valueNodeType: content type, simple text or nested element for elements, text for attributes.
valueDataType: the XSD data type for simple text content.
valueConstraint: other rules for values of simple text.
valueShape: the shape to use when a value is nested XML elements
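
The mapping above can be sketched in Python. This is only an illustration under assumptions: the `TAP` dictionary, the `check` function, the "text"/"element" node type values and the sample book XML are all made up for the example and are not part of DCTAP or of any existing tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical TAP rows: shapeID -> list of statement constraints.
# Column names follow the thread; the values are illustrative assumptions.
TAP = {
    "bookShape": [
        {"propertyID": "title", "valueNodeType": "text", "valueShape": None},
        {"propertyID": "author", "valueNodeType": "element", "valueShape": "authorShape"},
    ],
    "authorShape": [
        {"propertyID": "name", "valueNodeType": "text", "valueShape": None},
    ],
}


def check(element, shape_id, tap=TAP, path=""):
    """Return a list of problems found when applying a shape to an element."""
    problems = []
    here = f"{path}/{element.tag}"
    for row in tap[shape_id]:
        # propertyID is read as the name of a child element.
        for child in element.findall(row["propertyID"]):
            has_children = len(child) > 0
            if row["valueNodeType"] == "text" and has_children:
                problems.append(f"{here}/{child.tag}: expected simple text")
            elif row["valueNodeType"] == "element":
                if not has_children:
                    problems.append(f"{here}/{child.tag}: expected nested elements")
                elif row["valueShape"]:
                    # Nesting plays the role of the arc: recurse with the value shape.
                    problems.extend(check(child, row["valueShape"], tap, here))
    return problems


good = ET.fromstring(
    "<book><title>Emma</title><author><name>Jane Austen</name></author></book>"
)
bad = ET.fromstring("<book><title>Emma</title><author>Jane Austen</author></book>")

print(check(good, "bookShape"))
print(check(bad, "bookShape"))
```

The sketch only looks at element content; attributes, valueDataType and valueConstraint would need further checks along the same lines.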

There is a slightly gnarly point around StatementConstraint and Shape being very similar, which results from XML relying on nesting rather than other relationships, so statements become individual parts of the content of elements.

BTW, the diagram above is drawn in Lucidchart, which allows export in CSV; I drew it because I'm pondering whether it would be possible to convert that export into a TAP / SHACL and so on. I guess if we could agree on how to map such a diagram to XML Schema (or Schematron) and how to map it to TAP then we would be winning.

kcoyle commented 2 years ago

@philbarker Thanks. Looking at what I did, above, even I don't agree with it. So here's my second attempt: first, RDF instance data followed by a TAP; then, XML data followed by a TAP. What I intended with this is to show that the authorShape is not a value for any properties/elements in the bookShape, so calling it valueShape is odd. [image: valueshapexml]

philbarker commented 2 years ago

"authorShape is not a value for any properties/elements in the bookShape, so calling it valueShape is odd."

I could live with saying that

authorShape defines the content of the author (or creator) element
the content of an element is its value
the elements are "nodes" in the hierarchy

but perhaps I am missing the point. Also it's a long time since I did anything in XML so I've lost some of the idiom. I'm interested in what @johnhuck thinks.

johnhuck commented 2 years ago

I'm not sure if I've absorbed all of the issues here, but I'm not seeing why you couldn't apply the same approach we use for regular shapes to nested structures:

bookShape | author | authorShape
authorShape | name |

Maybe there's something I don't understand. However, I don't find it odd to call authorShape a valueShape, because a TAP shape is only an informal entity that exists in the context of a TAP model. It isn't an RDF class, although we probably intuitively think of shapes as being like that. So for me, the purpose of the shapeID and valueShape columns are to reference each other, but that's it.

The problem I ran into (and tried to solve) with my other XML modelling attempt was that each element in a nested structure can potentially have a cardinality or other restriction, including wrapper elements, so in my solution I made sure each element could have its own row. That's maybe tangential to this question, but that's the background on my thinking.
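
A minimal sketch of the one-row-per-element idea, with wrapper elements included, might look like this in Python. The `ROWS` list, the `authors` wrapper element, the `cardinality_problems` function and the use of `mandatory` / `repeatable` as cardinality columns are assumptions made for the illustration, not something settled in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows, one per element, wrapper elements included.
# "mandatory" and "repeatable" are used here as assumed cardinality columns.
ROWS = [
    {"shapeID": "bookShape", "propertyID": "authors",
     "mandatory": True, "repeatable": False, "valueShape": "authorsShape"},
    {"shapeID": "authorsShape", "propertyID": "author",
     "mandatory": True, "repeatable": True, "valueShape": "authorShape"},
    {"shapeID": "authorShape", "propertyID": "name",
     "mandatory": True, "repeatable": False, "valueShape": ""},
]


def cardinality_problems(element, shape_id, rows=ROWS):
    """Check each row's cardinality against the children of one element."""
    problems = []
    for row in (r for r in rows if r["shapeID"] == shape_id):
        matches = element.findall(row["propertyID"])
        if row["mandatory"] and not matches:
            problems.append(f"<{element.tag}> is missing <{row['propertyID']}>")
        if not row["repeatable"] and len(matches) > 1:
            problems.append(f"<{element.tag}> repeats <{row['propertyID']}>")
        if row["valueShape"]:
            # Descend into each nested element using the referenced shape.
            for child in matches:
                problems.extend(cardinality_problems(child, row["valueShape"], rows))
    return problems


doc = ET.fromstring(
    "<book><authors>"
    "<author><name>A</name></author>"
    "<author><name>B</name><name>C</name></author>"
    "</authors></book>"
)
print(cardinality_problems(doc, "bookShape"))
```

Because the wrapper `authors` has its own row, its cardinality can differ from that of the `author` elements inside it.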

tombaker commented 2 years ago

@johnhuck

"I don't find it odd to call authorShape a valueShape, because a TAP shape is only an informal entity that exists in the context of a TAP model. It isn't an RDF class, although we probably intuitively think of shapes as being like that. So for me, the purpose of the shapeID and valueShape columns are to reference each other, but that's it."

+1 - especially: "the purpose of the shapeID and valueShape columns are to reference each other, but that's it."

Like @philbarker , I could live with the notion that the content of an element is its value or rather, I defer to XML experts as to whether this is a reasonable thing to say.

The Shape ID really just names a set of statement constraints for the purpose of making that set of statement constraints referenceable as a Value Shape. Full stop.
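
Read that way, processing a TAP is mostly a matter of grouping rows by shapeID and checking that every valueShape points at one of those groups. A rough Python sketch follows, using the two rows suggested earlier in the thread; the CSV header line and the function names `load_shapes` and `dangling_value_shapes` are assumptions added for the example.

```python
import csv
import io

# The two rows from the thread, with assumed column headers added.
TAP_CSV = """shapeID,propertyID,valueShape
bookShape,author,authorShape
authorShape,name,
"""


def load_shapes(text):
    """Group statement constraints under the shape ID that names them."""
    shapes = {}
    for row in csv.DictReader(io.StringIO(text)):
        shapes.setdefault(row["shapeID"], []).append(row)
    return shapes


def dangling_value_shapes(shapes):
    """List valueShape entries that do not match any shapeID."""
    referenced = {
        row["valueShape"]
        for rows in shapes.values()
        for row in rows
        if row["valueShape"]
    }
    return sorted(referenced - set(shapes))


shapes = load_shapes(TAP_CSV)
print(list(shapes))
print(dangling_value_shapes(shapes))
```

Nothing in the sketch depends on whether the data is RDF or XML, which fits the view that shapeID and valueShape only reference each other.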

kcoyle commented 2 years ago

@johnhuck suggests using the XML element "author" as a property. That seems to retain the XML structure. (It also seems obvious! doh!) Presumably cardinality of the shape would be given on the row with the propertyID -- at least, that's how we've done it so far. But is there a better way? Also, would there be other restrictions on the shape that would need to be expressed?

dcmi / dctap

Shapes and hierarchical metadata schemes #65