dcmi / dctap

DC Tabular Application Profile
https://dcmi.github.io/dctap/

What's a shape? #67

Closed kcoyle closed 2 years ago

kcoyle commented 2 years ago

ShEx:

"a shape describes the triples involving nodes in an RDF graph." "ShEx describes RDF graph [RDF11-CONCEPTS] structures as sets of potentially connected Shapes."

SHACL:

"A shapes graph is an RDF graph containing zero or more shapes that is passed into a SHACL validation process so that a data graph can be validated against the shapes."

kcoyle commented 2 years ago

Both ShEx and SHACL define the shape in terms of a view of the RDF graph. This jibes with @tombaker's frequent statement that a shape is a "view over a graph" or over a pool of metadata. It reminds me of retrievals from a database, and in that sense RDF triples are a datastore from which particular views can be elicited. An example with a traditional database would be that you have a DB with 10 tables: one retrieval (view) could be of data from tables 1, 3 and 8; another could be of data from tables 1, 4, 5, and 9. (I can't offhand come up with an example for RDF - probably because I would need to draw it as pictures.)

Can we describe a profile as being a view that can be defined based on stored/available data? If so, the relationship to the metadata framework is not direct - the profile makes use of the stored data but may look different structurally. Therefore (if I can draw this conclusion) the shape in the profile is "invented" in the profile and does not necessarily exist as such in the metadata itself.

I also see this as: the metadata may have a node with arcs a, b, c, d, e. The profile may define a shape with a node that has arcs b, e. Those are not the same shape. Also, the profile may define a shape with a node that has arcs b, e, z. This also is not the same as the node in the metadata. However, the profile cannot violate relationships in the metadata. If the metadata says "a arc b arc c" and these have defined relationships, then the profile cannot declare "b arc c arc a". This gets all bundled up in things like sub/super relationships and transitivity, which we don't want to get near. I prefer to use the database analogy - it's what you can retrieve from the data store. In my view, both ShEx and SHACL work on shapes that are views in the sense I've said here.
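The arcs example above can be sketched in Python. This is an illustration only: the triples, the node name `ex:book1`, and the helper `view` are hypothetical, and a shape is reduced to a bare set of arc names.

```python
# Hypothetical data: each triple is (subject, arc, value).
metadata = {
    ("ex:book1", "a", "1"),
    ("ex:book1", "b", "2"),
    ("ex:book1", "c", "3"),
    ("ex:book1", "d", "4"),
    ("ex:book1", "e", "5"),
}

def view(triples, node, arcs):
    """Return the triples about `node` whose arc is named in `arcs`."""
    return {t for t in triples if t[0] == node and t[1] in arcs}

# A profile shape that selects only arcs b and e.
shape_be = {"b", "e"}
# A profile shape that also names arc z, which the data does not contain.
shape_bez = {"b", "e", "z"}

selected = view(metadata, "ex:book1", shape_be)
arcs_in_data = {arc for (_, arc, _) in metadata}
missing = shape_bez - arcs_in_data
```

Here `selected` holds two of the node's five triples, and `missing` is `{"z"}`: the first shape is a narrower view of the node, and the second names an arc that the stored data does not have, so neither is "the same shape" as the node in the metadata.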

philbarker commented 2 years ago

> Both ShEx and SHACL define the shape in terms of a view of the RDF graph.

The difference is that in SHACL the shapes graph is (for want of a better word) the schema. It's the thing that you validate against. It's explicitly not the data graph, which is the thing that you validate (except when you are using SHACL to validate SHACL). My TAP2SHACL program converts a TAP to a shapes graph. So in SHACL the shapes are in the application profile, not the metadata, which I think is a shame.

Talking about "the RDF graph" when schema definition, profile and data are all RDF graphs will only confuse us.

I would like to talk about shapes in the data in the terms that Tom does (the arcs and their values around a selected node in the data graph). The AP (or schema) defines shapes against which the shapes in the data are measured. Channeling John, it is similar to Euclid defining ideal shapes (triangles, squares, hexagons), so that we can measure the shapes that we find or construct in the real world and decide whether they are hexagons.

I think we need to distinguish between the definition of a shape (which SHACL calls a shape) and the shapes we have in data (which ShEx calls a shape).

kcoyle commented 2 years ago

> The difference is that in SHACL the shapes graph is (for want of a better word) the schema. It's the thing that you validate against.

What I see in the SHACL spec is a difference between a shape and SHACL's shapes graph. Unfortunately SHACL has some pretty tortured definitions that are stated only in terms of SHACL elements, but a shape itself is an RDF node that meets some SHACL-defined conditions in a shapes graph. For all that I do not like how SHACL defines things, I do think that a shapes graph imposes a template (!) over some data in a datastore (which may be just a file). That may fit with your Euclidean metaphor. You define a triangle then look to see if you find a triangle in the data.

That describes the "validation" role of a profile - validating data that exists. Then there is the description/definition/templating role, which presumably SHACL does not address. ShEx claims to be usable both for finding patterns in the data as well as for description of what we have called a metadata framework. (see this slide in Jose Labra's presentation.) I don't feel that has been clearly described, though, and Jose went through that quickly and without much explanation.

p.s. I am warming to using the term "template" for an entire profile or for individual shapes. I don't see it as a term for statement constraints. I think that's because to me a template is more about structure than about content. A triangle template would define the shape, but not, for example, the color. I'll try to get clear about this idea before taking it further.

philbarker commented 2 years ago

I've got quite familiar with SHACL over the last year, written a fair few shapes graphs and used them to validate instance data. The spec gets easier to understand with examples.

Shapes in SHACL are in the shapes graphs. The spec says "Informally, a shape determines how to validate a focus node based on the values of properties and other characteristics of the focus node." where "an RDF term that is validated against a shape using the triples from a data graph is called a focus node."

One difference between shape in TAP and in SHACL is that in SHACL you can write shapes for predicates (all the instances of a property can be measured against a sh:PropertyShape). Thus a sh:PropertyShape is equivalent to what we have as a row in the TAP. SHACL also has NodeShapes, which (among other things) group together PropertyShapes, and so are close to our shape-as-a-collection-of-rows.
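A minimal sketch of the correspondence described above, assuming a TAP held as a list of dicts keyed by the DCTAP column names `shapeID`, `propertyID`, `mandatory` and `repeatable` (with booleans rather than the strings a CSV would carry). The function `tap_to_turtle`, the `ex:` shape name and the sample rows are hypothetical. This is not the TAP2SHACL program mentioned earlier in the thread; it omits prefix declarations and nearly all SHACL features.

```python
from collections import defaultdict

# Hypothetical TAP: one dict per row, using DCTAP column names.
tap_rows = [
    {"shapeID": "ex:BookShape", "propertyID": "dct:title",
     "mandatory": True, "repeatable": False},
    {"shapeID": "ex:BookShape", "propertyID": "dct:creator",
     "mandatory": False, "repeatable": True},
]

def tap_to_turtle(rows):
    """Render each TAP shape as a sh:NodeShape whose rows become property shapes."""
    by_shape = defaultdict(list)
    for row in rows:
        by_shape[row["shapeID"]].append(row)

    blocks = []
    for shape_id, shape_rows in by_shape.items():
        props = []
        for row in shape_rows:
            parts = [f"sh:path {row['propertyID']}"]
            if row["mandatory"]:
                parts.append("sh:minCount 1")
            if not row["repeatable"]:
                parts.append("sh:maxCount 1")
            props.append("    sh:property [ " + " ; ".join(parts) + " ]")
        blocks.append(f"{shape_id} a sh:NodeShape ;\n" + " ;\n".join(props) + " .")
    return "\n\n".join(blocks)

turtle = tap_to_turtle(tap_rows)
print(turtle)
```

Each row ends up as one `sh:property` entry (a property shape), and the rows that share a `shapeID` are grouped under a single `sh:NodeShape`, which is the shape-as-a-collection-of-rows reading.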

I also think I know what Jose Labra was getting at when he talked about finding patterns in data. I think it comes down to the way ShEx is inspired by regular expressions. Just as regex lets you write a "pattern" and then use something like grep to "globally search for a regular expression and print matching lines" so ShEx lets you write a "pattern" and search instance data to find matching "shapes" (i.e. nodes, arcs and values that match the Shape Expression).
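The grep analogy can be made concrete with a toy matcher. It is only a sketch under strong assumptions: triples are plain tuples, the "pattern" is just a set of required arcs, and `find_matching_nodes` is a hypothetical helper. Real ShEx is far richer, with cardinalities, value constraints and references between shapes.

```python
# Hypothetical instance data as (subject, arc, value) tuples.
data = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:phone", "555-0100"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:doc1", "dct:title", "A document"),
]

def find_matching_nodes(triples, required_arcs):
    """Return, sorted, the subjects that have a value for every required arc."""
    arcs_by_node = {}
    for subject, arc, _ in triples:
        arcs_by_node.setdefault(subject, set()).add(arc)
    return sorted(n for n, arcs in arcs_by_node.items() if required_arcs <= arcs)

# The "pattern": a contact must have both a name and a phone.
contact_pattern = {"foaf:name", "foaf:phone"}
matches = find_matching_nodes(data, contact_pattern)
```

With this data `matches` is `["ex:alice"]`: the pattern is written once and the search reports which nodes in the instance data fit it, much as grep reports matching lines.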

Anyway, I think we are close if we can say that shape definitions are in application profiles and shape instances may be in the metadata.

tombaker commented 2 years ago

Just so we are aware...: In Linked Data Shapes, Forms and Footprints, TimBL discusses "shapes" in contrast to "forms" and "footprints":

tombaker commented 2 years ago

@philbarker

> I think it comes down to the way ShEx is inspired by regular expressions. Just as regex lets you write a "pattern" and then use something like grep to "globally search for a regular expression and print matching lines" so ShEx lets you write a "pattern" and search instance data to find matching "shapes" (i.e. nodes, arcs and values that match the Shape Expression).

If shapes "explain to machines what data should look like", perhaps one could say that ShEx tells machines what to look for. A shape expression provides a view in the sense of "lens" or "filter". Camera filters can let through light of one color spectrum and block others. Mosquitoes can see infrared waves that humans perceive as heat. Who is seeing reality as it "really" is - the human? the mosquito? both? neither? (ultimately, an age-old question of philosophy...)

TimBL elaborates on how he sees ShEx as differing from SHACL (as a sort of recursive grep for graphs):

It does more than constrain the shape of the graph: it also defines a canonical ordered traversal of the graph. It is in a way a kind of query language. If applied simply, it (like SPARQL) returns an array of bindings. "Yes, it matches a contact shape, and here are all the names and phone numbers". If it is applied recursively, it returns a tree of bindings. "Yes, it matches the shape, and here are all the contact points; and for each contact point here are the address and phone number; and for each address the number and street"
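As a rough illustration of the "tree of bindings" idea in the quotation, the sketch below walks a graph recursively. Everything in it is assumed for the example: the nested-dict shape format, the `ex:` names and the function `bind` are hypothetical and do not reflect ShEx syntax or any ShEx implementation.

```python
# Hypothetical graph: subject -> arc -> list of values (node names or literals).
graph = {
    "ex:card": {"ex:name": ["Alice"], "ex:contactPoint": ["ex:cp1"]},
    "ex:cp1": {"ex:phone": ["555-0100"], "ex:address": ["ex:addr1"]},
    "ex:addr1": {"ex:number": ["12"], "ex:street": ["Main St"]},
}

# Hypothetical shape: arc -> None for a plain value, or a nested shape.
contact_shape = {
    "ex:name": None,
    "ex:contactPoint": {
        "ex:phone": None,
        "ex:address": {"ex:number": None, "ex:street": None},
    },
}

def bind(node, shape):
    """Return a tree of bindings for `node`, or None if the shape does not match."""
    arcs = graph.get(node)
    if arcs is None:
        return None
    bindings = {}
    for arc, nested in shape.items():
        values = arcs.get(arc)
        if not values:
            return None  # a required arc is missing
        if nested is None:
            bindings[arc] = list(values)
        else:
            children = [bind(value, nested) for value in values]
            if any(child is None for child in children):
                return None  # a referenced node fails the nested shape
            bindings[arc] = children
    return bindings

tree = bind("ex:card", contact_shape)
```

For `ex:card` the result is a nested structure: the name at the top, then one contact point holding its phone and, one level further down, the address with its number and street. A node that lacks a required arc, such as `ex:addr1` tested against the whole contact shape, yields `None`.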

I think of shapes as things in the data as seen or filtered through the lens of a shape expression. So shapes may be "in the data", but not independently of their perception through that lens.

However, we cannot adopt the distinction between "shape" and "shape expression" without implicitly siding with ShEx, so I can live with the notion of "shape" as a lens, filter, template, or view - as something "in the profile (or DCTAP instance)" that bears a deliberately underspecified relation to things "in the data". If we successfully keep it deliberately underspecified, the abstraction can fit ShEx, SHACL, and the various use cases for DC-style application profiles, which range from descriptive to prescriptive and from tolerant to strict.

tombaker commented 2 years ago

@kcoyle

> Can we describe a profile as not being a view that can be defined based on stored/available data?

Did you really mean to say "not"? Because without the "not", I agree. If I'm understanding your point, I also agree that a "profile makes use of the stored data but may look different structurally" and that a "shape in the profile is 'invented' in the profile and does not necessarily exist as such in the metadata itself".

tombaker commented 2 years ago

I just now got around to reading a blog post by Ruben Verborgh which, conveniently, fits what I was saying above:

Finally, this quote is fun:

Since I started thinking about shapes in RDF, I’m seeing many opportunities for them in both old and new problems I’ve come across. True to the statement When all you have is a hammer, every problem starts to look like a nail, they’ve become a new lens on reality. With shapes, the question becomes: can we reshape the problem into a nail? Or more accurately, can we transform it into a shape-shaped problem?

kcoyle commented 2 years ago

Thanks, @tombaker, for the link. I read through Ruben's post, and some of the documents he links to. I'd love to chat about it in more detail, but just a couple of observations:

I have lots of notes and question marks, but I think what we can agree on is that the TAP and the metadata framework may not be exact mirrors of each other, as the TAP can be a view that creates new shapes, as both Tom and I said above. My feeling now is that we should say that metadata frameworks can provide structure, either as graphs or as hierarchies. Then we can say that a profile can define shapes which may mirror the metadata framework structure or that may provide new views over the same metadata framework.

tombaker commented 2 years ago

@kcoyle

> metadata frameworks

Or "metadata languages"? I'm never sure what a framework is, but RDF, OWL, XML Schema, RELAX-NG, ShEx, and SHACL are called languages, while JSON and JSON-LD are called formats. Then there's YAML ("data-serialization language") and HTML and XML ("markup languages"), which I guess one could call "format languages" in contrast to "metadata languages".

kcoyle commented 2 years ago

That should have been "metadata models" - the term we have decided on in the framework.

tombaker commented 2 years ago

@kcoyle

> "metadata models"

And nobody will ask, "You mean, like metadata in French or Japanese?"

tombaker commented 2 years ago

@philbarker @kcoyle @nishad @johnhuck I would submit that we have pretty much resolved this issue in the style guide, which says:

"""A set of statement constraints that applies to a single entity or concept is called a shape. A shape is set of statement constraints for a node in the metadata that meets some criterion or criteria, for example all belonging to a given class or being an object of a given property. Shapes in the profile may be the same as the structures defined in the metadata model, or they may be defined in the profile as a derived view over the metadata."""
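One way to read that definition is as a small data structure. The sketch below is an assumption-laden illustration, not part of the style guide: the class names `StatementConstraint` and `Shape`, the `target_class` criterion, the helper functions and the sample metadata are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class StatementConstraint:
    property_id: str
    mandatory: bool = False

@dataclass
class Shape:
    shape_id: str
    target_class: str  # the criterion: nodes typed with this class
    constraints: list = field(default_factory=list)

# Hypothetical metadata as (subject, property, value) tuples.
metadata = [
    ("ex:b1", "rdf:type", "ex:Book"),
    ("ex:b1", "dct:title", "Moby Dick"),
    ("ex:b2", "rdf:type", "ex:Book"),
    ("ex:p1", "rdf:type", "ex:Person"),
]

def nodes_for(shape, triples):
    """Select the nodes that meet the shape's criterion (membership in a class)."""
    return sorted(s for s, p, o in triples
                  if p == "rdf:type" and o == shape.target_class)

def conforms(node, shape, triples):
    """Check that the node has a value for every mandatory statement constraint."""
    present = {p for s, p, _ in triples if s == node}
    return all(c.property_id in present for c in shape.constraints if c.mandatory)

book_shape = Shape("ex:BookShape", "ex:Book",
                   [StatementConstraint("dct:title", mandatory=True)])
report = {node: conforms(node, book_shape, metadata)
          for node in nodes_for(book_shape, metadata)}
```

With this data `report` is `{"ex:b1": True, "ex:b2": False}`: the shape lives in the profile, the criterion picks out which nodes in the metadata it applies to, and `ex:p1` is never examined because it does not meet the criterion.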

kcoyle commented 2 years ago

Closed via style guide.