dcmi / dcap

DC Tabular Application Profile - supporting materials

Entity - For discussion #60

Closed kcoyle closed 3 years ago

kcoyle commented 4 years ago

The application profile is generally defined as metadata for a description of one or more things. The DCAM and DSP defined the structure of the profile as:

description with 1..n
    statement with 1..1
       value

"Entity" is what we are currently calling the "description" level in our simple template. The entity is represented in a column called "entity_name" which is an identifier for the entity; it could be a simple literal name, or it could be an IRI. There is an "entity_label" column for human display.
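A minimal sketch of the description / statement / value nesting described above, written as Python dataclasses. The class and field names (`Description`, `Statement`, `is_valid`) are illustrative assumptions, not part of DCAM, DSP, or the template; only `entity_name` and `entity_label` come from the column names mentioned in this comment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Statement:
    """One property paired with exactly one value (statement 1..1 value)."""
    property_id: str
    value: str


@dataclass
class Description:
    """The "description" (here: entity) level, holding 1..n statements."""
    entity_name: str   # identifier: a simple literal name or an IRI
    entity_label: str  # label for human display
    statements: List[Statement] = field(default_factory=list)

    def is_valid(self) -> bool:
        # The structure above requires at least one statement per description.
        return len(self.statements) >= 1


# Hypothetical example entity with a single statement.
book = Description("book", "Book", [Statement("dct:title", "literal")])
```

Here `book.is_valid()` holds because the description carries one statement; a description with an empty statement list would not satisfy the 1..n requirement.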

kcoyle commented 4 years ago

Some terms being used in the application profile area for this same (or analogous) concept:

  • DSP: description Resource
  • BIBFRAME: Resource template
  • Sinopia: resource
  • YAMA: descriptions
  • IMS global: element
  • ODRL: class
  • DCAT: class
  • Wikidata: item

(edit and add others that you know)

nishad commented 4 years ago

Wikidata concepts are items, statements, and properties.
Items are 'things', and statements belong to items. Statements consist of property-value pairs.

Also for #59.

tombaker commented 4 years ago

DSP: description

DCAM and DSP move on different levels.

The distinction in both DCAM and DSP between "literal" and "non-literal" served as a hook for triggering the use of one or the other of the two template structures. This distinction is irrelevant for our purposes here.

DSP - a language of "templates" and "constraints"

DSP model diagram

DCAM - abstract model for the above

DCAM model diagram

tombaker commented 4 years ago

As I currently see it (see gist), we need to make a distinction similar to that between DCAM and DSP:

philbarker commented 4 years ago

My key ask is that the definition makes clear the 'meta'-ness of what is entered under this heading:

Ideally the name would also capture this, however I realise that is hard to do in one or two words.

It also looks like we need clarity on whether we are talking about a section of the AP (a template) or the naming of the kind of thing that part of the AP relates to (Entity/Resource). They're closely related because both are defined in terms of common properties or characteristics, e.g. in DSP:

Description templates , which contain the statement templates that apply to a single kind of description as well as constraints on the described resource

So these templates are like the section of the csv for describing some kind of entity (like the Book section in the bookshelf example). Presumably somewhere in the template you may say that the template is suitable for describing things that are a schema:Audiobook or a frbr:Example or whatever.

philbarker commented 4 years ago

Question: what does "entity" relate to in the csv?

Is it something like a DSP description template ("which contains statement templates"), in which case it is a "block" of the csv eg:

Entity_name   |   Property     |      Value
 book         |   dct:creator  | entity:person
              |   dct:title    | 
              |   dct:date     |  xsd:2017

Or is it a type of thing being described & constrained by a statement in the application profile (as opposed to a property). Nishad's suggestion looks a bit more like that:

Type   |  ID    |  Property  |  Label    |  Mandatory    |  Repeatable    |  ValueType    |  Value    |  Annotation
entity |  book  |  sco:Book  |  Book

That row from Nishad's suggestion looks like the "constraints on the described resource" from the DSP Description template.
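The "block" reading of the first table can be sketched in Python as follows: rows with an empty Entity_name cell are treated as continuing the block opened by the last named row. The pipe-delimited text and the function name `group_into_blocks` are assumptions made for illustration; the thread does not fix a concrete file syntax.

```python
import csv
import io

# Hypothetical pipe-delimited rendering of the first example table.
TABLE = """Entity_name|Property|Value
book|dct:creator|entity:person
|dct:title|
|dct:date|xsd:2017
"""


def group_into_blocks(text):
    """Group rows into blocks; a blank Entity_name continues the previous block."""
    blocks = {}
    current = None
    for row in csv.DictReader(io.StringIO(text), delimiter="|"):
        name = row["Entity_name"].strip()
        if name:
            # A named row opens a new block.
            current = name
            blocks[current] = []
        if current is None:
            raise ValueError("statement row appears before any entity name")
        blocks[current].append((row["Property"].strip(), row["Value"].strip()))
    return blocks


blocks = group_into_blocks(TABLE)
```

Under this reading the three rows form a single block keyed by `book`, which is the sense in which the entity is a set of statements rather than a single cell.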

tombaker commented 4 years ago

@philbarker You made the point a while ago that the name should reflect the "meta-ness" - I think you suggested "Archetype" and reminded us that naming is hard - and I agree. My current favorite for that term is "Entity shape", for several reasons: 1) "entity" implies "instance", but this is more like a "class or grouping of instances", only 2) this is not fundamentally a "semantic" construct (about "reality", like an OWL ontology) but more a "syntactic" construct (about "things found in data", like a ShEx schema), and 3) it is the equivalent of what ShEx, Wikidata, and SHACL call a "shape", so the term is already somewhat known in this space (unlike "archetype"), but 4) "entity" is more widely understood, in an intuitive sense, than the newer term "shape". Hence: "entity shape".

As I see it, "entity shape" is the same as DSP's "description template", if we wanted to adopt that terminology. (Alternatively, we could just settle on our column headings for this model, then provide "translations" of that model into DSP, library terminology, etc.)

tombaker commented 4 years ago

@philbarker To me, there is no doubt that we are indeed talking about a "block" of the csv (i.e., a set of statements, regardless of whether "statement" is defined in terms of an attribute/value pair or as a "triple" in the RDF style).

I find the suggestion cited above, from @nishad, confusing because sco:Book looks like the name of a class but is found in a column labeled "Property". In my understanding, Nishad is proposing to interpret the meaning of things in the "property" column based on the type of the row in which the property cell is found - i.e., BASE, prefix, entity, or statement as per the gist. In general, I think we should avoid making the interpretation of cells in one column dependent on values found on the same row in another column (e.g., type as in "row type"). The harder it is to explain, the more brittle it will be.

philbarker commented 4 years ago

Thanks @tombaker that's exactly what had not been defined before.

OK, I think "Entity shape" is a good name for a block (or set of statements relating to the description of some kind of thing)

We do need a way of saying that a block relates to the kind of thing that is called a Book in schema.org, though whichever column that information is in should not be headed 'Property'.

To me there is no difference in meaning in saying that:

kcoyle commented 4 years ago

My problem with "entity shape" is that the column itself is for just the entity, and the "shape" is expressed only by the entire row. So there is a difference in my mind between what the row represents (for which "shape" makes sense) and what one needs to express in the individual columns. Users of the template will be completing the profile by putting values in the cells, and the sum total of a row of those values (which represent some thing and the property/value pair that is used to describe it) is a shape.

kcoyle commented 4 years ago

@philbarker's comment above reminds me that some examples (possibly Nishad's) have placed the "entity" on a row by itself, followed by rows that carry the property/value pairs. We seem to have ruled out that model, but need to make that decision overt.

tombaker commented 4 years ago

To me, entity shape is the abstract notion, and the spreadsheet does not actually have a column for entity shape. Rather, the spreadsheet has columns for entity shape id and entity shape name.

The name of the entity, as I see it, bears merely a coincidental relationship to the name of the class of things described by the entity shape.

The data can say that the described resource is an instance (rdf:type) of sdo:Book, but the entity shape id need not be book; it might just as well be "asdfgh".


tombaker commented 4 years ago

Whether or not something is declared to be an instance of a class is IMO irrelevant to the entity shape.

To me, the challenge is to convey the idea that the entity shape is not a class but merely the label for a syntactic construct.

If there is no requirement to declare explicitly, in the data, that the described resource is an instance of suchandsuch a class, it's fine to simply omit declaring an rdf:type (or dc:type or whatever).

An application profile needs a handle for describing a particular chunk of the data, the entity shape, but that shape does not necessarily need to have an "rdf:type" statement.


philbarker commented 4 years ago

@tombaker yes, I'm getting there. I assume (or whatever) for simple non-RDF XML vocabularies. I'd need to see an example of how you would work with something like LOM in XML with all the nested values.

Agree that if what we are talking about is like a shape or description template then it's not a column. Also giving it a name/id like foaf:Person makes no sense.

I'm not entirely sure how it relates to anything called an entity in other specs.

Is there a reason why we're not just following DSP terminology?

tombaker commented 4 years ago

@philbarker On today's call, some objected to the DSP use of "template" because it implies something that gets filled in with information (in the manner of a Web form, as I picture it) - the cookie-cutter. I'm not sure I entirely understand the objection - will need to ponder...

tombaker commented 4 years ago

Since it came up on today's call, I thought it would be useful to look at the DCAT model.

DCAT model

tombaker commented 4 years ago

In my reading of the model, the classes really are RDFS classes (dcat:Dataset, foaf:Agent...) and are defined as such in the RDF schema for DCAT. In the diagram, those classes are associated with a set of properties, in the manner of a shape (in the ShEx sense) or of Schema.org: compare sdo:Dataset, which is shown along with properties typically used to describe a dataset. (I suppose these could be described, in a hand-wavy sort of way, as O-O classes, and I have no doubt that a lot of DCAT users think in terms of O-O, but I do not find in the specification any explicit reference to O-O as a programming paradigm.)

The example data snippets they provide, however, such as:

<ds913>
  a dcat:Dataset ;
  dct:accrualPeriodicity <http://purl.org/cld/freq/daily> ;
  dcat:temporalResolution "PT15M"^^xsd:duration ; .

confirm that the description of a dataset, in DCAT, is intended to explicitly include an rdf:type dcat:Dataset statement.

In my reading, therefore, DCAT really is using "class" in the RDFS sense, only it is associating those classes with sets of possible or expected statements in the style of Schema.org. In terms of our simple CSV model, as I picture it, the profile for the description of a dataset, using DCAT, might have an "entity shape" (or "description template") labeled "Dataset". However, the identifier for the shape or template would not serve the same purpose as the rdf:type dcat:Dataset statement. The former is not something found in DCAT data, but the latter is.

Bottom line: I do not think that "class" is being used in DCAT as the equivalent of what we have been calling "entities" (or "entity shapes" or whatever), as suggested above.

tombaker commented 4 years ago

It looks to me like ODRL has the same sort of model: OWL classes, declared as OWL classes and invoked in instance data in JSON-LD using @type, e.g., "@type": "Offer", where "Offer" resolves to "http://www.w3.org/ns/odrl/2/Offer" via the JSON-LD context, "@context": "http://www.w3.org/ns/odrl.jsonld".

For my taste, the language of the ODRL spec is a bit casual in the way it implies that OWL classes somehow have RDF properties (e.g., "An AssetCollection class has the following properties..."). However, there is nothing in the ontology that somehow extends OWL to associate properties with classes in a formal sense. The @type attributes translate straightforwardly into rdf:type statements -- e.g., <http://example.com/policy:3333> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/odrl/2/Offer>, one of the triples that result from pasting example 4 into an RDF translator.
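The translation from @type to an rdf:type triple can be illustrated with a deliberately simplified Python sketch. This is not a JSON-LD processor: the one-entry `TERMS` mapping stands in for the real ODRL context (an assumption), the input node is a hypothetical fragment rather than the actual example 4, and `type_triples` is an invented helper name.

```python
import json

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Assumption: a one-term stand-in for the ODRL JSON-LD context, which in
# practice is retrieved from http://www.w3.org/ns/odrl.jsonld.
TERMS = {"Offer": "http://www.w3.org/ns/odrl/2/Offer"}

# Hypothetical, heavily reduced policy node.
doc = json.loads('{"@id": "http://example.com/policy:3333", "@type": "Offer"}')


def type_triples(node, terms):
    """Return (subject, rdf:type, class IRI) triples for one flat node."""
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    # Expand each short term to its IRI; unknown terms pass through unchanged.
    return [(node["@id"], RDF_TYPE, terms.get(t, t)) for t in types]


triples = type_triples(doc, TERMS)
```

The single resulting triple names the class as the object of rdf:type, which is a statement about the policy instance and says nothing about properties "belonging to" the class.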

In other words, the language of the ODRL spec may not sound like orthodox OWL, but under the hood it actually is OWL. AFAICT its language is merely following the Schema.org and DCAT style of associating a class with a set of properties in a sort of conceptual shorthand.

One would, at any rate, never use an ODRL class URI as the "entity ID" (or "entity shape ID" or "archetype ID" or whatever) in our CSV model because odrl:Offer would be meaningless as a value of type "entity" in the model we have been discussing. odrl:Offer identifies an OWL class, not a set of statements, and nobody has actually extended OWL in a way that would allow it to identify both.

Rather than tying ourselves into knots devising a CSV model that somehow tries to accommodate this conceptual shorthand, or presentation style (as I think of it), it would IMO be easier to recognize that it already fits our CSV model. We could add some explanatory text to the CSV model specification to the effect that "entity shape ID" (or whatever we call it) has a different function from a type statement (e.g., rdf:type odrl:Offer) but that, for the sake of readability, it may often make sense to label the entity shape with the same word used to label the OWL class -- i.e., entity shape ID offer, which is the same word string as the label of the class (odrl:Offer rdfs:label "Offer"@en) as declared in the ODRL schema.

kcoyle commented 4 years ago

@tombaker I don't know why it matters to our design what people use as their entity as long as it works for them. If the DCAT-AP documentation, which looks like this:

Screenshot 2020-05-07 06 32 54

is coded in our template as:

Entity_id      | Entity_label | Property        | Property_Label | Mandatory | Repeatable | Value_type   | Value        | Annotation
dcat:Catalogue | Catalogue    | dcat:dataset    | dataset        | y         | y          | rdfs:Class   | dcat:Dataset | This property links the Catalogue...
dcat:Catalogue | Catalogue    | dct:description | description    | y         | y          | rdfs:Literal |              | This property contains a free text account...

that seems fine to me, again assuming that this meets their need. In any case, we can't prevent people from using our template in this (or any other) way. Others may make a different decision. In fact, I'm trying to imagine how this would be coded without using dcat:Catalogue as the entity but I haven't wrapped my head around that. If you have time I would like to see how you would do that. This solution seems pretty straight-forward.

(Note: I wasn't sure what to do about range, but this just points out to me that we should try coding up as many APs as we can find as a way to test out our model.)

philbarker commented 4 years ago

It strikes me that DCAT have created some metadata terms as well as profiled some others; the rdfs class dcat:Catalogue being one of the terms they created. For an application profile to be purely about mixing and matching from existing namespaces(/vocabularies/ontologies), I guess you would say they first create their terms in their namespace and then mix those terms in with others in an application profile. In that view dcat:Catalogue is not defined by the application profile, and so consequently the definition of dcat:Catalogue is not a concern of the application profile.

So let me amend Tom's bottom line: I do not think that "class" is being used in the Application Profile part of DCAT as the equivalent of what we have been calling "entities" (or "entity shapes" or whatever) [assuming we all agree that we are talking about sets of rows in the csv]

philbarker commented 4 years ago

IMO: the word entity for a part of the application profile causes more misunderstanding than clarity. Everyone comes thinking they know what it means, but they don't. [edit: I mean don't all agree on what it means in this context!]

An alternative

Description Shape: defines the permissible terms drawn from existing vocabularies that can be used to describe a specific kind of resource in metadata.

Notes:

[* I would still like to check how this works with a complex XML example, e.g. LOM, SCORM]

tombaker commented 4 years ago

@kcoyle The snippet above (mandatory properties "for Catalogue") looks to me like a vocabulary definition for defining properties with their ranges, with the addition of cardinalities for when they are used to describe instances of dcat:Catalogue. So (omitting Mandatory, Repeatable, and Annotation to make it fit on one line):

Entity_shape_id | Entity_shape_label | Prop_ID      | P_Label | Value_type   | Value
catalogue       | Catalogue          | rdf:type     | Type    | URI          | dcat:Catalogue
                |                    | dcat:dataset | Dataset | entity_shape | @dataset
dataset         | Dataset            | rdf:type     | Type    | URI          | dcat:Dataset

In English: An instance of the class Catalog has at least one thing that is an instance of the class Dataset.

Both the Catalogue and the Dataset are described in their own descriptions, with different sets of properties. The string dataset (which is labeled with the display string "Dataset") is performing the function of connecting the object of the dcat:dataset statement with the subject of the statement that says rdf:type dcat:Dataset.

To say that the class dcat:Catalogue has a property dcat:dataset makes no sense if the intended meaning is that an instance of the class dcat:Catalogue has a dcat:dataset relationship to an instance of the class dcat:Dataset. This is the intended meaning regardless of what the conceptual shorthand seems to say. Note that the OWL ontology where dcat:Catalogue is defined actually says (shortened):

dcat:Dataset
  a rdfs:Class ;
  a owl:Class ;
  rdfs:comment "A collection of data, published or curated by a single source, and available for access or download in one or more represenations."@en ;
  rdfs:isDefinedBy <http://www.w3.org/TR/vocab-dcat/> ;
  rdfs:subClassOf dcat:Resource ;

It would make no sense to say, in the definition of dcat:Catalogue that in addition to rdfs:comment and rdfs:subClassOf, the class also has the property dcat:dataset.

It is IMO unhelpful when specifications slip into saying that "Class X has properties P, Q, and R" as a sort of conceptual shorthand for "Instances of class X are described with properties P, Q, and R" - where P, Q, and R are properties of the instance (some dataset or catalogue) rather than the class Dataset or Catalogue.

Bottom line: We need to look at what the instance data, on the one hand, and the vocabulary declarations, on the other, are actually saying. And neither one is saying that the class dcat:Catalogue has a property dcat:dataset, because that would make no sense and is not really what they are trying to say. The CSV snippet above is a more accurate translation of what they are actually saying.
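To make the role of the shape identifier concrete, here is a hedged Python sketch that loads the three rows of the CSV snippet and checks that every `@name` value points at an existing entity shape. The tuple layout and the helper names `build_shapes` and `dangling_references` are assumptions for illustration only.

```python
# Rows: (shape_id, shape_label, prop_id, prop_label, value_type, value)
ROWS = [
    ("catalogue", "Catalogue", "rdf:type", "Type", "URI", "dcat:Catalogue"),
    ("", "", "dcat:dataset", "Dataset", "entity_shape", "@dataset"),
    ("dataset", "Dataset", "rdf:type", "Type", "URI", "dcat:Dataset"),
]


def build_shapes(rows):
    """Collect statements per shape; assumes the first row names a shape."""
    shapes = {}
    current = None
    for shape_id, _label, prop, _plabel, vtype, value in rows:
        if shape_id:
            current = shape_id
            shapes[current] = []
        shapes[current].append((prop, vtype, value))
    return shapes


def dangling_references(shapes):
    """List '@name' values that do not match any entity shape id."""
    missing = []
    for statements in shapes.values():
        for _prop, vtype, value in statements:
            if vtype == "entity_shape" and value.lstrip("@") not in shapes:
                missing.append(value)
    return missing


shapes = build_shapes(ROWS)
```

Renaming `dataset` and `@dataset` to `xyz` and `@xyz` would leave the check passing, while the rdf:type value dcat:Dataset would stay untouched: the shape identifier is a local handle, and the class URI is what the instance data is expected to contain.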

tombaker commented 4 years ago

@philbarker

IMO: the word entity for a part of the application profile causes more misunderstanding than clarity. Everyone comes thinking they know what it means, but they don't.

I completely agree.

An alternative

Description Shape: defines the permissible terms drawn from existing vocabularies that can be used to describe a specific kind of resource in metadata.

I like "Description Shape"! In the definition, I might want to say "defines the set of statements" (or words to that effect), because I think the definition should convey the idea that the shape encompasses not just properties, but properties and (at least implicitly) values. But I think we are on the same page :-)

kcoyle commented 4 years ago

@philbarker re: your idea. Can you create a small example? And note that at the meeting this week there was some pushback on the use of "shape" which seems not to have entered into general use (and may not make sense to non-RDF folks). Can it not be just "description"? Or "description definition"? Something that is more immediate to folks who don't see their data as shapes.

@tombaker Although you say that it would make no sense to create such a profile table, I am more interested in how you think we can explain and enforce your choice. That is what I'm concerned about here, that our tabular format is intended to be quite simple and that people can really do with it what they want, and I'm pretty sure that some folks will do as I showed because it fits with their thinking (and they might be wrong, but that's not our business). So we need to look at what makes sense to potential users, not just to this small group. For that, I'm trying to find APs that exist today to see how people are using them. We won't be able to enforce any particular formalisms, IMO, so we need to try to be as intuitive as possible. What that "intuition" turns out to be is not something I have a stake in -- I just want it to make sense to the largest number of people so that they can use it.

I may ask Makx or some other DCAT users to see how they think they might make use of the template. I'm also contacting Bibframe users. We have the Wikidata example. Let's gather what we can, and also talk to people. I think that could help.

tombaker commented 4 years ago

Entity_shape_id | Entity_shape_label | Prop_ID      | P_Label | Value_type   | Value
catalogue       | Catalogue          | rdf:type     | Type    | URI          | dcat:Catalogue
                |                    | dcat:dataset | Dataset | entity_shape | @dataset
dataset         | Dataset            | rdf:type     | Type    | URI          | dcat:Dataset

Note that this CSV repeats the string "dataset" six times:

  • As a (prefixed) URI for a class: dcat:Dataset (unavoidable, because the DCAT model requires the class of an item to be stated explicitly in the data)
  • As a (prefixed) URI for a property: dcat:dataset (unavoidable, because the DCAT model requires an instance of Catalogue to point to an instance of Dataset by using this property)
  • As the label for that property: Dataset (there is a strong convention to display the property labeled as it is in the RDF schema https://www.w3.org/ns/dcat2.ttl where it is defined)
  • As an entity shape ID: dataset (by choice, because the entity shape could be called xyz or satellite_dataset)
  • As a reference to the entity shape ID: @dataset (as a value) (by choice, because it could be @xyz or @satellite_dataset)
  • As the label for the entity shape ID: Dataset (by choice, because it could be labeled Satellite Dataset; it could also be labeled Xyz but that would not be very user-friendly...)

tombaker commented 4 years ago

@kcoyle How would you translate your CSV example above into English? Perhaps I'm missing your intention.

Description (unqualified with something like shape) is problematic because we're defining the description of a description of a dataset, not the description of an actual dataset. We need to convey what Phil called, I think, the meta-ness.

kcoyle commented 4 years ago

@philbarker

In that view dcat:Catalogue is not defined by the application profile, and so consequently the definition of dcat:Catalogue is not a concern of the application profile.

It is my error that I spoke of DCAT when I should have called it DCAT-AP. DCAT is the main model and vocabulary, but DCAT-AP is the implementation of the vocabulary. The DCAT model, which Tom referenced, does define some DCAT-specific terms; the -AP (and there are a number of AP's, listed here) reuses the vocabulary, and therefore it fits (I think) our definition of AP.

Also note that most of the APs there are APs of the DCAT-AP. That is how that community decided to do it. So there are complications regarding what it means to be an AP of an AP, which I hope we don't have to get into.

kcoyle commented 4 years ago

@tombaker

How would you translate your CSV example above into English? Perhaps I'm missing your intention.

(Note: I added a new row with a literal value, so with a clearer value type.) I would say that I have a thing that is a dcat:Catalogue, and that thing has certain descriptive properties that are required by my application profile to describe it. One of these properties is dcat:dataset (*) and it has a value type of dcat:Dataset (note questions about how to express ranges). That property also has an instruction (which I partially pasted). Another of those values is dct:description, which takes an unspecified literal value, and has some instructions associated with it.

I think the next step is to create some instance data and see what does and doesn't fit. I am not ignoring the objections to using an RDF class as an entity "anchor" but would like to see how it plays out in practice. What would the output of this look like when it is transformed from its tabular format? What is the intended instance data supposed to look like? I'll try to get some data examples.

Another interesting step would be to see a ShEx or SHACL implementation to see how they validate their data and how that might affect its implementation. I don't think that I have encountered code for the AP, unless they assume closed-world OWL for constraints. I'll poke around to try to answer that. I think this is a good way to test out our thinking against real cases, and I would like to find more!

As I said, I wasn't sure what to do with the range of dcat:dataset - whether that is a value type or a value. This is similar to the question of what to do with sx:IRIstem - is it the value type IRIstem? or is the value type IRI and the value is IRIstem + the related stems. I guess we'll hash that out when we look at value types.

kcoyle commented 4 years ago

Here's the DCAT-AP SHACL, reduced to the two properties I highlighted above:

dcat:Catalog
  rdf:type sh:NodeShape ;
  sh:name "Catalog"@en ;
  sh:property [
      sh:path dct:description ;
      sh:minCount 1 ;
      sh:nodeKind sh:Literal ;
      sh:severity sh:Violation ;
    ] ;
  sh:property [
      sh:path dcat:dataset ;
      sh:class dcat:Dataset ;
      sh:minCount 1 ;
      sh:severity sh:Violation ;
    ] ;
.

If I'm reading this right, the class "dcat:Catalog" defines the shape, and each of the properties follows. That looks quite a bit like the table at comment, although presumably it could also be developed from the comments here.

tombaker commented 4 years ago

@kcoyle The SHACL example above uses the URI of a class, dcat:Catalog, which is declared in an OWL ontology to be an instance of an OWL class, as the URI of a shape.

However, I find no examples in the SHACL spec of any OWL class also being declared as an instance of a SHACL shape (as in dcat:Catalog rdf:type sh:NodeShape above), and I do not understand, in modeling terms, what that would mean.

I do see examples in the SHACL spec of shapes with (prefixed) URIs, but these shapes are very clearly distinct from the class used to describe instances in the context of the shape. From the spec:

ex:PersonShape
    a sh:NodeShape ;
    sh:targetClass ex:Person .

which says: ex:PersonShape rdf:type sh:NodeShape and ex:PersonShape sh:targetClass ex:Person.

This shape (ex:PersonShape) matches data that says:

ex:Alice a ex:Person .
ex:Bob a ex:Person .
ex:NewYork a ex:Place .

The purpose of the property sh:targetClass is to relate instances of a shape to a class in the following way: "Target class declarations specify that all instances of some class must be validated with some shape" (from Validating RDF Data).

Bottom line: ex:PersonShape is not the same as ex:Person. Rather, the shape has a specific relationship to the class: ex:PersonShape sh:targetClass ex:Person.
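As a rough illustration of that relationship, the following Python sketch selects the nodes a shape would be applied to by following its target class. It works on plain tuples rather than a real RDF library, ignores subclass reasoning, and the names `SHAPES` and `focus_nodes` are assumptions; only the identifiers in the data come from the example above.

```python
# Instance data from the example, as (subject, predicate, object) tuples.
TRIPLES = [
    ("ex:Alice", "rdf:type", "ex:Person"),
    ("ex:Bob", "rdf:type", "ex:Person"),
    ("ex:NewYork", "rdf:type", "ex:Place"),
]

# The shape is a separate resource that points at the class it targets.
SHAPES = {"ex:PersonShape": {"targetClass": "ex:Person"}}


def focus_nodes(shape_id, shapes, triples):
    """Return the subjects typed with the shape's target class."""
    target = shapes[shape_id]["targetClass"]
    return sorted(s for s, p, o in triples if p == "rdf:type" and o == target)


selected = focus_nodes("ex:PersonShape", SHAPES, TRIPLES)
```

`selected` contains ex:Alice and ex:Bob but not ex:NewYork. The shape identifier never appears in the instance data; only the class does.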

tombaker commented 4 years ago

@philbarker

It strikes me that DCAT have created some metadata terms as well as profiled some others; the rdfs class dcat:Catalogue being one of the terms they created. For an application profile to be purely about mixing and matching from existing namespaces(/vocabularies/ontologies), I guess you would say they first create their terms in their namespace and then mix those terms in with others in an application profile. In that view dcat:Catalogue is not defined by the application profile, and so consequently the definition of dcat:Catalogue is not a concern of the application profile.

Yes - well put!

A CSV template for declaring new properties and classes would look a lot different from the CSV template we are discussing for application profiles. Trying to define both namespaces and profiles in a single template would be quite confusing! The principle that "namespaces declare" and "profiles reuse" creates a helpful separation of concerns - even if some specifications like DCAT blur the line for the sake of having a one-stop specification (though under the hood, DCAT does in fact separate out the definition of new DCAT terms into an RDF schema).

kcoyle commented 4 years ago

@tombaker Are you saying that the DCAT-AP SHACL is wrong in some way? I can ask Makx if it's being used and is working.

Here's another question: Does the identifier for the "entity" that is in the profile have to also be in the instance data? In other words, what connects the profile's "entity" to the instance data?

And another (sorry! woke up thinking about this): Does this example:

catalogue Catalogue rdf:type Type URI dcat:Catalogue

mean that the instance data must have ''X rdf:type dcat:Catalogue''? In other words, is rdf:type just another property that the profile is defining?

tombaker commented 4 years ago

@kcoyle

mean that the instance data must have ''X rdf:type dcat:Catalogue''? In other words, is rdf:type just another property that the profile is defining?

yes, exactly.

Here's another question: Does the identifier for the "entity" that is in the profile have to also be in the instance data? In other words, what connects the profile's "entity" to the instance data?

ShEx would use a "shape map" to associate RDF nodes with ShEx shapes. There are many ways to make such associations automatically (e.g., by query), though it can also be done by enumeration. One popular criterion is to say that a given shape matches things that are instances of a given class - but it is just one possible criterion.

I do not think it would make sense to use the URI of a class as the URI of a shape, unless perhaps in the context of some closed system ("in the privacy of one's own database"). I do not know if classes and node shapes are disjoint in a formal sense, but saying both dcat:Catalogue rdf:type sh:NodeShape and dcat:Catalogue rdf:type rdfs:Class would not seem like a helpful thing to do in the context of open data.

philbarker commented 4 years ago

Trying to catch up with the conversation, apologies if I'm talking at cross purposes to where the conversation has gone.

@kcoyle said

@philbarker re: your idea. Can you create a small example? And note that at the meeting this week there was some pushback on the use of "shape" which seems not to have entered into general use (and may not make sense to non-RDF folks). Can it not be just "description"? Or "description definition"? Something that is more immediate to folks who don't see their data as shapes.

re: example, I think I only defined things that are in existing examples, but I will work on linking those definitions to an example.

I agree with @tombaker that 'Description' is problematic because a description is the end product. I like "shape" because it has not entered into general use. Any word in general use will come with baggage from that use that leads to people thinking they understand it from some other context. Any term we use has to be defined in such a way that it makes sense in the context of what we're doing.

@tombaker said of "Description Shape":

In the definition, I might want to say "defines the set of statements" because I think the definition should convey the idea that the shape encompasses not just properties, but properties and (at least implicitly) values

-- Agreed.

And I think we're all on the same page about DCAT AP and DCAT defined terms.

@kcoyle replied to @tombaker :

How would you translate your CSV example above into English? Perhaps I'm missing your intention.

I would say that I have a thing that is a dcat:Catalogue, and that thing has certain descriptive properties that are required by my application profile to describe it.

I don't think that is clear. If you have a thing that is a dcat:Catalogue you have a catalogue of some type. That's not something that is in the Application Profile, that is something that is described by metadata that conforms to the AP. It is not something with descriptive properties, it is something described by statements that use those properties. That's why I think Entity is the wrong term for what we have in the AP.

Wouldn't it be better to say something like: "When describing a Catalogue, assert that it matches the definition of being a dcat:Catalogue, and use the following properties and value spaces..."

philbarker commented 4 years ago

Here's the example. I may have overthought it.

It seemed that IDs for statements about properties weren't used anywhere. Ignore that the namespaces use the same columns as the other rows if that bothers you--I think this is still up for discussion but for now it's just more readable that way. Looking at it now, I think maybe it would be better to have separate columns for "shape ID" and "namespace prefix" and keep the "AP ID" as an identifier for the AP Statement (if we need one).

An Application Profile specifies which terms from existing vocabularies can be used in a metadata description, and may specify constraints on how those terms can be used.

An Application Profile csv is a csv file that defines an application profile.

Example Application Profile csv, for a book lending social network, to specify metadata showing which books are owned by whom, and who is known to the owners of those books.

| AP ID | External ID | AP Label | value type | value space |
|---|---|---|---|---|
| sdo: | http://schema.org/ | schema.org | | |
| foaf: | http://xmlns.com/foaf/0.1/ | FOAF | | |
| dct: | http://purl.org/dc/terms/ | DC Terms | | |
| | | | | |
| Book | sdo:Book | Book | | |
| | sdo:name | title | Literal | |
| | sdo:creator | written by | URI | Author |
| | dct:rightsHolder | owner | URI | Owner |
| | | | | |
| Author | sdo:Person | Author | | |
| | sdo:givenName | given name | Literal | |
| | sdo:familyName | family name | Literal | |
| | | | | |
| Owner | foaf:Person, sdo:Person | Owner | | |
| | sdo:givenName | given name | Literal | |
| | sdo:familyName | family name | Literal | |
| | foaf:knows | friend of | URI | Owner |

Note: blank rows are for readability only.

An Application Profile statement defines how a term from an existing vocabulary may be used and specifies any constraints on the use of that term.

An Application Profile statement refers to an existing vocabulary, an entity type / class, or a property.

Application Profile statements are manifest as rows in the Application Profile csv

An Application Profile statement has cells for the following information (i.e. these are the headers for the columns in the Application Profile csv).

An Application Profile Description Shape defines the permissible terms drawn from existing vocabularies that can be used to describe a specific kind of resource in metadata. An Application Profile Description Shape comprises a set of Application Profile statements. An Application Profile shape may optionally begin with an Application Profile statement that refers to an entity type or class, in which case the shape applies to the description of resources of that kind. An Application Profile shape ends with the statement before the next statement that refers to an entity type or class, as that statement begins the next shape. (In the csv, shapes that don't begin with a statement that refers to an entity type or class must come before any that do.)
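
To make the grouping rule concrete, here is a rough Python sketch that reads the example csv above and splits it into namespaces and shapes. It is only one possible reading, not an agreed algorithm. In particular, treating an AP ID that ends in a colon as a namespace statement is my assumption, as are the function name `parse_profile`, the empty-string key for a leading shape without a class, and the dictionary layout.

```python
import csv
import io

# The example profile as comma-separated text. Property rows leave AP ID empty.
AP_CSV = """AP ID,External ID,AP Label,value type,value space
sdo:,http://schema.org/,schema.org,,
foaf:,http://xmlns.com/foaf/0.1/,FOAF,,
dct:,http://purl.org/dc/terms/,DC Terms,,
Book,sdo:Book,Book,,
,sdo:name,title,Literal,
,sdo:creator,written by,URI,Author
,dct:rightsHolder,owner,URI,Owner
Author,sdo:Person,Author,,
,sdo:givenName,given name,Literal,
,sdo:familyName,family name,Literal,
Owner,"foaf:Person, sdo:Person",Owner,,
,sdo:givenName,given name,Literal,
,sdo:familyName,family name,Literal,
,foaf:knows,friend of,URI,Owner
"""


def parse_profile(text):
    """Split an Application Profile csv into namespaces and description shapes."""
    namespaces = {}
    shapes = {}
    current = None  # statement list of the shape currently being read
    for row in csv.DictReader(io.StringIO(text)):
        ap_id = (row["AP ID"] or "").strip()
        external_id = (row["External ID"] or "").strip()
        if not ap_id and not external_id:
            continue  # blank rows are for readability only
        if ap_id.endswith(":"):
            # Assumption: a trailing colon marks a namespace statement.
            namespaces[ap_id] = external_id
        elif ap_id:
            # A statement referring to an entity type or class starts a new shape.
            current = []
            shapes[ap_id] = {
                "classes": [c.strip() for c in external_id.split(",")],
                "statements": current,
            }
        else:
            if current is None:
                # A leading shape with no class statement (hypothetical key "").
                current = []
                shapes[""] = {"classes": [], "statements": current}
            current.append({
                "property": external_id,
                "label": (row["AP Label"] or "").strip(),
                "value_type": (row["value type"] or "").strip(),
                "value_space": (row["value space"] or "").strip(),
            })
    return namespaces, shapes


namespaces, shapes = parse_profile(AP_CSV)
for shape_id, shape in shapes.items():
    print(shape_id, shape["classes"], len(shape["statements"]))
```

Note that the sketch only works because property rows have an empty AP ID; if property statements get their own IDs, some other signal is needed to tell a class row from a property row.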

Example Application Profile statement about a namespace

| AP ID | External ID | AP Label | value type | value space |
|---|---|---|---|---|
| sdo: | http://schema.org/ | schema.org | | |

Translation: In this Application Profile the vocabulary with the external identifier http://schema.org/ is identified as sdo: and references to it should be labeled as schema.org. Note: where appropriate this is equivalent to specifying that the prefix sdo: refers to the URI http://schema.org/

Example Application Profile Description Shape

| AP ID | External ID | AP Label | value type | value space |
|---|---|---|---|---|
| Book | sdo:Book | Book | | |
| | sdo:name | title | Literal | |
| | sdo:creator | written by | URI | Author |
| | dct:rightsHolder | owner | URI | Owner |

We know that a Description Shape starts with an Application Profile statement referring to an entity type or class, in this case:

| AP ID | AP Label | External ID | value type | value space |
|---|---|---|---|---|
| Book | Book | sdo:Book | | |

[issue: how do we know this refers to an entity type or class & not a namespace?] Translation: when describing things that we identify as Books, and label as Books, we assert that these things meet the definition of a sdo:Book (in RDF this is equivalent to saying that the description must include the statement X rdf:type http://schema.org/Book).

The next Application Profile statement refers to a property:

| AP ID | External ID | AP Label | value type | value space |
|---|---|---|---|---|
| title | sdo:name | title | Literal | |

A description of a Book may have a statement about the title of the book using the sdo:name property, which must have a Literal as a value. When used for books this property is labeled "title".

The next Application Profile Statement refers to a property and specifies a constraint on the values that may be provided in description statements using that property when describing Books:

| AP ID | AP Label | External ID | value type | value space |
|---|---|---|---|---|
| author | written by | sdo:creator | URI | Author |

A description of a Book may have a statement using the sdo:creator property, the value of which must conform to the Description Shape identified in this application profile as Author.

philbarker commented 4 years ago

(note, I fixed some mis-placed |s in the table formatting for the examples in the github comment, but that won't fix the emails if you're following this issue by email. So beware the emails have some things a column out)

tombaker commented 4 years ago

@philbarker I see that @johnsamuelwrites also puts prefix definitions in the same columns as the AP statements in his ShExStatements project - see hospital.csv.

The ShExStatements command-line tool already supports our CSV model; it should be easy to tweak it once we agree on our own terminology.

Note that @johnsamuelwrites uses the term Node Name for what we are calling Entity, Entity Shape, Description Template, etc.

tombaker commented 4 years ago

@philbarker I agree with a lot of your model above but really dislike the idea of using sdo:Book as an "external ID" for the book shape.

As I write above, I do not think it is good modeling practice, in an open data environment, to identify a shape using a class URI - a modeling style that your example seems to actively encourage.

Even in this "shorthand" style of modeling, it would be mandatory for the data to say rdf:type sdo:Book, but those statements are missing from your example.

More fundamentally, I do not like the idea of a column defined by its use of "identifiers". Base namespace URIs, class URIs, and property URIs have different structural positions in the model. Putting too many types of things into a single column for the sake of compactness makes it difficult to grasp what the column actually means. As a column heading, "External ID" is unhelpfully abstract and generic.

I still think that for the sake of clarity, namespace prefixes should have their own columns, and maybe even their own row, but that is a different discussion...

tombaker commented 4 years ago

@philbarker your example, as I see it:

Example Application Profile Description Shape

| Entity Shape ID | Property ID | Property Label | Value type | Value space |
|---|---|---|---|---|
| @book | rdf:type | instance of | URI | sdo:Book |
| | sdo:name | title | Literal | |
| | sdo:creator | written by | Entity Shape Ref | @author |
| | dct:rightsHolder | owner | Entity Shape Ref | @owner |

philbarker commented 4 years ago

@tombaker :

@philbarker I agree with a lot of your model above but really dislike the idea of using sdo:Book as an "external ID" for the book shape.

Ah, that's not what I was meaning. None of the values in that column were meant as identifiers for structures in the AP; they identify the external things that the row/statement relates to.

perhaps

External ID identifies the term(s) from existing vocabulary(ies) to which this statement relates

would be clearer as

External ID identifies the term(s) from existing vocabulary(ies) that this statement includes in the AP and specifies constraints for.

The shape with the internal identifier Book constrains the use of things that are externally identified as schema.org/Book. In RDF this is equivalent to saying that the description must include the statement X rdf:type http://schema.org/Book (so, yes, that's the statement you would get in the RDF instance data).

I don't mind your approach, but how does it work for things that are not in RDF?

tombaker commented 4 years ago

@philbarker

The shape with the internal identifier Book constrains the use of things that are externally identified as schema.org/Book. In RDF this is equivalent to saying that the description must include the statement X rdf:type http://schema.org/Book (so, yes, that's the statement you would get in the RDF instance data).

I'm not following you here... Are you saying that instead of treating rdf:type sdo:Book as a statement, like any other statement, it would be the task of the conversion script to translate the sdo:Book value in the External ID column into an rdf:type sdo:Book statement?

I don't mind your approach, but how does it work for things that are not in RDF?

If the CSV is not intended to be converted into RDF, but some other representation (such as XML or JSON), what requirement might there be to identify that thing with a URI? Would the conversion script need to translate the sdo:Book value in the External ID column into something else?

kcoyle commented 4 years ago

My question is what in the instance data connects to the profile? Given Tom's profile above:

| Entity Shape ID | Property ID | Property Label | Value type | Value space |
|---|---|---|---|---|
| @book | rdf:type | instance of | URI | sdo:Book |
| | sdo:name | title | Literal | |
| | sdo:creator | written by | Entity Shape Ref | @author |
| | dct:rightsHolder | owner | Entity Shape Ref | @owner |

What would my instance data look like, and what aspects of the instance data would be identifiable as being expressed in the profile, and thus can be validated by the profile?

I think this is a separate question, so I will open a new issue for this, but it definitely relates to the question of what defines an entity.

tombaker commented 4 years ago

@kcoyle As I see it...

Those are the parts that can be validated. As for the rest:

As I have suggested above in my comments about DCAT-AP, putting Entity Shape ID into the data in the form of a class URI strikes me as a hack of convenience for data that is designed or intended to be interpreted in a closed or enterprise environment. This is, however, antithetical to the idea of Linked Open Data inasmuch as it aggressively asserts a strict Class-to-Shape correspondence. If instances of dcat:Catalogue were never allowed to match any shape other than the shape asserted, this would define any creative or simply different uses of the class, in an open data environment, as "simply wrong". I am surprised, because none of the examples in the W3C Recommendation for SHACL use this (anti-)pattern.

philbarker commented 4 years ago

@tombaker

I'm not following you here... Are you saying that instead of treating rdf:type sdo:Book as a statement, like any other statement, it would be the task of the conversion script to translate the sdo:Book value in the External ID column into an rdf:type sdo:Book statement?

Yes. Or whatever the statement would be for the framework & syntax being used.
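
A minimal sketch of what that conversion step might look like for the RDF case, in Python. The function names and the `PREFIXES` table are made up for illustration; in practice the prefixes would come from the namespace rows of the csv, and a converter for another framework would emit something else entirely.

```python
# Hypothetical prefix table; a real converter would read it from the csv.
PREFIXES = {"sdo": "http://schema.org/"}


def expand(curie, prefixes):
    """Turn a prefixed name such as sdo:Book into a full URI."""
    prefix, _, local = curie.partition(":")
    return prefixes[prefix] + local


def header_to_type_statement(external_id, prefixes):
    """Rewrite the External ID of a shape's first row as an rdf:type statement.

    "X" stands for whatever node is being described.
    """
    return ("X", "rdf:type", expand(external_id, prefixes))


print(header_to_type_statement("sdo:Book", PREFIXES))
```

The point of the sketch is only that the rdf:type statement is derived by the script rather than written out as a row in the profile.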

If the CSV is not intended to be converted into RDF, but some other representation (such as XML or JSON), what requirement might there be to identify that thing with a URI? Would the conversion script need to translate the sdo:Book value in the External ID column into something else?

Not necessarily identify with a URI, but it will need to know what's being profiled. It wouldn't be a case of translating sdo:Book into XML because schema.org isn't an XML based spec. But if the bookshelf application profile was using CrossRef DOI metadata there would need to be a way of saying in the application profile that the description shape being called @Book in the AP was a profile of the book element in the xml namespace http://www.crossref.org/schema/4.3.7. Whatever was using the AP would have to translate that into whatever template or validation structure matches:

<doi_batch xmlns="http://www.crossref.org/schema/4.3.7" ...>
  <head>...</head>
  <body>
      <book>
      ...
      </book>
  </body>
</doi_batch>

(disclaimer: that's all the DOI metadata I'm prepared to learn for the purpose of providing an example)

philbarker commented 4 years ago

Just to be clear:

| AP ID | External ID | AP Label | value type | value space |
|---|---|---|---|---|
| Book | sdo:Book | Book | | |

Should not be read as Book is an identifier for the same thing as sdo:Book any more than

@book rdf:type instance of sdo:Book

should be read as the shape @book is an instance of sdo:Book

I think they both mean the shape called Book/@book is to be applied to instances of sdo:Book

(And BTW, I really like the convention of @xyz for shape identifiers, can we write it in as recommended practice?)

tombaker commented 4 years ago

@philbarker

I think they both mean the shape called Book/@book is to be applied to instances of sdo:Book

It may often - even frequently - make sense to associate a shape with a data node on the basis of its being an instance of a particular class. However, I'm not seeing the need to create a special construct in the profile to express this convention, especially if it means that the very clear column label Property ID must be changed to the vaguer and more abstract label External ID. If a CSV-to-ShEx conversion script were to interpret the value in that cell as the basis for a rdf:type sdo:Book statement, why not say this more clearly with a rdf:type sdo:Book statement (like the other statements)?

I would not want the design of our CSV model to imply that the class to which the thing being described belongs must or even should be stated explicitly in the data.

The Crossref example uses a URI not of an OWL class, but of an XML element identifier. In this example, if the output were supposed to be XML and prefixed URIs were to be used in the way you show, might one not simply do:

| Entity Shape ID | Property ID | Property Label | Value type | Value space |
|---|---|---|---|---|
| @book | doi_batch_id | DOI batch identifier | URI | http://www.crossref.org/schema/4.3.7 |

In this case, the CSV-to-XML conversion script would need to know that a doi_batch_id would be expressed in the data as an xmlns= attribute (i.e., "property") of a doi_batch element, and I do not see how putting that association into a special type of row, under a column External ID, would make that association any clearer (or more readable). If we want our CSV model to be simple and uniform - to a certain degree model-agnostic, even if (like DCMI metadata terms itself) it is rooted in RDF - yet translatable into different syntaxes, then the syntax-specific use of information from columns like "property ID" will need to be encoded in the conversion script.

kcoyle commented 4 years ago

@tombaker said:

The Entity Shape ID would not be found in the data because that is part of the profile, not the data profiled.

Something in the profile must make a direct connection between the entity in the profile and the entity in the instance data. That connection cannot be made by the properties because:

  1. a profile may have few if any mandatory properties, so basing the identification of entities on the existence of properties may not be possible
  2. we have not ruled out that a property, like "dct:date" could be used with more than one entity in a profile, so the presence of a property may not uniquely identify an entity.

I do not see how we can develop a profile model that does not share an entity identity with the instance data. I think this is tangentially related to your statement that a class cannot be the identifier for an entity. Even if we agree that a class (that would be used in the instance data) cannot have that role, we still need to determine what will have that role.
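
The two points above can be illustrated with a small Python sketch. The profile here is invented: the `@event` shape and sdo:location are hypothetical, and every property is treated as optional. The function simply asks which shapes allow all the properties found on a node.

```python
# Hypothetical profile: allowed (all optional) properties per shape.
PROFILE = {
    "@book": {"sdo:name", "dct:date"},
    "@event": {"sdo:location", "dct:date"},
}


def candidate_shapes(node_properties, profile):
    """Return the shapes whose allowed properties cover everything on the node."""
    used = set(node_properties)
    return sorted(shape for shape, allowed in profile.items() if used <= allowed)


# A node with only dct:date fits both shapes, so the property does not identify one.
print(candidate_shapes({"dct:date"}, PROFILE))
# A node with no properties at all also fits both, since nothing is mandatory.
print(candidate_shapes(set(), PROFILE))
# Only a property unique to one shape narrows it down.
print(candidate_shapes({"sdo:name", "dct:date"}, PROFILE))
```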

kcoyle commented 4 years ago

I'm coming more and more to the conclusion that we cannot develop a model that works for RDF and for non-RDF data. If you wish to prove me wrong, please do it with

  1. a profile
  2. instance data
  3. some statement of algorithms that would test the instance data against the profile
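
For the RDF-shaped case only, those three ingredients might be sketched in Python as below. This is an illustration under several assumptions, not a proposal: the profile is a simplified version of the @book table (sdo:creator is given value type URI rather than a shape reference), the instance data and the ex: identifiers are invented, and the caller has to say which node goes with which shape, which leaves the connection question open. It says nothing about non-RDF data.

```python
# 1. A profile: one shape with three statements (simplified from the @book table).
PROFILE = {
    "@book": [
        {"property": "rdf:type", "value_type": "URI", "value": "sdo:Book"},
        {"property": "sdo:name", "value_type": "Literal", "value": None},
        {"property": "sdo:creator", "value_type": "URI", "value": None},
    ]
}

# 2. Instance data (invented): (subject, property, (value type, value)) triples.
DATA = [
    ("ex:b1", "rdf:type", ("URI", "sdo:Book")),
    ("ex:b1", "sdo:name", ("Literal", "Moby Dick")),
    ("ex:b1", "sdo:creator", ("Literal", "Herman Melville")),
    ("ex:b1", "ex:shelf", ("Literal", "A3")),
]


# 3. An algorithm: check each triple of a node against a chosen shape.
def check(node, shape_id, profile, data):
    """Return a list of problems found for one node against one shape."""
    statements = {s["property"]: s for s in profile[shape_id]}
    problems = []
    for subject, prop, (value_type, value) in data:
        if subject != node or prop not in statements:
            continue  # properties outside the profile are not checked
        expected = statements[prop]
        if value_type != expected["value_type"]:
            problems.append(
                f"{prop}: expected {expected['value_type']}, got {value_type}"
            )
        elif expected["value"] is not None and value != expected["value"]:
            problems.append(f"{prop}: expected value {expected['value']}, got {value}")
    return problems


print(check("ex:b1", "@book", PROFILE, DATA))
```
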
philbarker commented 4 years ago

@tombaker

It may often - even frequently - make sense to associate a shape with a data node on the basis of its being an instance of a particular class.

Yes, either you associate a shape with some specified type of data node or you say that all data nodes have the same requirements.

However, I'm not seeing the need to create a special construct in the profile to express this convention, especially if it means that the very clear column label Property ID must be changed to the vaguer and more abstract label External ID. If a CSV-to-ShEx conversion script were to interpret the value in that cell as the basis for a rdf:type sdo:Book statement, why not say this more clearly with a rdf:type sdo:Book statement (like the other statements)?

Property ID is only clear for properties. There's more in an AP than properties. I mean, we haven't yet begun to talk about how term lists get added in as value spaces. If this is to be all in a single csv, and if the csv columns are to have a common meaning down through all rows, then we have a choice between a degree of abstraction or a proliferation of columns, some of them sparsely populated.

might one not simply do:

| Entity Shape ID | Property ID | Property Label | Value type | Value space |
|---|---|---|---|---|
| @book | doi_batch_id | DOI batch identifier | URI | http://www.crossref.org/schema/4.3.7 |

There might be an approach like that, but DOI batch identifier is a "Publisher generated ID that uniquely identifies the DOI submission batch." I don't see how it tells me which properties to use/expect when dealing with a book as opposed to a journal (say).

tombaker commented 4 years ago

SHACL associates nodes to shapes by pointing to specific nodes (by URI) or to nodes with a specific type arc or that are subjects or objects of arcs with a specific predicate. Similarly, ShEx associates nodes to shapes by means of a Shape Map that can be constructed by enumeration or by query, which might query for type arcs (e.g., my:PersonShape => rdf:type foaf:Person) or predicates (e.g., my:PersonShape => foaf:knows).
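
As a rough Python illustration of these selection criteria (plain tuples rather than a real SHACL or ShEx engine; the triples, the ex: nodes and the function names are invented, and subclass reasoning is ignored):

```python
# Invented instance data as (subject, predicate, object) triples.
TRIPLES = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", "Bob"),
]


def nodes_of_type(triples, cls):
    """Nodes with a type arc pointing to the given class."""
    return {s for s, p, o in triples if p == "rdf:type" and o == cls}


def subjects_of(triples, predicate):
    """Nodes that are subjects of arcs with the given predicate."""
    return {s for s, p, o in triples if p == predicate}


def objects_of(triples, predicate):
    """Nodes that are objects of arcs with the given predicate."""
    return {o for s, p, o in triples if p == predicate}


# Different criteria associate different nodes with the same shape.
by_enumeration = {"my:PersonShape": {"ex:bob"}}
by_type = {"my:PersonShape": nodes_of_type(TRIPLES, "foaf:Person")}
by_predicate = {"my:PersonShape": subjects_of(TRIPLES, "foaf:knows")}
by_object = {"my:PersonShape": objects_of(TRIPLES, "foaf:knows")}

print(by_type, by_predicate, by_object)
```

Here ex:bob has no type arc, so a type-based criterion never selects it, while enumeration or the objects of foaf:knows do.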

I see the association of shapes to nodes as something that is out of scope for the profile per se (it is at any rate out of scope for a ShEx schema, which I see as one possible expression of a profile). Indeed, one might want to use shapes for validation in different ways. I do not at any rate see this as one of the requirements identified in an earlier stage of this discussion.