Basic structure - Githubissues

kcoyle commented 5 years ago

We need to settle on a basic structure (or at least a concept thereof) for profiles. In the work that I did, I started with the structure expressed in the DSP (diagram: dsp-uml). The information there about values (everything under Statement) comes from the earlier work on the Dublin Core Abstract Model and I think it's more detailed than we need. I would suggest that we begin with something like:


Description Set (the top level of the profile)
    Entity description (for the "things" being described in the profile)
        Statement (essentially the property if one is thinking in terms of RDF)
            Value 
                Value constraints

The basic rules could be:
- A description set has one or more entity descriptions
- An entity description has one or more statements
- A statement has one value
- A value can have various constraints (the set of which we will need to define)
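
A minimal sketch of these rules as Python dataclasses, for illustration only. The class and field names (`DescriptionSet`, `EntityDescription`, `Statement`, `Value`, `constraints`) are assumptions taken from the outline above, not an agreed vocabulary, and the "one or more" checks are just one possible reading of the rules.

```python
from dataclasses import dataclass, field


@dataclass
class Value:
    # The set of constraints is still to be defined, so it is kept
    # here as free-form key/value pairs (an assumption).
    constraints: dict = field(default_factory=dict)


@dataclass
class Statement:
    property_id: str  # e.g. an RDF property such as "dct:title"
    value: Value      # rule: a statement has one value


@dataclass
class EntityDescription:
    name: str
    statements: list

    def __post_init__(self):
        # rule: an entity description has one or more statements
        if not self.statements:
            raise ValueError(f"{self.name!r} needs at least one statement")


@dataclass
class DescriptionSet:
    entity_descriptions: list

    def __post_init__(self):
        # rule: a description set has one or more entity descriptions
        if not self.entity_descriptions:
            raise ValueError("a description set needs at least one entity description")


# Usage: one entity ("book") with one statement and one constrained value.
book = EntityDescription(
    "book",
    [Statement("dct:title", Value({"datatype": "xsd:string"}))],
)
profile = DescriptionSet([book])
```

Keeping the constraints as an open dictionary mirrors the point that the set of value constraints has not been decided yet.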

analice1pt commented 5 years ago

@kcoyle , I have never liked the concept of Description Set Profile. I think it is confusing and collides with Heery and Patel's concept of AP. Instead of DSP, I would prefer to use AP.

kcoyle commented 5 years ago

@analice1pt Please say what you think the structure is as described by Heery and Patel.

nichtich commented 5 years ago

So we agree all data is structured in "entities" (aka records), "properties" (aka fields) and "values"? What about subfields, lists and hierarchies? At least subfields are very common in the ad-hoc data that most requires profiles. For instance:

- a name value (given as plain string) could be grouped into given and surname
- an author field could be used to express a list of names instead of repeating the field (not easily doable in tabular data)
- a date field could contain qualifiers such as "circa" and "?"

If data were already given in clean RDF there would be less need to define APs; the hard cases are elsewhere.

kcoyle commented 5 years ago

@nichtich I don't see "entities" as "records" - in RDF terms, they might be graphs (in ShEx terms they might be "shapes"), that is, a set of properties with a common subject. A metadata "record" (which is approximately what the Description Set describes) could have multiple entities within it, such as document, person, publisher. Each of those would be described with properties. As for "subfields, lists, hierarchies" these are types of values that one should be able to express, and I don't see any real problem with them:

- depending on your metadata model, a name value could be a string with some subfield encoding (as in MARC or ISBD), or it could be a graph with properties (name graph with given and surname properties)
- lists can be structural arrays or could be comma-delimited strings (which can be included in tabular data) - how this works depends on your schema language
- to qualify a value you either define a string that can contain elements like "circa", you use a standard that includes that in its encoding, or you define a graph that has a date and a qualifier (in the spirit of SKOSXL - saying things about strings)

I would be interested to hear why you think what has been said so far does not include these capabilities, because I never thought they would be excluded.
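
As a rough illustration of the list and qualifier cases, the sketch below turns an encoded string into a small structure. Both helpers (`parse_qualified_date`, `split_names`) and the string conventions they accept (a "circa" prefix, a trailing "?", a comma delimiter) are hypothetical; they are not taken from any standard mentioned in this thread.

```python
import re


def parse_qualified_date(text):
    """Split a string such as 'circa 1850?' into a year plus qualifiers."""
    match = re.fullmatch(r"\s*(circa\s+)?(\d{4})(\?)?\s*", text)
    if match is None:
        raise ValueError(f"unrecognised date string: {text!r}")
    qualifiers = []
    if match.group(1):
        qualifiers.append("circa")
    if match.group(3):
        qualifiers.append("uncertain")
    # The result plays the role of a small "graph": a date and its qualifiers.
    return {"date": match.group(2), "qualifiers": qualifiers}


def split_names(cell, delimiter=","):
    """Read a delimited cell of tabular data as a list of names."""
    return [name.strip() for name in cell.split(delimiter) if name.strip()]


print(parse_qualified_date("circa 1850?"))
print(split_names("Karen Coyle, Tom Baker"))
```

A comma delimiter only works if the names themselves contain no commas, which is one reason the choice depends on the schema language and the data.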

analice1pt commented 5 years ago

@kcoyle , I like your structure. I just don't like the term "Description Set Profile" because I think it is confusing if we take as a basis Heery and Patel's definition of AP (we will have to extend it, but the core is there) instead of the Singapore Framework (SF). So, I would just change one thing: instead of "A description set has one or more entity descriptions", I would put "An application profile has one or more entity descriptions", or something around that.

analice1pt commented 5 years ago

@nichtich , I agree we have to acknowledge that much of the data we'll be dealing with is tabular data. I also agree we'll have to deal with it in a different way than we deal with regular descriptions of "things" (e.g., I also don't like the repetition of the field/property). Maybe it's a good idea to describe the structure of the tabular data, like the approach of [1]. I don't know that specification, I've just browsed through it, but I think it is related to some of the issues you mention.

analice1pt commented 5 years ago

@nichtich , sorry, I forgot to add [1] - http://www.w3.org/TR/tabular-metadata/

tombaker commented 5 years ago

@analice1pt I note that the broader view of "application profile", as in Singapore Framework, is closer to the view currently under discussion in the W3C Data Exchange WG -- "application profile" as a cluster of related documents. I really like Karen's proposal. To me, it follows more naturally to say that a Description Set is a set of Entity Descriptions than to say that an Application Profile is a set of Entity Descriptions.

analice1pt commented 5 years ago

@tombaker , I understand that, but that is a perspective that is completely different from Heery and Patel's, which is, in my view, the one most used by the community. When people refer to AP, they are not thinking of a cluster of documents. If we do a quick scan over the literature we will notice that.

kcoyle commented 5 years ago

This is why we need to clarify our terms. I agree that when folks talk about "profiles" or "application profiles" they generally mean a main document that describes their selected terms, values, and rules. In fact the DSP is subtitled: "A constraint language for Dublin Core Application Profiles" even though it does not include all of the boxes of the AP levels of the Singapore Framework. I suppose you could argue that things like domain models and vocabularies (that's the SF term) are inherent in the DSP, but it's pretty telling that even there we find confusion between what is and what isn't included when we say "application profile."

By the way, I think that we can have a "constraint language for application profiles" and also have more than one document that supports an AP, kind of like the DXWG situation. Having an AP expressed in a constraint language doesn't mean that you don't have other documents like primers, various serializations of your validation code (SHACL, ShEx) that are separate from the base document, etc.

Also, note that the BIBFRAME profiles use:

Profile
   Resource
      Property

instead of the DSP terms, although they are essentially the same as the DSP concepts (BIBFRAME profiles were developed out of the DCAP work). I've started a comparative table with DSP, BIBFRAME, Sinopia, etc. so we can see what equivalencies we might find between different models. It is incomplete and I'm quite happy to get help with it.

analice1pt commented 5 years ago

Again, for me an AP is a data model. It can be expressed in several ways: a text, a (probably multidimensional) matrix, a formal language...

kcoyle commented 5 years ago

@analice1pt I agree that it can be expressed in various ways. The question is whether we give it some structure. The structure being proposed orients around the description of "things" in the user's data model. Thinking in terms of something like UML, each entity would be described with its related properties. We can change what we call things, but I'm wondering if this structure would work (diagram: dspDiagram3). This is expressed as a data model but could be implemented as a document or in a number of different formats.

Feel free to share a model that you prefer - I'd like to see more ideas.

analice1pt commented 5 years ago

I agree, @kcoyle and I find this diagram very interesting.

My issue is another one: for me a data model may be expressed in many ways, including text (not desirable, but possible). Of course, I prefer formal, even graphic ways (E-R diagram, a graph...). As I see it, an AP could be represented using an extended version of an E-R diagram, for instance, but also as an ontology or a (possibly multidimensional) matrix.

This is intimately related with the definition of AP.

kcoyle commented 5 years ago

Ana, I agree that this could be a text. I'd point to the DCAT-AP except that it's darned hard to get to, but it's a PDF that describes, in text and tables, something that could be described as the structure above. So it's not about the graphic, but about the concepts: that an AP is made up of entity (resource) descriptions that have statements (properties) with defined values. Does that work?

nishad commented 5 years ago

@kcoyle I prefer most of the structure to be optional, which helps the users to create at least a minimal DSP.

Something like :

- Bare Minimal Structure -> Minimal/Basic DSP
- Complete Structure -> Complete/Full DSP
- Extended Structure (Maybe with validation, examples, etc.) -> Advanced/Extended DSP

This may help the adoption curve to be smoother, and still, we could encourage communities to improve their DSPs to complete or extended.

kcoyle commented 5 years ago

@nishad I think we're talking past each other. There needs to be a conceptual structure that has nothing to do with the formal schema. Without a conceptual model we can't model anything. What are the concepts that will make up our definition of an AP? We could opt for flat (like DC Terms) but my gut feeling is that a profile requires more than that.

Let's try an example in words:

- My profile is about things that are books
- Books have properties (qualities) like titles, pagination, date of publication
- Books also have authors
- Authors are people who have certain properties/qualities (names, birth dates)

We have two logical entities: books and people. This means that we have a two-entity structure, and those two entities have relationships to each other. We could obviously also have a set of data that has only one entity, but I think the primary use case will be for more than one. If folks don't agree with that, we could develop a bare minimal flat structure (although it would really just be a subset of a more complete structure).

In my view, within the DSP, nothing would be mandatory, so creators of profiles could be as minimal as possible: one entity, one property.
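
The book example can be written down as plain data to make the two-entity structure visible. This is only a sketch: the key names (`book`, `person`, `value_entity`) and the helper `related_entities` are invented for illustration and are not part of any proposed vocabulary.

```python
# Two entity descriptions; each property maps to a (possibly empty) dict of
# details. "value_entity" is a hypothetical key marking that the value of a
# property is itself described by another entity in the same profile.
profile = {
    "book": {
        "title": {},
        "pagination": {},
        "date_of_publication": {},
        "author": {"value_entity": "person"},
    },
    "person": {
        "name": {},
        "birth_date": {},
    },
}

# The most minimal case: one entity, one property, nothing else stated.
minimal_profile = {"thing": {"label": {}}}


def related_entities(profile, entity):
    """List the entities that the given entity points to through its properties."""
    return sorted(
        spec["value_entity"]
        for spec in profile[entity].values()
        if "value_entity" in spec
    )


print(related_entities(profile, "book"))
print(related_entities(minimal_profile, "thing"))
```

The flat, one-entity profile uses exactly the same structure as the two-entity one, which is the sense in which a minimal profile is a subset of a fuller one.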

tombaker commented 5 years ago

@analice1pt I take the point that what the Singapore Framework calls the DSP is what many (or most) people call the Application Profile. I guess I am okay with using that as the top level, as long as there is a sentence somewhere that says that an application profile is basically a set of descriptions.

@kcoyle For the next layer, I still like "entity description". "Resource description" would be perfect, but it carries far too much baggage (eg, baked into the names of RDF, RDA...).

tombaker commented 5 years ago

I see two styles of naming here:

- one that focuses on the nature of things "in the real world" that are being described: entities such as books and their properties (qualities). BIBFRAME's use of Resource fits this style.
- one that focuses on the structure of things "in your data" - e.g., descriptions and statements.

To me, APs are about things "in the data". If you really want to describe things "in reality" (and not "in your data"), then you have an ontology, which may be useful for applications that draw conclusions, make decisions, or plug information gaps on the basis of inferencing. But if you really want to describe things "in your data", for example to ensure coherence and support conformance validation, you need an application profile.

To describe things "in your data", it is important to distinguish between "things actually in your data", and the template you define for describing the things in your data. Something like:

Structure of your data instances        About the structure of your data instances

Graph or Record (aka Description Set)   Application Profile (aka Description Set Template)
    Entity Description                      Entity Description Template
        Statement                               Statement Template
            Property                                Property Template
            Value                                   Value Template

This is not too far from Mikael's Description Set Profile constraint language. Mikael distinguished "templates" (used to express structures) and "constraints" (used to limit those structures), which I still think is a good way to understand the problem.

analice1pt commented 5 years ago

@kcoyle , for me it works fine that "(...) an AP is made up of entity (resource) descriptions that have statements (properties) with defined values".

kcoyle commented 5 years ago

@tombaker I find the use of the word "template" in DSP a bit hard to wrap my head around. I consider templates to be quite prescriptive, and my preference for our work here is to enable but to prescribe as little as possible. DSP was more prescriptive than I would like us to develop today. As such, I don't think of the potential vocabulary to be a template - any more than DC Terms is a template. But of course you could use the vocabulary to create templates that users fill in.

This becomes very meta, because, like Sinopia, you could have templates exposed by the software for the creation of profiles, so our profile vocabulary gets profiled when someone uses it to create their profile. (This is where my brain stops working, due to meta on meta...)

kcoyle commented 5 years ago

@analice1pt I believe that this diagram represents "(...) an AP is made up of entity (resource) descriptions that have statements (properties) with defined values". I could change the wording if that is preferable. The extra corners are to show that there can be multiple entity descriptions and property descriptions, but there is only one (logical) set of value constraints for each property.

Does this express what you were thinking?

(diagram: newDSP 001)

tombaker commented 5 years ago

@kcoyle I do not think of templates as inherently being heavily prescriptive. Prescriptive yes, but not necessarily heavily. For example, a statement template which says nothing more than "there may be a statement using dct:format" is prescriptive, but not heavily.

On the other hand, if template has the wrong feel, this could be an excellent place to use the term shape in a generic sense. ShEx and SHACL both use shape, albeit in slightly different ways, but if you squint there is a common notion of something that specifies the prescribed or expected structure and content of data.

@kcoyle As I write this, the diagram above has appeared. I agree with value constraint instead of value template but would go a step further and say property constraint instead of property description which happens also to be the distinction that Mikael made (see first posting in this thread, above). So I would like to replace my earlier proposal with:

Structure/content of your data          About the structure/content of your data

Graph or Record                         Application Profile
    Entity Description                      Entity Description Shape
        Statement                               Statement Shape
            Property                                Property Constraint
            Value                                   Value Constraint

tombaker commented 5 years ago

@analice1pt I'm not sure I would go so far as to say an AP must define values. Conceivably, a simple profile could say that data should use Creator, Date, Title, and Format, but remain silent about the type or content of values.

kcoyle commented 5 years ago

@tombaker What makes a property constraint a constraint? (I think we need to define constraint before going much further.)

And although it is not "required" that templates are prescriptive, that's just how they strike me. I think of templates as being quite concrete. I can see calling this a template:

but I have a harder time using that term with the underlying vocabulary. This may just be me, but I'd like to try this out with a larger group.

tombaker commented 5 years ago

@kcoyle I like how Mikael made the distinction: templates express structures, and constraints limit those structures. Or as I would now prefer to say it: shapes express structures, and constraints limit those structures.

To put it (perhaps too) concretely, Entity Description Shape and Statement Shape draw boxes, then Property Constraints and Value Constraints limit the content of those boxes. If we could agree on that, we should be able to come up with a simple definition of constraint. I'm thinking that for our purposes, it would be nice if we could define shapes versus constraints at such an intuitive level, then rely on examples as illustration, without trying to get too abstract and technical.

kcoyle commented 5 years ago

@tombaker Where I have problems with using constraints for the statement level is how it fits with the act of creating a profile. Let's say you define an entity for your AP, and to describe it you list in your AP 3 properties from vocabulary A and 2 properties from vocabulary B. And that's all you do - no further rules or definitions. Is that a constraint? Or is it only a constraint if you define something like cardinality?

This is the same problem that we have had with the DXWG's PROF use of "constraint" - that "adding" a property to your profile is defined as a constraint, and I don't think that fits logically with how most people use the term "constraint". In fact, in the dictionary it is associated with the concept of "a limitation". So expanding your profile by adding terms is a limitation. While you CAN make the case that making a choice from the wide world of vocabularies is a limitation/constraint, I fear that is not going to make intuitive sense to a person seeing the term "constraint". At the same time, trying to make the distinction that including a property is itself not a constraint but adding any limitations on that property is -- well, again, I don't think it'll work without a fair amount of explanation.

I prefer using "description" instead of constraints for this reason: that some of the acts of creating a profile will not "feel" like limitations to those creating the profiles.

tombaker commented 5 years ago

@kcoyle

> Where I have problems with using constraints for the statement level is how it fits with the act of creating a profile. Let's say you define an entity for your AP, and to describe it you list in your AP 3 properties from vocabulary A and 2 properties from vocabulary B. And that's all you do - no further rules or definitions. Is that a constraint? Or is it only a constraint if you define something like cardinality?

A thought experiment: Let's suppose that the minimal well-formed application profile might say: "Zero or one statements using any property" (i.e., to say anything about anything, or even nothing about anything).

Let's suppose that to do this, the AP would express an Entity Description Shape and a Statement Shape (ShEx actually calls it a Triple Constraint), and that the content of that Statement Shape would be a property constraint of "*" and a value constraint of "*" (where "*" means "anything or nothing").

In this sense, the shapes provide the boxes, and the constraints limit (or in this case, declare "no limit") on the contents of those boxes. In 99% of real-world cases, one would name the property (constraining the set of all properties to just one) and typically also constrain the value.
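
The thought experiment can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `StatementShape` class, its field names, and the choice to treat a value constraint as a required datatype name. Cardinality ("zero or one") is deliberately left out.

```python
from dataclasses import dataclass

ANY = "*"  # wildcard, read here as "no limit"


@dataclass(frozen=True)
class StatementShape:
    """The 'box': it always has both slots, even when neither is limited."""
    property_constraint: str = ANY
    value_constraint: str = ANY  # assumed to name a datatype, or ANY

    def allows(self, prop, value_datatype):
        prop_ok = self.property_constraint in (ANY, prop)
        value_ok = self.value_constraint in (ANY, value_datatype)
        return prop_ok and value_ok


# The minimal profile from the thought experiment: any property, any value.
anything = StatementShape()

# The common real-world case: name the property and constrain the value.
title_only = StatementShape(
    property_constraint="dct:title",
    value_constraint="xsd:string",
)

print(anything.allows("dct:format", "xsd:string"))
print(title_only.allows("dct:format", "xsd:string"))
```

In this reading, naming a property is simply the wildcard narrowed to a single choice, so listing a property and constraining it use the same mechanism.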

That said, I think our task here is to come up with a model that is relatively concise and easy to grok, without much extra jargon, and that translates on the back end into more expressive languages such as ShEx. And we should strive to define a model that aligns with those more expressive languages, but without trying to match the expressivity of those languages and without getting too hung up on terminology.

analice1pt commented 5 years ago

@nishad , I like your proposal of having part of the structure optional.

@kcoyle , I agree “nothing will be mandatory”. In order to address @nishad ’s concerns, maybe we could give examples at different levels of complexity / coverage.

@kcoyle and @tombaker , I clearly prefer entity description to resource description. IMHO, when referring to entities we mean abstract things, whilst when referring to resources these are instances, concrete things. I vote for leaving resources to RDF.

@kcoyle and @tombaker , I agree with @kcoyle when she says that her preference "for our work here is to enable but to prescribe as little as possible". When designing an AP we want to express what data we have, how it is constrained, how it relates and how it relates to other, external, data. We can derive a template from here, but I would leave that to other phases such as validation. IMHO, an AP has much more to do with design than with validation.

@kcoyle , yes, that picture expresses what I was thinking. However, recalling @nichtich's concerns, I am now thinking that the picture reflects well the concepts behind traditional RDF descriptions, but I am not sure it is well suited for tabular data or even if it should be. My intuition says we should think more about this subject.

@tombaker, I am not sure I fully understand your proposal. What is a record? Is it a file? For me a record may be part of a file and it holds the description of something (a book, a person,...). And for me the notion of shape is the notion of something with boundaries. However, our "shapes" do not have boundaries, they are open. We may decide to use the name “shape” in order to align it with other specifications, but otherwise I am sure we can find a much better name, of something unbounded, related, open.

I also agree with @kcoyle that the notion of constraint may not be clear at all for everybody.

marianamalta commented 5 years ago

"I clearly prefer entity description to resource description. IMHO, when referring to entities we mean abstract things, whilst when referring to resources these are instances, concrete things. I vote for leaving resources to RDF."

I totally agree with @analice1pt

tombaker commented 5 years ago

@analice1pt

> We may decide to use the name “shape” in order to align it with other specifications, but otherwise I am sure we can find a much better name, of something unbounded, related, open.

We could use template, as in the DSP spec, but I'm not hearing much consensus for that. But I actually prefer shape and, at any rate, think we should avoid introducing Yet Another term when either template or shape, I would argue, are good enough.

> What is a record? Is it a file? For me a record may be part of a file and it holds the description of something (a book, a person,...). And for me the notion of shape is the notion of something with boundaries. However, our "shapes" do not have boundaries, they are open.

I think I see what you are driving at but would like to make some distinctions. I agree that metadata does not always come in a bounded "record". Graph-based data, for example, can always be extended by merging in new triples. To me, a shape is like a lens, or filter, for focusing in on certain aspects of that graph. You could have two million triples about actors, musicians, and athletes, and you could write a Singer Shape which says that if a person is described with certain properties (like "sings") or classes ("Singer"), you expect to find a triple about that person's "voice type" (soprano, alto, tenor, baritone...). You could match that four-line Singer Shape against the two million triples and it would only focus in on the places in the graph that match your shape.

I wonder if this notion of shape meets your preference for something unbounded, because in this scenario the data is, in a sense, unbounded - or at any rate it might be unfixed or constantly changing. The shape itself, meanwhile, could be either "open" (which would mean: "the data might have this, this, and this") or "closed" (meaning: "the data must have this, this, and this - and nothing else").
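
A small sketch of the Singer Shape idea, matching a shape against a handful of triples instead of two million. The data, the `ex:` identifiers, the `SINGER_SHAPE` dictionary and the `check_shape` function are all hypothetical. The sketch also makes one interpretive assumption: the expected properties are required in both modes, and "closed" additionally forbids any other property.

```python
triples = [
    ("ex:maria", "rdf:type", "ex:Singer"),
    ("ex:maria", "ex:voiceType", "soprano"),
    ("ex:maria", "ex:birthPlace", "ex:somewhere"),
    ("ex:luciano", "rdf:type", "ex:Singer"),
    ("ex:serena", "rdf:type", "ex:Athlete"),
]

SINGER_SHAPE = {
    "target_class": "ex:Singer",                # which nodes the lens focuses on
    "expected": {"rdf:type", "ex:voiceType"},   # properties the shape expects
}


def check_shape(triples, shape, closed=False):
    """Return {subject: conforms?} for every node the shape focuses on."""
    by_subject = {}
    for subject, prop, obj in triples:
        by_subject.setdefault(subject, []).append((prop, obj))

    report = {}
    for subject, pairs in by_subject.items():
        if ("rdf:type", shape["target_class"]) not in pairs:
            continue  # outside the lens: the shape says nothing about this node
        props = {prop for prop, _ in pairs}
        missing = shape["expected"] - props
        # Only a closed shape objects to properties it did not mention.
        extra = props - shape["expected"] if closed else set()
        report[subject] = not missing and not extra
    return report


print(check_shape(triples, SINGER_SHAPE))
print(check_shape(triples, SINGER_SHAPE, closed=True))
```

The athlete never appears in either report, which reflects the lens idea: the shape is matched only against the parts of the graph it targets, however large the rest of the data is.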

analice1pt commented 5 years ago

@tombaker: "We could use template, as in the DSP spec, but I'm not hearing much consensus for that. But I actually prefer shape and, at any rate, think we should avoid introducing Yet Another term when either template or shape, I would argue, are good enough."

Not Yet Another term is a quite good argument that always convinces me. I agree. :-)

tombaker commented 5 years ago

@analice1pt I am open to your arguments for considering Application Profile to be the top level (see my latest proposal above). However, I just noticed that the Program for Cooperative Cataloging defines MAP in a way that is more in line with the notion of MAP as a set of related specs in the manner of Singapore Framework:

> A MAP may be a multipart specification, with human-readable and machine-readable aspects, sometimes in a single file, sometimes in multiple files (e.g., a human-readable file that may include input rules, a machine-readable vocabulary, and a validation schema).

marianamalta commented 5 years ago

An AP is a multipart specification that defines a data model. This specification has human-readable and machine-readable aspects.

analice1pt commented 5 years ago

@tombaker and @marianamalta, the Singapore Framework (SF) associates an AP with a set of documents that are related to the design process, not the final results of that process. I am much more inclined to agree with a definition that points to the final results of the process than to all results / deliverables in a pack. This means that I am not shocked by the definition of the Program for Cooperative Cataloging. The way I read it, it is much closer to Heery and Patel's definition than to the SF. What I do not like in the SF is that the AP includes things such as the specification of the requirements, or of the Domain Model.

kcoyle commented 5 years ago

I also read SF as a procedure of a kind. It shows the background information that goes into the creation of a (hopefully) machine-readable profile. At the same time, it shows there a guidance document which may be separate from the machine-readable profile.* I don't see any problem with that. There is no reason why these two documents can't have links to each other.

While I would like to see some guidance in a machine-readable profile, if nothing else so that the user creating data can quickly see some information about the required data, there are guidance rules like Resource Description and Access that are thousands of pages long and need to be separate. Note that the profile editor for BIBFRAME (Sinopia) allows you to embed a link to the rule for that property in the online version of the rules. Something like RDA is much more complex than we need to address, I believe.

tombaker commented 3 years ago

@kcoyle Great discussion, but issues have continued elsewhere. Close?

kcoyle commented 2 years ago

Done. see dc tap

dcmi / dcap

Basic structure #15