USE CASE: Profile schema.org for learning materials

philbarker commented 5 years ago

(...or more generally, profile big spec for specific use)

Creator: Phil Barker

Problem statement

schema.org, like space, is big. You just won't believe how vastly, hugely, mind-bogglingly big it is. It can be used to describe everything from bus trips to medical conditions, from people to volcanoes. The sheer number of terms is problematic to anyone creating a system for describing a subset of things, such as learning materials, or more specific still, textbooks.

It would be useful to be able to specify a subset of schema.org, drawing on relevant types and properties and optionally restricting the value spaces for properties, that gives a more manageable number of terms.

Stakeholders

People creating systems for describing specific classes of resources don't want to code for entities that they will never describe. People wanting to describe specific classes of resources don't want to be distracted & befuddled by a plethora of options that are not relevant to them.

Links

schema.org full type hierarchy

Requirements

The application profile must specify which terms from the base spec are included in the profile in a machine readable way.
The application profile must specify any changes to domains and ranges of the included terms.
The application profile may specify cardinality rules.
schema.org is evolving and so the application profile must deal with changes in the base spec.
The application profile may specify additional restrictions to value spaces for properties, for example that a certain controlled vocabulary be used.
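As a rough illustration of these requirements, a profile could be written down as plain data. The sketch below is only an assumption about what such a structure might look like: the class and field names (`TermRule`, `Profile`, `base_version`, `min_count`, `max_count`, `value_stem`), the placeholder version string, and the example.org vocabulary URL are all hypothetical and not part of any agreed notation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TermRule:
    """Constraints a profile places on one property of the base spec (hypothetical shape)."""
    min_count: int = 0                # cardinality: minimum number of values
    max_count: Optional[int] = None   # cardinality: None means unbounded
    value_stem: Optional[str] = None  # if set, values must start with this IRI


@dataclass
class Profile:
    base: str            # namespace of the base specification
    base_version: str    # pinned so that changes in the base spec are visible
    terms: dict = field(default_factory=dict)  # property name -> TermRule

    def includes(self, prop: str) -> bool:
        """True if the property was picked from the base spec for this profile."""
        return prop in self.terms


# Hypothetical textbook profile that picks three schema.org properties.
textbook = Profile(
    base="https://schema.org/",
    base_version="example-version",  # placeholder, not a real release number
    terms={
        "name": TermRule(min_count=1, max_count=1),
        "author": TermRule(min_count=1),
        "educationalLevel": TermRule(value_stem="https://example.org/levels/"),
    },
)

print(textbook.includes("name"))     # a term that was picked
print(textbook.includes("busName"))  # a term that was left out
```

Changes to domains and ranges are not modelled here; they would need a further field on the rule, whose form is an open question in this thread.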

Comments

paulwalk commented 5 years ago

In this use-case, is the application profile limited to the single base specification (or "namespace"), or could it draw on terms from other namespaces in addition to schema.org? I think you intend the former (restricted to using terms from schema.org) in this use-case.

I only raise this because I think it points to the fact that we might entertain two kinds of AP - those which are, essentially, adding constraints to an existing single base-specification or namespace, and those which do this but also add a "pick-and-mix" aspect, drawing on multiple namespaces.

philbarker commented 5 years ago

I don't think we need two kinds of AP. "Pick and mix from n specifications" works for n=1 as a special case, not a different class.

This is in keeping with Patel and Heery, "We define application profiles as schemas which consist of data elements drawn from one or more namespaces"

This use case emphasizes the 'pick' aspect of the functionality. Whether it uses more than one base specification depends on whether you consider the specifications that provide the value spaces for properties (i.e. encoding schemes, controlled vocabularies etc.) as base specifications.

paulwalk commented 5 years ago

I agree that, from a modelling point of view, one of these cases is best considered as a special case of the other, rather than in a different class.

However, I note in passing that in the case where an application profile uses only terms from a single base specification, and adds no other constraints, then it is identical in form to a 'base specification'.

I do appreciate that this is not quite what you describe in your use-case, so let's treat this as a passing comment only :-)

philbarker commented 5 years ago

Sure, "use what has been picked from a base specification" is different from "use a base specification"; very different when the base specification is very big.

analice1pt commented 5 years ago

@philbarker , I think this is very important and is related to the use case I proposed at issue #11. Very nice.

kcoyle commented 5 years ago

For the W3C DXWG work I had written (a draft) that covers types of profiles, and it goes like this:

profiles that are subsets of a larger vocabulary. These reduce the vocabulary terms of a broad data standard to a smaller number of terms that are useful for a particular community member or application. An example of this is BIBFRA.me, which is designed for library materials and defines both a core set of terms as well as profiles for specialized communities such as cataloging of rare materials or early printing trade. In this community, all profiles use only terms from a single vocabulary.
profiles that can both reduce and extend a base standard. These profiles are developed by members of a data-sharing community who, for reasons of jurisdiction or specialization, need to add terms beyond the base standard vocabulary in order to meet their needs. They may also omit terms from the base standard that are not relevant to their implementations. An example of this is the data catalog vocabulary standard, DCAT, its primary profile, DCAT-AP, and the national variants (DCAT-AP-IT, DCAT-AP-NO, DCAT-AP-DE). While maintaining overall compatibility with the larger data catalog community, each of these profiles adds needed terms for the local variant. These profiles generally make use of terms from more than one namespace.
profiles that amend a base standard by inheriting or overriding values of that standard. The example here is of the Open Digital Rights Language (ODRL) which is a language to support rights in the use of digital content in publishing, distribution, and consumption of digital media. The ODRL language encodes a policy that has a core vocabulary that can be extended or overridden by individual instances called "profiles."
profiles that use some vocabulary terms from multiple standards without having a strong relationship to any base standard. These profiles develop new groupings of existing terms as vocabularies and may define new terms as needed. An example of this is the Asset Description Metadata Schema (ADMS) vocabulary [vocab-adms]

The first one meets Phil's schema.org example. We might want to develop a general statement about how variable profiles can be, and still be profiles.

philbarker commented 5 years ago

@kcoyle I am interested in the idea of a profile 'overriding values' of the base standard as in your third bullet point. Could you expand?

kcoyle commented 5 years ago

@philbarker Well, that's a very specific case, and it is lightly explained in the ODRL model. The concept of overriding appears in the "conflict" area where it says: "Additional conflict property values MAY be defined by ODRL Profiles". I don't guarantee that I have fully understood this, but it seems that they are saying that one could define profiles of profiles (or of a core profile) and then have rules for what to do if the profiles conflict - so if the base profile says that the value of X must be a string and another profile says that the value of X must be an IRI, what is a processing application supposed to do in this case? For ODRL, you can decide that either the base profile value or the derived profile value is the valid one.

It may well be that this does not mesh with the way that we are using profile and thus we can ignore it. However, some of the DXWG folks are interested in "profile inheritance" for which some conflict rules would need to be defined.

The question of "profile inheritance" is another can of worms that we will need to address, even if our conclusion is that we will remain silent on that particular functionality. (Of course, first we'd have to define it.)

philbarker commented 5 years ago

thank you @kcoyle that's interesting. It seemed odd to have a profile that does not conform to the base specification (whatever that means--it's normally instances that conform), but I see several ways that two APs of the same spec could conflict. As well as constraining the value space as you describe, there could be conflicting rules on cardinality. So I think instances that conform with the AP should conform with the base spec, but not every instance that conforms with the base spec will conform with the AP.
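The asymmetry described above can be shown with a toy conformance check. This is a minimal sketch that reduces both the base spec and the APs to cardinality rules only; the `conforms` function, the rule tuples, and the sample records are hypothetical.

```python
def conforms(instance: dict, rules: dict) -> bool:
    """True if each constrained property has between lo and hi values (hi=None: unbounded)."""
    for prop, (lo, hi) in rules.items():
        count = len(instance.get(prop, []))
        if count < lo or (hi is not None and count > hi):
            return False
    return True


# Hypothetical rules: the base spec allows any number of authors,
# while two APs of the same spec tighten that in incompatible ways.
base_spec = {"author": (0, None)}
ap_one = {"author": (1, 1)}      # exactly one author
ap_two = {"author": (2, None)}   # at least two authors

book = {"author": ["A. Writer"]}
anonymous = {"author": []}

print(conforms(book, ap_one), conforms(book, base_spec))            # conforms to both
print(conforms(anonymous, base_spec), conforms(anonymous, ap_one))  # base spec only
print(conforms(book, ap_two))                                       # the two APs conflict
```

Both APs only narrow what the base spec allows, so anything that passes an AP also passes the base spec, yet no record can satisfy both APs at once.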

analice1pt commented 5 years ago

@kcoyle , I agree with your list, but if I understood this use case well, we need more. We need to be able to indicate that only specific parts of controlled vocabularies may be used as values. For instance, we may select as the range of a property one or two branches of a controlled vocabulary. @philbarker was this what you meant, or am I extrapolating? We should make a glossary, because when people use the word vocabulary I don't know if they mean a vocabulary of metadata elements / properties or a controlled vocabulary of possible values. I now use metadata schema for a vocabulary of properties and vocabulary/controlled vocabulary for a vocabulary of values.

philbarker commented 5 years ago

@analice1pt you are extrapolating from what I meant, but maybe in a useful way.

I sympathise with your point about 'vocabulary', but in some ways (at least from an rdf point of view) terms that represent classes and properties are similar to terms that represent items in a concept scheme. Can an application profile define a subset of a specification that happens to be in SKOS? Doing so could be an element of the use case I wrote, though I hadn't thought of it.

analice1pt commented 5 years ago

@philbarker , I have a problem with this phrase: "terms that represent classes and properties are similar to terms that represent items in a concept scheme", because in SKOS the terms are instances of the skos:Concept class and even the vocabulary is an instance of the skos:ConceptScheme class. I mean, the terms and the vocabularies defined in SKOS are instances of classes, not classes.

This is another issue, but then, when we use a vocabulary scheme defined in SKOS as the range of a given RDF property, we are using instances and not classes (as defined in the range of the rdfs:range property).

philbarker commented 5 years ago

@analice1pt by similar I meant that they are all names/labels of parts of a graph that are identified by URIs. Does it matter (as far as APs are concerned) if they are defined in RDFS, SKOS or OWL? As you say, in SKOS the terms are instances of the skos:Concept class, but skos:Concept is an instance of owl:Class. Of course they are not the same, but they look very similar when I visualize the graph as circles and arrows. I don't want to push the point because I don't think I am saying anything that you don't know, and because I agree with your original statement about the need to distinguish "vocabulary meaning vocabulary of metadata elements / properties or controlled vocabularies of possible values", even if we say that APs can cover both.

analice1pt commented 5 years ago

@philbarker , I now understand what you mean. Thank you for the clarification.

In any case, when we decide what an AP is and when we decide how to express whatever we want to express and whatever we name it, I think it will be important to decide if we will define ranges or acceptable values for properties, because formally they are not the same.

tombaker commented 5 years ago

@analice1pt

We need to be able to indicate that only specific parts of controlled vocabularies may be used as values.

In a practical sense, I see two ways to do that: by specifying a list of URIs (or labels!) of terms from a controlled vocabulary in the AP. Or by creating a document which lists those terms, perhaps giving that document a name, and pointing to the named subset by URI. In the former sense, the selection would be within the scope of the AP, and in the latter sense it would be outside of the AP. This would also be the case for subsets of a SKOS concept scheme.
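The two options can be sketched side by side. Everything in this example is an assumption made for illustration: the constraint keys `values` and `subset`, the `NAMED_SUBSETS` registry standing in for documents published outside the AP, and all example.org URIs.

```python
# Hypothetical registry standing in for subset documents published outside the AP.
NAMED_SUBSETS = {
    "https://example.org/subsets/school-subjects": {
        "https://example.org/subjects/maths",
        "https://example.org/subjects/physics",
    },
}


def allowed_values(constraint: dict) -> set:
    """Resolve a value constraint to the set of permitted term URIs."""
    if "values" in constraint:   # option 1: terms listed inside the AP itself
        return set(constraint["values"])
    if "subset" in constraint:   # option 2: pointer to a named subset by URI
        return NAMED_SUBSETS[constraint["subset"]]
    raise ValueError("constraint has neither 'values' nor 'subset'")


inline = {
    "values": [
        "https://example.org/subjects/maths",
        "https://example.org/subjects/physics",
    ]
}
by_reference = {"subset": "https://example.org/subsets/school-subjects"}

# Both constraints resolve to the same terms; they differ in where the list is maintained.
print(allowed_values(inline) == allowed_values(by_reference))
```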

kcoyle commented 5 years ago

Note that this was a requirement in the shapes working group - that one can specify a list of URI "fragments" (e.g. the domain name plus whatever levels are appropriate) from within which a value must come. It has been baked into SHACL and I believe also ShEx. This would allow one to use terms from those very long term lists, like Library of Congress Subject Headings, or geonames. Just saying "must be a skos concept" would not be specific enough, yet making a copy of the list would be prohibitive.
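The idea of a URI "fragment" constraint can be approximated with a simple prefix test. This is a sketch of the concept only, not of how SHACL or ShEx implement it; the function name `within_stem` is hypothetical, the Library of Congress stem is given as I understand that namespace, and the identifier `sh00000000` is made up.

```python
def within_stem(value: str, stems: list) -> bool:
    """True if the value IRI begins with any of the permitted stems."""
    return any(value.startswith(stem) for stem in stems)


# Assumed stem for Library of Congress Subject Headings; the ID below is invented.
stems = ["http://id.loc.gov/authorities/subjects/"]

print(within_stem("http://id.loc.gov/authorities/subjects/sh00000000", stems))
print(within_stem("http://example.org/my-own-subject", stems))
```

A test like this checks only the shape of the IRI, not whether the term exists in the list, which is what makes it workable for very long vocabularies.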

@analice1pt If you are creating a profile that is specifically to support RDF metadata, then I see no reason why both ranges and values could not be defined, although they would have to be compatible. My personal experience is that ranges are often general ("literal", "URI") while I would expect value constraints to be needed for more specific situations, such as when the values must come from a particular list. Also, I don't know of a way to define values in RDF that are limited to specific classes using anything except ranges. So if you want to say that in your profile "dct:creator" has to have as its value an entity of class "myProf:Person" then you need to use a range.

Note that in my thinking about the vocabulary I was not assuming an RDF profile, so I haven't included domains and ranges for the properties. It is possible that we would need an RDF-specific version of our vocabulary... I haven't thought that far, though.

analice1pt commented 5 years ago

@tombaker , I was thinking about using a URI to a term and, somehow (we define how), express that all terms "under" that term, or directly related to that term, may be used as values. I mean, we do not need to be constrained by what we can currently do in RDF.

tombaker commented 5 years ago

@analice1pt I can see how a profile might want to specify a statement shape that uses, say, "dc:creator or any subproperty thereof" or "any property in the dcterms: namespace, together with a value URI" - I think this is what you mean?

Such a statement shape could not itself be expressed in RDF, but for the former, one could use an RDFS schema with a reasoner to infer the set of allowable properties, while the latter could be handled by ShEx. While these ideas may only be expressible in the profile by using natural language (not RDF), RDF data based on those profiles could nonetheless be tested for conformance to those shapes using languages such as ShEx.
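For the "any subproperty thereof" case, what a reasoner would infer amounts to a transitive closure over rdfs:subPropertyOf. The sketch below computes that closure in plain Python rather than with an actual reasoner; the `ex:` properties and their hierarchy are hypothetical.

```python
def subproperties(prop: str, sub_property_of: dict) -> set:
    """Return prop plus every property that is transitively a subproperty of it."""
    found = {prop}
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for child, parents in sub_property_of.items():
            if child not in found and found & set(parents):
                found.add(child)
                changed = True
    return found


# Hypothetical schema fragment: property -> its declared superproperties.
schema = {
    "ex:illustrator": ["dc:creator"],
    "ex:coverIllustrator": ["ex:illustrator"],
    "ex:publisherName": ["dc:publisher"],
}

allowed = subproperties("dc:creator", schema)
print(sorted(allowed))
```

The resulting set could then be handed to a validator as an explicit list of allowable properties.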

I think we need to ask whether such things, which may be easy to express in words, are within scope of a simple profile language. I personally think we should focus on covering the simple use cases while acknowledging that some things that are easy to express in words may require an expressive conformance language such as ShEx to express in a formal, actionable sense. In other words, I think we should set some clear limits and aim for a simple, "core" language, and leave it to other, more expressive but also harder-to-use languages to implement those ideas in practice.

analice1pt commented 5 years ago

"'dc:creator or any subproperty thereof' or 'any property in the dcterms: namespace, together with a value URI' - I think this is what you mean?"

No no no. I was meaning URI to a term in a SKOS vocabulary, meaning that that branch of terms could be used as values. But the case you propose is very nice.

@tombaker , I understand and agree with what you say about simple use cases. However, in the design process we need something that comes earlier than ShEx. Then its result may be refined in ShEx or equivalent, but we will need to express some things about our data before we arrive at the ShEx phase, and some of those things are not addressed by simple use cases.

kcoyle commented 5 years ago

@analice1pt Can you mock up an example of what you mean? Because I'm not sure what a "branch of terms" would be, from a single SKOS term. Thanks.

analice1pt commented 5 years ago

@kcoyle, in this example I am thinking about hierarchical structures. See picture, please. [image: IMG_20190510_100918_resized_20190510_100952842]

kcoyle commented 5 years ago

@analice1pt So it looks like you are saying that you want to say that what is "valid" in your profile is a SKOS concept and any of the narrower SKOS terms related to that concept. Maybe @tombaker can weigh in as our resident SKOS expert. I don't think I have seen anything like this implemented anywhere, and the SKOS IRIs don't reflect the conceptual structure, AFAIK. They also are not subclasses, so standard inferencing wouldn't work. But let's hear from Tom!

analice1pt commented 5 years ago

@kcoyle / @tombaker , when I propose this, I am not thinking about how it can be currently implemented. I am only thinking in abstract terms and hoping it can be implemented now or later. I think we should not be constrained by the current state of the art in technologies / languages that may or may not implement what we would like to implement. I think we can, and probably should, drive future implementations.

kcoyle commented 5 years ago

@analice1pt That's fine, but a bit more information on what you are thinking would still help. We have an example where values must come from the same namespace. What isn't clear to me is if your example is about classes/subclasses or some other relationship. I doubt if we can come up with requirements without more information. Also, I think this is a new use case.

analice1pt commented 5 years ago

Two slightly more complex examples of what I think we could address.

[image: 20190517UnionControlledVocabularies]

philbarker commented 5 years ago

@analice1pt Yes, those are use cases that resonate with me too. I think that when you get to example 3, probably the solution would be to declare your own vocabulary V(own) = {e, e1, e2, D, E, E1, E2}, then maybe you have two application profiles, because V(own) is an application profile of V3 and V4, which is referenced by or a part of the application profile of schema.org.

analice1pt commented 5 years ago

@philbarker , if we do Example 1 (the first example I gave, unlabeled) and example 2, it is easy to do example 3, I think. But, yes, I agree that we may have other solutions. What I think is that the AP should, or at least could, represent several design options.

Needless to say that my position is intimately related to my view that an AP is a model of the data we are dealing with. A Linked Data model.

kcoyle commented 5 years ago

I have been assuming that different data designs would be expressed as different APs. You would apply the AP that matches that data. This requires that a single AP know about other APs. I think we should add "mapping from one AP to another" to our requirements and then see if we can fulfill that or if that's a separate step. My gut feeling is that this is not going to be in our first version, because we don't yet know what an AP looks like.

analice1pt commented 5 years ago

I think I was not clear when I said that "the AP should, or at least could, represent several design options." I should have said that I think that our notation or whatever we invent/draw/design in this IG should allow an AP designer/creator to express her/his design options. I did not intend to say that the AP itself should express different alternatives, but I agree that is what can be read from my words.

tombaker commented 5 years ago

@analice1pt @kcoyle

So it looks like you are saying that you want to say that what is "valid" in your profile is a SKOS concept and any of the narrower SKOS terms related to that concept. Maybe @tombaker can weigh in as our resident SKOS expert. I don't think I have seen anything like this implemented anywhere, and the SKOS IRIs don't reflect the conceptual structure, AFAIK. They also are not subclasses, so standard inferencing wouldn't work. But let's hear from Tom!

It is true that skos:narrower and skos:broader are not transitive in SKOS. However, skos:broader is (somewhat counterintuitively) declared as a sub-property of skos:broaderTransitive. See this example.

Bottom line: SKOS B/N relations can be treated as transitive, so one could infer a sub-tree of a SKOS concept scheme holding all concepts (transitively) narrower than a given concept.
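Inferring such a sub-tree is a plain graph traversal once the broader links are treated as transitive. The sketch below works on a list of (concept, broader concept) pairs instead of real RDF; the function name `narrower_subtree` and the `ex:` concepts are hypothetical.

```python
from collections import defaultdict


def narrower_subtree(root: str, broader: list) -> set:
    """All concepts transitively narrower than root, given (concept, broader) pairs."""
    children = defaultdict(set)
    for concept, parent in broader:
        children[parent].add(concept)
    seen, stack = set(), [root]
    while stack:  # depth-first walk down the hierarchy
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Hypothetical concept scheme expressed as skos:broader statements.
pairs = [
    ("ex:algebra", "ex:maths"),
    ("ex:linearAlgebra", "ex:algebra"),
    ("ex:physics", "ex:science"),
]

print(sorted(narrower_subtree("ex:maths", pairs)))
```

The result excludes the starting concept itself; a profile that wants "this concept and everything under it" would add the root to the set.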

tombaker commented 5 years ago

@analice1pt

I think that our notation or whatever we invent/draw/design in this IG should allow an AP designer/creator to express her/his design options.

If we distinguish:

What the designer wants to express.
What the simplified, readable notation that results from this IG can express.
What a more sophisticated but harder-to-use language such as ShEx can express.

I would want the simple notation (2) to cover the most common use cases but would not expect it to express everything that a designer might want to express (1). I would expect the more sophisticated language (3) to be capable of expressing most of the things not covered by (2); however, I'd also expect there to be cases, involving complex judgement calls by humans for example, that would be impractical or impossible to encode even using the more sophisticated schema language (3) and would require natural language.

analice1pt commented 5 years ago

@tombaker , I agree. But there are two things that I think are important: 1 - We should not be constrained by current possible implementations. We can, and probably should, drive change. 2 - An AP may be used for validation, but that is not the only reason why an AP is developed. It may be developed to support the design process, so that the application developer knows what data the application will need to handle and how it will be encoded.
