Consequences of the non-repeatability of value constraint types

dcmi / dctap

DC Tabular Application Profile

https://dcmi.github.io/dctap/


Consequences of the non-repeatability of value constraint types #86

Open tombaker opened 2 years ago

tombaker commented 2 years ago

Trying to implement minLength/maxLength and minInclusive/maxInclusive clarified for me a vague unease I had felt about adding these to the core model. Taking the types in the DCTAP Primer one-by-one:

valueConstraintType picklist: The valueNodeType is "literal", and the valueConstraint is a list of alternative literals.
valueConstraintType IRIstem: The valueNodeType is "IRI", and the valueConstraint is to be interpreted as a base IRI.
valueConstraintType pattern: The valueNodeType is "literal", and the valueConstraint is a regular expression to match that literal.
valueConstraintType languageTag: The valueNodeType is "literal", and the valueConstraint is a language code (or list of alternative codes) used to tag that literal. (Of course, a single literal can only be tagged with one single language code.)

Because each DCTAP instance is a flat grid, one can have only one valueConstraint/valueConstraintType pair per Statement Template. The types listed above are more or less mutually exclusive. Perhaps one could come up with edge cases, but I do not see any obvious ways that one would normally want to combine "pattern" and "picklist", or "picklist" and "languageTag", etc.
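To make the one-pair-per-row point concrete, here is a minimal sketch of how an application might dispatch on a single valueConstraintType. The function name `check_value`, the comma as list separator, and the unanchored regex search are assumptions made for illustration; none of them comes from the Primer or from dctap-python.

```python
import re


def check_value(value, constraint, constraint_type, lang=None):
    """Return True if `value` satisfies one valueConstraint of the given type.

    Hypothetical helper for illustration only; not part of dctap-python.
    """
    if constraint_type == "picklist":
        # Assumption: alternatives are separated by commas.
        return value in [item.strip() for item in constraint.split(",")]
    if constraint_type == "IRIstem":
        # The constraint is read as a base IRI that the value must start with.
        return value.startswith(constraint)
    if constraint_type == "pattern":
        # Assumption: unanchored search; an application could require a full match.
        return re.search(constraint, value) is not None
    if constraint_type == "languageTag":
        # A literal carries one tag; the constraint may list several alternatives.
        tags = [tag.strip().lstrip("@") for tag in constraint.split(",")]
        return lang in tags
    raise ValueError(f"unknown valueConstraintType: {constraint_type}")
```

Because the function receives exactly one constraint and one type, a row with "pattern" cannot also carry a "picklist", which is the limitation described above.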

However, with minLength/maxLength or minInclusive/maxInclusive, one would quite often want to use both:

"Age of students ranges from 13 to 19."
"Identifiers must have between 5 and 12 characters".

In order to allow both facets (of a pair of facets) to be specified in a single Statement Template, one would need to implement them as separate columns.

Alternatively, the pairs could be combined into single columns taking single value ranges, something like:

numericRange: "Age of students is '13-19'".
stringLengthRange: "Identifiers must have '5-12' characters".

However, this would imply conventions around how to parse ranges (eg, "1-10", "1..10"...) which I believe we have so far successfully avoided.
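A small sketch of what "conventions around how to parse ranges" would mean in practice. The syntax names `hyphen` and `dots` and the function `parse_range` are hypothetical; the point is only that a processor has to know which syntax a given DCTAP instance uses.

```python
import re

# Hypothetical: two of the range syntaxes mentioned above, non-negative integers only.
RANGE_SYNTAXES = {
    "hyphen": re.compile(r"^\s*(\d+)\s*-\s*(\d+)\s*$"),    # e.g. "1-10"
    "dots": re.compile(r"^\s*(\d+)\s*\.\.\s*(\d+)\s*$"),   # e.g. "1..10"
}


def parse_range(text, syntax):
    """Parse a range string into (low, high) using the named syntax."""
    match = RANGE_SYNTAXES[syntax].match(text)
    if match is None:
        raise ValueError(f"{text!r} is not a valid {syntax} range")
    low, high = int(match.group(1)), int(match.group(2))
    if low > high:
        raise ValueError(f"lower bound {low} exceeds upper bound {high}")
    return low, high
```

Even this toy version has to decide things the core model has so far left alone: the hyphen syntax, for instance, cannot express negative bounds without further rules.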

I also note that, for the sake of clarity, we chose not to combine the two columns mandatory and repeatable into one single column, which could in principle have taken a small set of range values such as: "0..1" (not mandatory, not repeatable), "1..1" (mandatory, not repeatable), "0..n" (not mandatory, repeatable), "1..n" (mandatory, repeatable).
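For comparison, the correspondence between the four cardinality values and the two boolean columns can be written as a lookup table. This is a sketch only; the names `CARDINALITY` and `split_cardinality` are hypothetical.

```python
# Each cardinality string maps to (mandatory, repeatable), as listed above.
CARDINALITY = {
    "0..1": (False, False),
    "1..1": (True, False),
    "0..n": (False, True),
    "1..n": (True, True),
}


def split_cardinality(value):
    """Translate a combined cardinality value into the two DCTAP columns."""
    mandatory, repeatable = CARDINALITY[value]
    return {"mandatory": mandatory, "repeatable": repeatable}
```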

tombaker commented 2 years ago

PROPOSAL:

Given that:

permissible (or expected) lengths of strings and values of numbers are frequently (even normally) expressed as ranges, and
ranges are often expressed in spreadsheets with two columns (one for minimum and one for maximum), but that
some implementers may prefer to express ranges as single values, and
we want to keep the core DCTAP model as simple as possible,

I propose:

that we move the pairs minLength/maxLength and minInclusive/maxInclusive from the Primer to the Cookbook
that we show each of these implemented as a pair of columns
that we point out that such pairs could be implemented, alternatively, using single columns with range values.

philbarker commented 2 years ago

@tombaker

one would quite often want to use both:

"Age of students ranges from 13 to 19."

"Identifiers must have between 5 and 12 characters".

In cases like that I have used a range of the form 15..19, with valueRange / lengthRange (I think) as the constraint types. I think it is OK (not ideal, but OK). I think the main reason for avoiding the similar approach for cardinality was to align with how non-technical people thought about cardinality -- that's not the case here.

That said, there may be cases where a similar approach won't work.

I don't like the approach of adding more columns because already the tabular format is getting very unfriendly to use: it has too many columns. My yardstick for "too many" is the number of columns you see in standards docs like DCAT. I think that we have sacrificed usability for pedantic correctness too often.

My preferred solution is:

acknowledge that TAP won't cover every case with equal ease;
if all the constraints on how to use a property cannot be expressed in one statement constraint, have more than one statementConstraint for the property.

philbarker commented 2 years ago

PS: I don't mind whether the solution to providing maxLength / maxValue etc is in the primer or the cookbook, though I think it is such a common case, and the solution is so well & widely established, that it should be in the primer.

tombaker commented 2 years ago

@philbarker My interpretation of your points, with commentaries:

The core DCTAP tabular format already has enough columns - and I agree!
You use valueRange and lengthRange as value constraint types - I could live with this, and these are better than the names I penciled into my example (above).
We have sacrificed usability for pedantic correctness - I am a bit surprised to hear this but think we can simply emphasize in the examples in the Primer that all columns (except propertyID) are optional. For example, specifying a valueDataType (eg, "xsd:string") will in many cases be unnecessary.
If all constraints on a property cannot be expressed in one statement template, use more than one statement template - I do not believe the Primer says how two statement templates on the same property are meant to be interpreted (ie, that statements must conform to either, or must conform to both), nor do I think we should necessarily go there - at least, not in the Primer.

Bottom line:

I think you are proposing that we keep maxLength, maxValue, etc in the Primer, but I would prefer your idea of valueRange / lengthRange as value constraint types because it would not require the use of multiple statement templates just to express a range, a notion that seems non-intuitive and hard to explain, not just for the authors of a DCTAP instance, but for any of its many more readers, who cannot be assumed to have read the Primer.
As to whether they belong in the core DCTAP model, I take your point that they are commonly needed so could go either way.
In either case, we would need to point out that there is more than one conventional syntax for a range (but leave it to implementers to pick their preferred syntax).

kcoyle commented 2 years ago

I thought that a solution was offered that used two rows:

age.ex / minValue / 5
age.ex / maxValue / 18

With this kind of constraint I would assume they would commonly be mandatory and not repeatable. I read this as:

age.ex MUST be minValue 5
age.ex MUST be maxValue 18

If not mandatory (and not repeatable), the rule is essentially the same, but conditional on the existence of the property:

IF age.ex then MUST be minValue 5
IF age.ex then MUST be maxValue 18

This is an instance where making the property repeatable would require a shape (and sounds illogical to me, but there may be a use case).

That said, if someone wants to use ranges in the valueConstraint we could give examples in the Cookbook.
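The two-row reading can be sketched as follows, assuming each row is checked on its own and a value must pass every row that mentions the property. Treating minValue and maxValue as inclusive bounds is an assumption, as are the helper names `row_passes` and `conforms`.

```python
# The two rows from the example above, one constraint per row.
rows = [
    {"propertyID": "age.ex", "valueConstraintType": "minValue", "valueConstraint": "5"},
    {"propertyID": "age.ex", "valueConstraintType": "maxValue", "valueConstraint": "18"},
]


def row_passes(row, value):
    """Check one value against one row; bounds are assumed to be inclusive."""
    limit = float(row["valueConstraint"])
    if row["valueConstraintType"] == "minValue":
        return value >= limit
    if row["valueConstraintType"] == "maxValue":
        return value <= limit
    raise ValueError(f"unsupported type: {row['valueConstraintType']}")


def conforms(rows, property_id, value):
    """A value conforms only if every row for the property accepts it."""
    relevant = [row for row in rows if row["propertyID"] == property_id]
    return all(row_passes(row, value) for row in relevant)
```

With these rows, `conforms(rows, "age.ex", 12)` is true, while 3 fails the minValue row and 19 fails the maxValue row.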

tombaker commented 2 years ago

it would not require the use of multiple statement templates just to express a range, a notion that seems non-intuitive and hard to explain, not just for the authors of a DCTAP instance, but for any of its many more readers, who cannot be assumed to have read the Primer.

Put more strongly: having two statement templates about the same property within a shape is NOT a usage pattern we should recommend because it is inherently confusing. As Karen pointed out last year: "I still assume that each row is validated separately, and that an incoming triple can be valid for one statement constraint but invalid for another. The question is what a validation program is expected to do with that situation, and of course that could vary based on the application." The examples above tend to reinforce the idea that this is ambiguous.

Most application profiles I have seen have only one statement template per property (per shape), and I would assert that its commonly understood purpose is to "close" a property. Example: If a ST says that the value of dct:creator is an IRI, then a triple that uses dct:creator with a string value would not conform to the ST. The idea that one could create a detailed statement template by, in effect, "adding" two or more statement constraints together, is not intuitive. To take an extreme example, would:

shapeID	propertyID	mandatory	repeatable	valueNodeType	valueDataType	valueConstraint	valueConstraintType
subject	dct:subject	false	true	LITERAL	xsd:string	@en,@fr	languageTag

mean the same as the following?

shapeID	propertyID	mandatory	repeatable	valueNodeType	valueDataType	valueConstraint	valueConstraintType
subject	dct:subject			LITERAL
	dct:subject	false	true		xsd:string
	dct:subject					@en,@fr	languageTag

I do not think it is our role to forbid (or sanction) such interpretations. I only mean to say that I think we simply should not go there.

kcoyle commented 2 years ago

@tombaker

The primer says:

Each row is to be interpreted independently of all other rows, and a propertyID can appear on more than one row. When used for validation, all rules on a single row must be part of the validation logic.

Having separate statement templates for min and max is compatible with this. Another example is the need to say the same property can be either an IRI or a string, and there are further constraints, such as an iristem on the IRI. I have an example of that in the tutorial documents, and there is a section for that in the Cookbook. The tutorial example represents an "OR" situation, while the Cookbook example reflects the question in this issue.

For your example above, if the rule is that each row is evaluated on its own merits, then your second example has the same validation result as the first. However, because that statement template can be expressed on a single row the separation into rows is unnecessary. In the example in the tutorial documents, a single row cannot be used:

propertyID	propertyLabel	valueNodeType	valueDataType	mandatory	repeatable	valueConstraint	valueConstraintType
dct:creator	Author	IRI		FALSE	TRUE	http://id.loc.gov/authorities	iriStem
dct:creator	Author	literal	xsd:string	FALSE	TRUE

This was presented in #76 (see Phil's comment) and in the meeting of August 4 we agreed to this solution. It is included in the Cookbook but not in the Primer, as per Phil's comment.

Because I don't see a standard way to indicate ranges in the XSD datatype documentation, I think we could add ranges to the Cookbook. It would be great to find some previously defined range element that we are comfortable with to suggest there.

tombaker commented 2 years ago

@kcoyle

Each row is to be interpreted independently of all other rows, and a propertyID can appear on more than one row.

This did not raise any red flags when I read it a while ago, perhaps because a given propertyID can appear in more than one shape and this does not actually say that propertyIDs can appear on more than one row within a given shape. I was wondering how I had somehow missed this decision and see that it was taken on August 4, when I was on break. (And I see that @johnhuck was also not on the call.)

I do not think the idea that a propertyID can appear on multiple rows should be part of the Primer. I think it is fine to present this pattern as one possible interpretation of a DCTAP in the Cookbook or even in the Tutorial, but not as part of the core model as presented in the Primer.

The Primer presents the minimum model that needs to be supported by developers creating applications based on DCTAP (eg, dctap-python).

In my examples above, the STs do not contradict each other and, as you point out, they could be expressed in a single ST. In the tutorial example, however, the two STs do in fact contradict each other and could therefore not be merged. Any statement using dct:creator would fail to match at least one of these two STs.
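A sketch of why the two dct:creator rows cannot both be satisfied by the same value. The helper names `matches` and `results` are hypothetical, and the sample IRI is made up for illustration.

```python
# The two statement templates from the tutorial example, reduced to the
# columns that matter here.
rows = [
    {
        "propertyID": "dct:creator",
        "valueNodeType": "IRI",
        "valueConstraint": "http://id.loc.gov/authorities",
        "valueConstraintType": "iriStem",
    },
    {"propertyID": "dct:creator", "valueNodeType": "literal"},
]


def matches(row, node_type, value):
    """Check one value against one row, independently of all other rows."""
    if row["valueNodeType"].lower() != node_type.lower():
        return False
    if row.get("valueConstraintType") == "iriStem":
        return value.startswith(row["valueConstraint"])
    return True


def results(rows, node_type, value):
    """Return the per-row verdicts for one value."""
    return [matches(row, node_type, value) for row in rows]
```

For an IRI value such as the made-up `http://id.loc.gov/authorities/names/example`, the verdicts are `[True, False]`; for the literal "Jane Doe" they are `[False, True]`. Whether that counts as conformance depends on whether the application combines the verdicts with `any` or with `all`, which is exactly the choice the Primer does not make.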

This may be a sensible way to use DCTAP in applications that have special logic for handling such situations, but we should avoid raising the bar for applications that simply want to implement the most common model for an application profile, in which each property is associated with just one set of constraints.

It is worth mentioning that the commonly understood model is also the default for ShEx. In ShEx, if a shape contains a triple constraint with a given predicate, it is said to "mention" the given predicate. By default, a triple constraint "closes" the mentioned predicate, which means that for a given shape, every outgoing arc using the mentioned predicate must match a triple constraint in the shape. (This would not be the case in your example because any statement with dct:creator would fail to match at least one of the STs.) In ShEx, one can modify this behavior by tagging the shape with EXTRA dct:creator, which means that the shape could accept any number of dct:creator arcs using different triple constraints. But this is not the default - very sensibly, in my opinion.

To be clear, I do not want to forbid this interpretation of multiple statement constraints on a mentioned predicate within a shape. I can see how this would be a useful workaround in some situations, and I think it is fine to include examples in the Cookbook. I believe I have consistently argued that the semantics of DCTAP should be defined weakly enough to accommodate multiple interpretations.

Bottom line: I would be happy if this notion were simply removed from the Primer itself.

tombaker commented 2 years ago

@kcoyle In my reading, you were making more or less the same point as I make above with regard to applications:

I still assume that each row is validated separately, and that an incoming triple can be valid for one statement constraint but invalid for another. The question is what a validation program is expected to do with that situation, and of course that could vary based on the application.

I note that the issue is still open...

philbarker commented 2 years ago

Back to https://github.com/dcmi/dctap/issues/86#issuecomment-1305721279 :

I think you are proposing that we keep maxLength, maxValue, etc in the Primer, but I would prefer your idea of valueRange / lengthRange as value constraint types because it would not require the use of multiple statement templates just to express a range

The problem with valueRange / lengthRange is that it's doubly tricky to deal with whether the limits are inclusive or not. It also means that you have to choose a lower bound just to express a maximum permitted value. For example, to express a non-inclusive maximum of 18 for a non-integer value that can be 0, you also have to state an inclusive lower bound of 0. This leads to notations like [0;18), where the choice of bracket indicates whether the bound is inclusive. If the lower bound is unspecified, you need a convention for indicating that, e.g. [;18]
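To show how much convention such a notation carries, here is a minimal sketch of a parser for it, assuming the semicolon separator and bracket meanings used in the examples above. The names `Interval` and `parse_interval` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# "[" or "]" marks an inclusive bound, "(" or ")" a non-inclusive one;
# an empty side means the bound is unspecified.
INTERVAL = re.compile(r"^([\[\(])\s*([^;]*?)\s*;\s*([^;]*?)\s*([\]\)])$")


@dataclass
class Interval:
    low: Optional[float]
    high: Optional[float]
    low_inclusive: bool
    high_inclusive: bool

    def contains(self, value):
        """Return True if value lies within the interval."""
        if self.low is not None:
            if value < self.low or (value == self.low and not self.low_inclusive):
                return False
        if self.high is not None:
            if value > self.high or (value == self.high and not self.high_inclusive):
                return False
        return True


def parse_interval(text):
    """Parse strings such as "[0;18)" or "[;18]" into an Interval."""
    match = INTERVAL.match(text.strip())
    if match is None:
        raise ValueError(f"{text!r} is not a recognised interval")
    opening, low, high, closing = match.groups()
    return Interval(
        low=float(low) if low else None,
        high=float(high) if high else None,
        low_inclusive=opening == "[",
        high_inclusive=closing == "]",
    )
```

With this sketch, `parse_interval("[0;18)")` accepts 0 and 17.9 but rejects 18, while `parse_interval("[;18]")` accepts 18 and has no lower limit.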

kcoyle commented 2 years ago

I note that the issue is still open...

I'm trying to figure out why I marked that it should be closed but didn't close it. I think it was just a clicking error. I will close.

I'm not sure that I understand your objections, Tom. We've had this situation in our documentation for a while, and I believe that my point in #51 is the opposite of what you are proposing. I see multiple statement templates as a solution to some common use cases, like "IRI or literal". We worked out how multiple property/statement templates within a shape would be evaluated, and I actually think it's quite elegant. Phil has given us a number of examples. The Google Doc lays out pretty clearly how it would work.

That said, examples like this are only in the Cookbook. The primer has that one sentence. We probably need a link from the Primer to the Cookbook in the valueConstraintType area but the Cookbook isn't ready for that yet - it is incomplete and would need better formatting. Can we assume that such a link would be made from that Primer section to the Cookbook, using a note something like "See section x of the Cookbook for when both a minimum and a maximum are needed."?