dcmi / dctap

DC Tabular Application Profile
https://dcmi.github.io/dctap/

Consequences of the non-repeatability of value constraint types #86

Open tombaker opened 2 years ago

tombaker commented 2 years ago

Trying to implement minLength/maxLength and minInclusive/maxInclusive clarified for me a vague unease I had felt about adding these to the core model. Taking the types in the DCTAP Primer one-by-one:

Because each DCTAP instance is a flat grid, one can only have one valueConstraint/valueConstraintType pair per Statement Template. The types listed above are more or less mutually exclusive. Perhaps one could come up with edge cases, but I do not see any obvious ways that one would normally want to combine "pattern" and "picklist", or "picklist" and "language tag", etc.

However, with minLength/maxLength or minInclusive/maxInclusive, one would quite often want to use both:

  • "Age of students ranges from 13 to 19."
  • "Identifiers must have between 5 and 12 characters".

In order to allow both facets (of a pair of facets) to be specified in a single Statement Template, one would need to implement them as separate columns.

Alternatively, the pairs could be combined into single columns taking single value ranges, something like:

However, this would imply conventions around how to parse ranges (eg, "1-10", "1..10"...) which I believe we have so far successfully avoided.
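To make concrete what "conventions around how to parse ranges" would involve, here is a minimal sketch of a parser for one possible convention, "MIN..MAX" with either side optional. The convention, the function name `parse_range`, and the decision to reject "1-10" (where the hyphen could be confused with a minus sign) are assumptions for illustration only, not anything defined by DCTAP.

```python
import re

# Hypothetical convention: "MIN..MAX", either bound optional ("13..19", "..12", "5..").
RANGE_RE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)?\s*\.\.\s*(-?\d+(?:\.\d+)?)?\s*$")


def parse_range(text):
    """Return (minimum, maximum) as floats, with None for an omitted bound."""
    match = RANGE_RE.match(text)
    if match is None or match.groups() == (None, None):
        raise ValueError(f"not a range under the '..' convention: {text!r}")
    low, high = (float(g) if g is not None else None for g in match.groups())
    if low is not None and high is not None and low > high:
        raise ValueError(f"minimum exceeds maximum: {text!r}")
    return low, high


print(parse_range("13..19"))  # both bounds given
print(parse_range("..12"))    # maximum only
```

Even this small sketch has to settle several questions (separator, open-ended bounds, negative numbers, inclusivity), which is the kind of convention the comment above is wary of introducing.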

I also note that, for the sake of clarity, we chose not to combine the two columns mandatory and repeatable into one single column, which could in principle have taken a small set of range values such as: "0..1" (not mandatory, not repeatable), "1..1" (mandatory, not repeatable), "0..n" (not mandatory, repeatable), "1..n" (mandatory, repeatable).
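The correspondence between the four range values and the two existing columns can be sketched as a simple lookup. The names `CARDINALITY` and `split_cardinality` are hypothetical; this only illustrates the mapping described above and is not part of any DCTAP tool.

```python
# The four combined-column values and the (mandatory, repeatable) pair each stands for.
CARDINALITY = {
    "0..1": (False, False),
    "1..1": (True, False),
    "0..n": (False, True),
    "1..n": (True, True),
}


def split_cardinality(value):
    """Translate a combined range value into the two separate column values."""
    mandatory, repeatable = CARDINALITY[value.strip()]
    return {"mandatory": mandatory, "repeatable": repeatable}


print(split_cardinality("0..n"))
```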

tombaker commented 2 years ago

PROPOSAL:

Given that:

I propose:

philbarker commented 2 years ago

@tombaker

one would quite often want to use both:

  • "Age of students ranges from 13 to 19."
  • "Identifiers must have between 5 and 12 characters".

In cases like that I have used a range of the form 15..19, with valueRange / lengthRange (I think) as the constraintTypes. I think it is OK (not ideal, but OK). I think the main reason for avoiding the similar approach for cardinality was to align with how non-technical people thought about cardinality -- that's not the case here.

That said, there may be cases where a similar approach won't work.

I don't like the approach of adding more columns because already the tabular format is getting very unfriendly to use: it has too many columns. My yardstick for "too many" is the number of columns you see in standards docs like DCAT. I think that we have sacrificed usability for pedantic correctness too often.

My preferred solution is:

  1. acknowledge that TAP won't cover every case with equal ease;
  2. if all the constraints on how to use a property cannot be expressed in one statement constraint, have more than one statementConstraint for the property.

philbarker commented 2 years ago

PS: I don't mind whether the solution to providing maxLength / maxValue etc is in the primer or the cookbook, though I think it is such a common case, and the solution is so well & widely established, that it should be in the primer.

tombaker commented 2 years ago

@philbarker My interpretation of your points, with commentaries:

Bottom line:

kcoyle commented 2 years ago

I thought that a solution was offered that used two rows:

age.ex / minValue / 5
age.ex / maxValue / 18

With this kind of constraint I would assume they would commonly be mandatory and not repeatable. I read this as:

If not mandatory (and not repeatable), the rule is essentially the same, but conditional on the existence of the property:

This is an instance where making the property repeatable would require a shape (and sounds illogical to me, but there may be a use case).

That said, if someone wants to use ranges in the valueConstraint we could give examples in the Cookbook.
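A minimal sketch of how the two-row approach might be checked, assuming that each row is evaluated on its own, that both bounds are inclusive, and that a value must satisfy every row for the property. The dictionary layout and the function names `row_accepts` and `value_in_range` are hypothetical; minValue and maxValue are simply the labels used in the rows above.

```python
# The two rows from the example above, as a CSV reader might deliver them.
ROWS = [
    {"propertyID": "age.ex", "valueConstraint": "5", "valueConstraintType": "minValue"},
    {"propertyID": "age.ex", "valueConstraint": "18", "valueConstraintType": "maxValue"},
]


def row_accepts(row, value):
    """Check one row in isolation (bounds assumed inclusive)."""
    limit = float(row["valueConstraint"])
    kind = row["valueConstraintType"]
    if kind == "minValue":
        return value >= limit
    if kind == "maxValue":
        return value <= limit
    raise ValueError(f"unsupported constraint type: {kind!r}")


def value_in_range(rows, property_id, value):
    """A value passes only if every row for the property accepts it."""
    relevant = [row for row in rows if row["propertyID"] == property_id]
    return all(row_accepts(row, value) for row in relevant)


print(value_in_range(ROWS, "age.ex", 10))
print(value_in_range(ROWS, "age.ex", 19))
```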

tombaker commented 2 years ago

it would not require the use of multiple statement templates just to express a range, a notion that seems non-intuitive and hard to explain, not just for the authors of a DCTAP instance, but for any of its many more readers, who cannot be assumed to have read the Primer.

Put more strongly: having two statement templates about the same property within a shape is NOT a usage pattern we should recommend because it is inherently confusing. As Karen pointed out last year: "I still assume that each row is validated separately, and that an incoming triple can be valid for one statement constraint but invalid for another. The question is what a validation program is expected to do with that situation, and of course that could vary based on the application." The examples above tend to reinforce the idea that this is ambiguous.

Most application profiles I have seen have only one statement template per property (per shape), and I would assert that its commonly understood purpose is to "close" a property. Example: If a ST says that the value of dct:creator is an IRI, then a triple that uses dct:creator with a string value would not conform to the ST. The idea that one could create a detailed statement template by, in effect, "adding" two or more statement constraints together, is not intuitive. To take an extreme example, would:

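| shapeID | propertyID | mandatory | repeatable | valueNodeType | valueDataType | valueConstraint | valueConstraintType |
|---------|-------------|-----------|------------|---------------|---------------|-----------------|---------------------|
| subject | dct:subject | false | true | LITERAL | xsd:string | @en,@fr | languageTag |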

mean the same as the following?

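| shapeID | propertyID | mandatory | repeatable | valueNodeType | valueDataType | valueConstraint | valueConstraintType |
|---------|-------------|-----------|------------|---------------|---------------|-----------------|---------------------|
| subject | dct:subject | | | LITERAL | | | |
| | dct:subject | false | true | | xsd:string | | |
| | dct:subject | | | | | @en,@fr | languageTag |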

I do not think it is our role to forbid (or sanction) such interpretations. I only mean to say that I think we simply should not go there.

kcoyle commented 2 years ago

@tombaker

The primer says:

Each row is to be interpreted independently of all other rows, and a propertyID can appear on more than one row. When used for validation, all rules on a single row must be part of the validation logic.

Having separate statement templates for min and max is compatible with this. Another example is the need to say the same property can be either an IRI or a string, and there are further constraints, such as an iristem on the IRI. I have an example of that in the tutorial documents, and there is a section for that in the Cookbook. The tutorial example represents an "OR" situation, while the Cookbook example reflects the question in this issue.

For your example above, if the rule is that each row is evaluated on its own merits, then your second example has the same validation result as the first. However, because that statement template can be expressed on a single row the separation into rows is unnecessary. In the example in the tutorial documents, a single row cannot be used:

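| propertyID | propertyLabel | valueNodeType | valueDataType | mandatory | repeatable | valueConstraint | valueConstraintType |
|-------------|---------------|---------------|---------------|-----------|------------|-------------------------------|---------------------|
| dct:creator | Author | IRI | | FALSE | TRUE | http://id.loc.gov/authorities | iriStem |
| dct:creator | Author | literal | xsd:string | FALSE | TRUE | | |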

This was presented in #76 (see Phil's comment) and in the meeting of August 4 we agreed to this solution. It is included in the Cookbook but not in the Primer, as per Phil's comment.

Because I don't see a standard way to indicate ranges in the XSD datatype documentation, I think we could add ranges to the Cookbook. It would be great to find some previously defined range element that we are comfortable with to suggest there.

tombaker commented 2 years ago

@kcoyle

Each row is to be interpreted independently of all other rows, and a propertyID can appear on more than one row.

This did not raise any red flags when I read it a while ago, perhaps because a given propertyID can appear in more than one shape and this does not actually say that propertyIDs can appear on more than one row within a given shape. I was wondering how I had somehow missed this decision and see that it was taken on August 4, when I was on break. (And I see that @johnhuck was also not on the call.)

I do not think the idea that a propertyID can appear on multiple rows should be part of the Primer. I think it is fine to present this pattern as one possible interpretation of a DCTAP in the Cookbook or even in the Tutorial, but not as part of the core model as presented in the Primer.

The Primer presents the minimum model that needs to be supported by developers creating applications based on DCTAP (eg, dctap-python).

In my examples above, the STs do not contradict each other and, as you point out, they could be expressed in a single ST. In the tutorial example, however, the two STs do in fact contradict each other and could therefore not be merged. Any statement using dct:creator would fail to match at least one of these two STs.
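The difference between the two readings can be sketched as follows, under the assumption that a validator either requires a value to match every statement template for the property ("all") or at least one of them ("any"). The row dictionaries mirror the tutorial example; the functions `matches` and `conforms` and the sample IRI are hypothetical and do not reflect how any existing DCTAP tool behaves.

```python
# The two dct:creator rows from the tutorial example.
ROWS = [
    {"valueNodeType": "IRI", "valueDataType": "",
     "valueConstraint": "http://id.loc.gov/authorities", "valueConstraintType": "iriStem"},
    {"valueNodeType": "literal", "valueDataType": "xsd:string",
     "valueConstraint": "", "valueConstraintType": ""},
]


def matches(row, node_type, value):
    """Check a single value against a single row, ignoring all other rows."""
    if row["valueNodeType"].lower() != node_type.lower():
        return False
    if row["valueConstraintType"] == "iriStem":
        return value.startswith(row["valueConstraint"])
    return True


def conforms(rows, node_type, value, mode):
    """mode='all': every row must match; mode='any': one matching row suffices."""
    results = [matches(row, node_type, value) for row in rows]
    return all(results) if mode == "all" else any(results)


iri = "http://id.loc.gov/authorities/names/example"  # made-up identifier
print(conforms(ROWS, "IRI", iri, "all"))  # fails the literal row
print(conforms(ROWS, "IRI", iri, "any"))  # matches the IRI row
```

Under the "all" reading no value can conform, because an IRI fails the literal row and a literal fails the IRI row; under the "any" reading both kinds of value are accepted.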

This may be a sensible way to use DCTAP in applications that have special logic for handling such situations, but we should avoid raising the bar for applications that simply want to implement the most common model for an application profile, in which each property is associated with just one set of constraints.

It is worth mentioning that the commonly understood model is also the default for ShEx. In ShEx, if a shape contains a triple constraint with a given predicate, it is said to "mention" the given predicate. By default, a triple constraint "closes" the mentioned predicate, which means that for a given shape, every outgoing arc using the mentioned predicate must match a triple constraint in the shape. (This would not be the case in your example because any statement with dct:creator would fail to match at least one of the STs.) In ShEx, one can modify this behavior by tagging the shape with EXTRA dct:creator, which means that the shape could accept any number of dct:creator arcs using different triple constraints. But this is not the default - very sensibly, in my opinion.

To be clear, I do not want to forbid this interpretation of multiple statement constraints on a mentioned predicate within a shape. I can see how this would be a useful workaround in some situations, and I think it is fine to include examples in the Cookbook. I believe I have consistently argued that the semantics of DCTAP should be defined weakly enough to accommodate multiple interpretations.

Bottom line: I would be happy if this notion were simply removed from the Primer itself.

tombaker commented 2 years ago

@kcoyle In my reading, you were making more or less the same point as I make above with regard to applications:

I still assume that each row is validated separately, and that an incoming triple can be valid for one statement constraint but invalid for another. The question is what a validation program is expected to do with that situation, and of course that could vary based on the application.

I note that the issue is still open...

philbarker commented 2 years ago

Back to https://github.com/dcmi/dctap/issues/86#issuecomment-1305721279 :

I think you are proposing that we keep maxLength, maxValue, etc in the Primer, but I would prefer your idea of valueRange / lengthRange as value constraint types because it would not require the use of multiple statement templates just to express a range

The problem with valueRange / lengthRange is that it's doubly tricky to deal with whether the limits are inclusive or not. It also means that you have to choose a lower bound just to express a maximum permitted value. For a non-integer, non-inclusive maximum of 18 on a value that can be 0, the lower bound has to be stated as inclusive of 0. This leads to notations like [0;18), where the choice of bracket indicates whether the bound is inclusive. If the lower bound is unspecified, you need a convention for how to indicate that, e.g. [;18]
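To illustrate how much convention such a notation carries, here is a minimal sketch of a parser for the bracketed form, assuming a semicolon separator, square brackets for inclusive bounds, round brackets for exclusive bounds, and an empty side for an unspecified bound. The names `parse_interval` and `contains` are hypothetical.

```python
import re

# Opening bracket, optional lower bound, ";", optional upper bound, closing bracket.
INTERVAL_RE = re.compile(r"^([\[(])\s*([^;\s]*)\s*;\s*([^;\s]*)\s*([\])])$")


def parse_interval(text):
    """Parse e.g. "[0;18)" or "[;18]" into bounds plus inclusivity flags."""
    match = INTERVAL_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an interval: {text!r}")
    opening, low, high, closing = match.groups()
    return {
        "min": float(low) if low else None,
        "min_inclusive": opening == "[",
        "max": float(high) if high else None,
        "max_inclusive": closing == "]",
    }


def contains(interval, value):
    """Test a number against a parsed interval; a bound of None is unbounded."""
    low, high = interval["min"], interval["max"]
    if low is not None and (value < low or (value == low and not interval["min_inclusive"])):
        return False
    if high is not None and (value > high or (value == high and not interval["max_inclusive"])):
        return False
    return True


print(contains(parse_interval("[0;18)"), 18))  # upper bound is exclusive here
print(contains(parse_interval("[;18]"), 18))   # upper bound is inclusive here
```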

kcoyle commented 2 years ago

I note that the issue is still open...

I'm trying to figure out why I marked that it should be closed but didn't close it. I think it was just a clicking error. I will close.

I'm not sure that I understand your objections, Tom. We've had this situation in our documentation for a while, and I believe that my point in #51 is the opposite of what you are proposing. I see multiple statement templates as a solution to some common use cases, like "IRI or literal". We worked out how multiple property/statement templates within a shape would be evaluated, and I actually think it's quite elegant. Phil has given us a number of examples. The Google Doc lays out pretty clearly how it would work.

That said, examples like this are only in the Cookbook. The primer has that one sentence. We probably need a link from the Primer to the Cookbook in the valueConstraintType area but the Cookbook isn't ready for that yet - it is incomplete and would need better formatting. Can we assume that such a link would be made from that Primer section to the Cookbook, using a note something like "See section x of the Cookbook for when both a minimum and a maximum are needed."?