dcmi / dctap

DC Tabular Application Profile
https://dcmi.github.io/dctap/

Value constraints and their types #5

Closed kcoyle closed 3 years ago

kcoyle commented 3 years ago

The TAP has a column for valueType, using XML Schema literal types like xsd:string, xsd:date, and xsd:integer.

Although this indicates the type, in many cases there are additional constraints that are desired:

and the valueNodeType can indicate that the value will be an IRI, from which there may also be additional constraints:

What are the needed additional constraints? Can we develop a short(-ish) list to incorporate into the TAP?

Original discussion in DCAP repo

tombaker commented 3 years ago

@kcoyle Okay, I'll bite... How about the following as value constraint types:

Handling minimum and maximum might be trickier. One clean way would be to add two optional columns to the model:

Another, possibly messier way to handle this would be with

AFAICT, these cover the main options we have discussed to date. They would add up to six or seven, which I think is quite enough to demonstrate the basic mechanism of using valueConstraintType together with a valueConstraint. We could add one or two more - or possibly reduce the number by collapsing the variants of picklist into just one type, "Picklist". However, I do not think we should add five or ten more - at least for Version 1. If the DCTAP model gets taken up and used, we could consider expanding the list of supported types, but for now I think we should just focus on the obvious ones, the low-hanging fruit.

kcoyle commented 3 years ago

Another option is to always allow multiples, so everything is in essence a pick list, even though it may only be a pick list of one. Multiples would always be ORs. That would work for:

but not so much for regexes, so those could be an exception.

It also may not be necessary to give a constraint type for a literal. In this example from the meeting: [screenshot: Screen Shot 2021-01-04 at 10 44 56 AM]

If the valueConstraint were History Science Arts, then you may not need to say it is a LITERAL picklist, because you have presumably already designated the value as a literal with your value type of xsd:string. I also wonder about the row with a value of rdf:type - In that case the "string" in the valueConstraint column is the value of that property, and multiples could be OR.

philbarker commented 3 years ago

I also wonder about the row with a value of rdf:type - In that case the "string" in the valueConstraint column is the value of that property, and multiples could be OR.

Yes, I think that works. You can repeat the row for AND. This works nicely when you want to say that a resource must be typed as an sdo:LearningResource and some other type (such as Book, Video ...)

Also, pick lists of one item seem fine to me.
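The OR-within-a-cell, AND-across-rows idea can be sketched in Python. This is only an illustration under assumptions: the `Row` representation, the `satisfies` helper, and the type names `sdo:Book` and `sdo:Video` are hypothetical and not part of DCTAP.

```python
from typing import Iterable

# Hypothetical representation: one (propertyID, allowed values) pair per row.
# Values inside one row are alternatives (OR); repeated rows for the same
# property must all be satisfied (AND).
Row = tuple[str, frozenset[str]]


def satisfies(rows: Iterable[Row], property_id: str, actual: set[str]) -> bool:
    """True if `actual` contains at least one allowed value from every row for the property."""
    relevant = [allowed for pid, allowed in rows if pid == property_id]
    return all(actual & allowed for allowed in relevant)


rows = [
    ("rdf:type", frozenset({"sdo:LearningResource"})),   # must be a learning resource
    ("rdf:type", frozenset({"sdo:Book", "sdo:Video"})),  # and a Book or a Video
]

print(satisfies(rows, "rdf:type", {"sdo:LearningResource", "sdo:Book"}))
print(satisfies(rows, "rdf:type", {"sdo:Book"}))
```

A resource typed only as a Book fails here because the first row, a pick list of one, is not met.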

kcoyle commented 3 years ago

Proposal

Updated Feb 5 to include picklist

  1. Values in valueConstraint can be single values or a list of delimited values. The valueConstraintType defines all values in the valueConstraint cell, whether a single value or a list. Multiple values in the valueConstraint cell are processed in a logical "or" relation. Thus the string: A, B, C is processed as A or B or C
  2. The following are the pre-defined valueConstraintTypes: picklist, IRIstem, pattern (regex), languageTag.
  3. When the constraint is a list of string values (red, blue, green) the valueConstraintType is picklist.
  4. When the constraint is a single value, no valueConstraintType is used. In that case the valueConstraint is treated as a single string, regardless of any delimiter characters (such as the comma) embedded within it.
  5. The documentation will state that other types are allowed, including code snippets (e.g. ShEx statements), and it is recommended that those be given a valueConstraintType that is likely to be understood by downstream users of the profile.
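A minimal Python sketch of how a processor might read a valueConstraint cell under this proposal. The function name `parse_value_constraint` and the `LIST_TYPES` set are hypothetical; the comma delimiter is taken from the "A, B, C" example, and leaving `pattern` unsplit is an assumption that follows the earlier remark that regexes could be an exception.

```python
# Assumption: these pre-defined types hold delimited lists whose members are ORed.
LIST_TYPES = {"picklist", "IRIstem", "languageTag"}


def parse_value_constraint(value_constraint: str, constraint_type: str = "") -> list[str]:
    """Return the alternatives expressed by one valueConstraint cell."""
    if constraint_type in LIST_TYPES:
        # Delimited list: split on commas and drop surrounding whitespace.
        return [item.strip() for item in value_constraint.split(",") if item.strip()]
    # No type (or a type such as pattern): keep the cell as one string,
    # so embedded commas are not treated as delimiters.
    return [value_constraint]


print(parse_value_constraint("Smith, Jane"))
print(parse_value_constraint("History,Science,Art", "picklist"))
```

Keeping `pattern` whole matters because a regex such as `[0-9]{1,2}` contains a comma of its own.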

Single string value:

| propertyID | valueDatatype | valueConstraint | valueConstraintType |
| --- | --- | --- | --- |
| dct:subject | xsd:string | Smith, Jane | |

List of string values:

| propertyID | valueDatatype | valueConstraint | valueConstraintType |
| --- | --- | --- | --- |
| dct:subject | xsd:string | History,Science,Art | picklist |

Constraint type defined in statement constraints

| shapeID | propertyID | valueNodeType | valueConstraint | valueConstraintType |
| --- | --- | --- | --- | --- |
| author | rdf:type | IRI | foaf:Person | |

One or more IRI stems

| propertyID | valueNodeType | valueDatatype | valueConstraint | valueConstraintType |
| --- | --- | --- | --- | --- |
| dct:subject | IRI | | http://id.loc.gov, http://vocab.getty.edu | IRIstem |
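One plausible reading of IRIstem is a simple prefix test, sketched below in Python. The helper `matches_iri_stem` and the sample IRIs are hypothetical; the proposal itself does not define the matching rule.

```python
def matches_iri_stem(iri: str, stems: list[str]) -> bool:
    """True if the IRI begins with at least one of the stems (stems are ORed)."""
    return any(iri.startswith(stem) for stem in stems)


stems = ["http://id.loc.gov", "http://vocab.getty.edu"]

print(matches_iri_stem("http://id.loc.gov/authorities/subjects/sh85061212", stems))
print(matches_iri_stem("http://example.org/subjects/1", stems))
```

A bare prefix test would also accept a host such as `http://id.loc.gov.example.org`, so a profile author may prefer stems that end in a slash.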

regex

| propertyID | valueNodeType | valueDatatype | valueConstraint | valueConstraintType |
| --- | --- | --- | --- | --- |
| schema:typicalAgeRange | literal | xsd:string | /^[0-9]{1,2}-?[0-9]{0,2}$/ | pattern |

language tags

| propertyID | valueDatatype | valueConstraint | valueConstraintType |
| --- | --- | --- | --- |
| dct:subject | xsd:string | @en,@fr,@de | languageTag |

philbarker commented 3 years ago

Can we add something like: "Should implementers find that white space delimiters are not viable, other characters may be used. However there will need to be some mechanism for communicating what character is being used so that it can be recognised by software processing the TAP. Such mechanisms may not be interoperable."

Example where this might be necessary:

| propertyID | valueDatatype | valueConstraint |
| --- | --- | --- |
| dc:subject | xsd:string | English language English literature Fine art |

Actually, that seems like it would be quite common. Perhaps we ought to suggest a fall back option such as |?
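The problem is easy to see in a short Python sketch. The `split_cell` helper and its default delimiter are hypothetical; the pipe is simply the fallback character floated above.

```python
cell = "English language English literature Fine art"

# Splitting on white space breaks each multi-word value apart.
print(cell.split())


def split_cell(cell: str, delimiter: str = "|") -> list[str]:
    """Split a cell on an explicit delimiter, trimming white space around each value."""
    return [part.strip() for part in cell.split(delimiter) if part.strip()]


# With an explicit delimiter the three intended values survive intact.
print(split_cell("English language|English literature|Fine art"))
```

Software reading the profile would still need to be told which delimiter is in use, which is the interoperability concern raised in the suggested wording.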

philbarker commented 3 years ago

BTW, a better regex for schema:typicalAgeRange would be /^[0-9]{1,2}-?[0-9]{0,2}$/ (matches e.g. 7-9 and 11-, and doesn't allow things that can't be interpreted as ranges of numbers, e.g. "all ages")

In case you're worrying about age discrimination, I think 99- covers content intended for centenarians :-)
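A quick way to check the regex is to run it with Python's `re` module. This is a sketch: the surrounding slashes are treated as delimiters and dropped, and the name `AGE_RANGE` is hypothetical.

```python
import re

# The pattern from the comment, without the surrounding / delimiters.
AGE_RANGE = re.compile(r"^[0-9]{1,2}-?[0-9]{0,2}$")

for candidate in ["7-9", "11-", "99-", "all ages"]:
    # match() returns a match object for accepted values and None otherwise.
    print(candidate, bool(AGE_RANGE.match(candidate)))
```

Because the hyphen is optional, the pattern also accepts a bare number such as 7 and an unhyphenated run such as 1234.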

kcoyle commented 3 years ago

Can we add something like: "Should implementers find that white space delimiters are not viable, other characters may be used.

@philbarker I was planning on getting that into issue #4, but agree that something also needs to be said where we define constraint types. I'll ponder where/how to word that.

And I'll use your regex. I was intending to use xsd:integer, but then the "-" got in the way.

kcoyle commented 3 years ago

There are two other XML Schema properties that we might want to consider:

| valueDatatype | valueConstraint | valueConstraintType |
| --- | --- | --- |
| xsd:integer | 'ne 0' | xsd:assertion |

| valueConstraint | valueConstraintType |
| --- | --- |
| [0-9]{5}(-[0-9]{4})? | xsd:pattern |

XML Schema also has minLength and maxLength. Although we probably can't combine them, a simple:

| valueDatatype | valueConstraint | valueConstraintType |
| --- | --- | --- |
| xsd:string | 3 | xsd:minLength |

would assert the minimum length of a string. I will try to go through the xsd document for other useful properties. We could say that by identifying the constraint from a standard, like XML Schema, it can be used in the TAP. Then we could give a few obvious examples. One could be from ShEx. I can't think of others at the moment, so reply if you know of others.
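As a rough Python sketch, a processor could dispatch on the valueConstraintType to evaluate these XML Schema facets. The `check_facet` function is hypothetical, and Python's regex syntax only approximates the XSD one. xsd:assertion is left out because it takes an XPath expression, which would need an XPath processor.

```python
import re


def check_facet(value: str, constraint: str, constraint_type: str) -> bool:
    """Evaluate one XSD-style facet against a string value."""
    if constraint_type == "xsd:minLength":
        return len(value) >= int(constraint)
    if constraint_type == "xsd:maxLength":
        return len(value) <= int(constraint)
    if constraint_type == "xsd:pattern":
        # XSD patterns must match the whole value, which re.fullmatch mirrors.
        return re.fullmatch(constraint, value) is not None
    raise ValueError(f"unsupported constraint type: {constraint_type}")


print(check_facet("abc", "3", "xsd:minLength"))
print(check_facet("12345-6789", r"[0-9]{5}(-[0-9]{4})?", "xsd:pattern"))
```

Since each row carries a single constraint type, minLength and maxLength on the same property would need two rows or some other mechanism.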

philbarker commented 3 years ago

I thought regexes would work for patterns, but if you find xsd:pattern friendlier it would make sense to include that.

kcoyle commented 3 years ago

Phil,

I edited the proposal with these suggestions. Take a second look:

https://github.com/dcmi/dctap/issues/5#issuecomment-756189578

The explanation about whitespace is hard to do succinctly. I'm going to try on issue #4, and in the document we can refer to a section that explains it. Wish me luck.

Thanks, kc

On 1/7/21 8:20 AM, Phil Barker wrote:

Can we add something like: "Should implementers find that white space delimiters are not viable, other characters may be used. However there will need to be some mechanism for communicating what character is being used so that it can be recognised by software processing the TAP. Such mechanisms may not be interoperable."

Example where this might be necessary:

| propertyID | valueDatatype | valueConstraint |
| --- | --- | --- |
| dc:subject | xsd:string | English language English literature Fine art |

Actually, that seems like it would be quite common. Perhaps we ought to suggest a fall back option such as |||?

-- Karen Coyle kcoyle@kcoyle.net http://kcoyle.net skype: kcoylenet

philbarker commented 3 years ago

@kcoyle text looks better now. I would change the sentence

Thus the string "A B C" is processed as "A or B or C".

to

Thus A B C is processed as A or B or C.

to avoid confusion as to what happens if you use double quotes around a value that has white space in it (such as "A B C")

Good luck :-)

kcoyle commented 3 years ago

Thanks, Phil - done.

tombaker commented 3 years ago

I like the examples above, with a few exceptions:

As xsd:assertion and xsd:pattern are datatypes, I find it confusing for them not to be listed in the valueDatatype column. What would remove the confusion, for me, would be to call them XSDAssertion and Pattern in the valueConstraintType column (though their definitions could cite XSD).

kcoyle commented 3 years ago

See discussion in https://github.com/dcmi/dcap/issues/61

kcoyle commented 3 years ago

Closing because the values were decided and are in the draft of the primer. Opened a discussion for additional types, as discussed in the comment above.