Defining value constraints

kcoyle commented 3 years ago

We have said that value constraints are additional rules to be applied to the value datatype. We have also considered that the value constraint could be a regex or a piece of ShExC (or ShExJ) code. Other constraints could be simple "formulas" like:

GT 13
> 1900
"admin*"

Can we define rules for value constraints?

Keep in mind that one role for profiles is to convey information about instance metadata to "foreign" users, so ideally anyone should be able to understand the constraints without having inside information.

tombaker commented 3 years ago

Someone (Ben? John?) suggested that we distinguish constraint_type and constraint_value. Without making such a distinction, it is impossible (or at any rate risky) to guess what the examples above really mean:

GT 13 - could conceivably be interpreted as a literal with a space, a regular expression, or a pick list of two literals: GT and 13.
>1900 - could be a literal, regular expression, or mathematical expression ("greater than 1900")
"admin*" - etc...

Consider:

value_type	constraint_value	constraint_type	annotation
Literal	GT 13		
Literal	GT 13	Regex	requires all three
Literal	GT 13	LiteralPicklist	
LiteralPicklist	GT 13		

The literal pick list could be handled with two columns if "value type" were not limited to URI, Literal, Non-Literal, and BNode, but the Regex could not.
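
A minimal Python sketch of why the separate column matters: the same cell, GT 13, accepts different instance values depending on the declared constraint_type. The function `interpret` and the type name `Comparison` are hypothetical, introduced only for illustration; `Regex` and `LiteralPicklist` come from the table above, and treating a missing type as an exact literal is an assumption.

```python
import re


def interpret(constraint_value, constraint_type=None):
    """Return a predicate that tests one instance value (a string)."""
    if constraint_type is None:
        # Assumption: with no declared type, the cell is an exact literal.
        return lambda v: v == constraint_value
    if constraint_type == "Regex":
        pattern = re.compile(constraint_value)
        return lambda v: pattern.fullmatch(v) is not None
    if constraint_type == "LiteralPicklist":
        # Whitespace-separated choices: "GT 13" becomes ["GT", "13"].
        choices = constraint_value.split()
        return lambda v: v in choices
    if constraint_type == "Comparison":
        # Hypothetical type: only the "GT <number>" form is handled here.
        operator, bound = constraint_value.split()
        if operator != "GT":
            raise ValueError(f"unsupported operator: {operator!r}")
        limit = float(bound)
        return lambda v: float(v) > limit
    raise ValueError(f"unknown constraint_type: {constraint_type!r}")


# The instance value "13" passes or fails depending on the declared type.
for ctype in (None, "Regex", "LiteralPicklist", "Comparison"):
    print(ctype, interpret("GT 13", ctype)("13"))
```

Only the pick list reading accepts the value "13" here, which is the kind of divergence a reader cannot resolve from the cell alone.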

briesenberg07 commented 3 years ago

@tombaker wrote: "The literal pick list could be handled with two columns if "value type" were not limited to URI, Literal, Non-Literal, and BNode, but the Regex could not."

If we wish to express a value type--we may roughly equate this with a datatype--in the column value_type, then it doesn't make sense for something like LiteralPicklist to appear there, as values in the instance data will not be literal picklists.

kcoyle commented 3 years ago

Option 1: One column, assess values by pattern

In this option, value constraints are identified using their characteristics:

"blah" "bleh" "bluh" = a literal list
regex:(/[.+-?^${}()[]\]/g)) = regex (some other indicator can be used instead of "regex:" - I just made that up)
LT 13 GT 2 = a formula (also <13 >2) (or use ShEx form: MinExclusive 2 MaxExclusive 13) (or just min 3 max 12, for integers)
https://id.loc.gov/subjects* (for a URI stem) (or uristem:id.loc.gov/subjects) (or uristem:https://id.loc.gov/subjects)

Assume that we would choose one way to code each possibility.

Advantages:

one column
users can learn patterns but don't have to know how to name them

Disadvantages:

could be complicated to code for correctly
will be difficult to detect badly expressed patterns vs. just a complicated pattern

Some questions:

which regex?!
are there other types of constraints that we need to add here, e.g. language tag lists?
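
To make the "could be complicated to code for correctly" point concrete, here is a rough Python sketch of what pattern-based detection might look like for Option 1. The function `guess_constraint_type`, the returned labels, and the choice of the `regex:` and `uristem:` prefixes are all assumptions based on the examples above, not an agreed syntax.

```python
import re


def guess_constraint_type(cell):
    """Guess what kind of constraint a single cell holds, by shape alone."""
    cell = cell.strip()
    if cell.startswith("regex:"):
        return "regex"
    if cell.startswith("uristem:") or re.fullmatch(r"https?://\S+\*", cell):
        return "uristem"
    # One or more double-quoted strings separated by whitespace.
    if re.fullmatch(r'"[^"]*"(\s+"[^"]*")*', cell):
        return "picklist"
    # One or more comparison terms such as "LT 13", "GT 2", "<13", ">2".
    if re.fullmatch(r"((LT|GT|<|>)\s*\d+\s*)+", cell):
        return "formula"
    return "unknown"


for example in ['"blah" "bleh" "bluh"', "LT 13 GT 2", "<13 >2",
                "https://id.loc.gov/subjects*", '"admin*"', "min 2 max 12"]:
    print(repr(example), "->", guess_constraint_type(example))
```

Note that `"admin*"` is reported as a one-item pick list even though its author may have meant a wildcard, and `min 2 max 12` falls through to "unknown" until someone adds a rule for it. Both cases illustrate the difficulty of telling a badly expressed pattern from a merely unfamiliar one.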

kcoyle commented 3 years ago

Option 2: One column for constraint type, separate column for the pattern

In this option each constraint will have a type associated with it.

type	example
pick list	"red" "blue" "green"
formula	<13 >2
uristem	https://id.loc.gov/subjects
regex	/[.*+-?^${}()|[]\]/g, '\$&'); // $&

The "constraint type" could be expressed as actions, such as: "select one of" for "pick list"; "beginning with" for "uristem". I can't immediately think of more, but we could probably come up with them.

Advantages:

clear designation of type of constraint
easier to determine if constraint is expressed in a valid way
profile creators can specify constraints that their applications understand

Disadvantages:

We have to determine a specific set of constraints that all applications need to understand
The number of types of constraints may grow very large in actual usage
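
A small Python sketch of how an application might consume Option 2: the constraint type selects a checker, and an unrecognized type is reported rather than guessed at. The function names, the `VALIDATORS` table, and the instance values in the usage lines are illustrative assumptions; the formula checker handles only `<N` and `>N` terms, and the regex checker uses Python's `re` dialect, which leaves the "which regex?" question open.

```python
import re


def check_picklist(pattern, value):
    # Double-quoted literals separated by whitespace, as in the example row.
    return value in re.findall(r'"([^"]*)"', pattern)


def check_uristem(pattern, value):
    return value.startswith(pattern)


def check_regex(pattern, value):
    return re.fullmatch(pattern, value) is not None


def check_formula(pattern, value):
    # Every "<N" or ">N" term in the pattern must hold for the value.
    number = float(value)
    for operator, bound in re.findall(r"([<>])\s*(\d+(?:\.\d+)?)", pattern):
        if operator == "<" and not number < float(bound):
            return False
        if operator == ">" and not number > float(bound):
            return False
    return True


VALIDATORS = {
    "pick list": check_picklist,
    "formula": check_formula,
    "uristem": check_uristem,
    "regex": check_regex,
}


def validate(constraint_type, pattern, value):
    if constraint_type not in VALIDATORS:
        raise ValueError(f"unknown constraint type: {constraint_type!r}")
    return VALIDATORS[constraint_type](pattern, value)


print(validate("pick list", '"red" "blue" "green"', "blue"))
print(validate("formula", "<13 >2", "5"))
print(validate("uristem", "https://id.loc.gov/subjects",
               "https://id.loc.gov/subjects/example"))
```

Each new constraint type means one more entry in the table, which is where the second disadvantage (growth in the number of types) would show up in practice.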

briesenberg07 commented 3 years ago

I'm in the Option 2 camp at this time, because I think it offers more clarity*.

But the disadvantages that @kcoyle points out are worth consideration!

How many constraint types will we surface and need to include in our set if we go down this path?
We'll need to settle on one way to express each in the vocabulary (?)--"What kind of regex?" etc.

You could easily imagine how this could make a simple model significantly more complex.

To be clear, what I'm thinking of at the current time is something like*:

value_type	constraint_type	constraint_value
URI	URI stem	https://id.loc.gov/subjects
literal	pick list	'red' 'blue' 'green'

etc.
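
One way to picture the "how many constraint types" question is a check that every row of a three-column table uses a type from an agreed set. The Python sketch below assumes a tab-separated layout with the column names above; the set `KNOWN_CONSTRAINT_TYPES`, the function name, and the third row ("fuzzy match") are made up for illustration.

```python
import csv
import io

# Assumption: the agreed vocabulary so far contains only these two types.
KNOWN_CONSTRAINT_TYPES = {"URI stem", "pick list"}

TABLE = (
    "value_type\tconstraint_type\tconstraint_value\n"
    "URI\tURI stem\thttps://id.loc.gov/subjects\n"
    "literal\tpick list\t'red' 'blue' 'green'\n"
    "literal\tfuzzy match\tcolour\n"  # hypothetical row with an unlisted type
)


def unknown_constraint_rows(text):
    """Return (line number, constraint_type) for rows with an unlisted type."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    problems = []
    # The header is line 1, so the first data row is line 2.
    for line_number, row in enumerate(reader, start=2):
        if row["constraint_type"] not in KNOWN_CONSTRAINT_TYPES:
            problems.append((line_number, row["constraint_type"]))
    return problems


print(unknown_constraint_rows(TABLE))
```

A closed set keeps profiles checkable, at the cost of having to extend the set whenever a profile author needs a type that is not yet listed.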

johnhuck commented 3 years ago

It seems to me that Option 1 may not avoid the need to define constraint types, or the problem of their proliferation, since it merely replaces that need with a need to define syntaxes (and circulate those definitions). That makes Option 2 more appealing to me.

kcoyle commented 3 years ago

This is now being discussed at https://github.com/dcmi/dctap/issues/5, which links back to here.

dcmi / dcap

Defining value constraints #63