Harmonize reason attributes

rettinghaus commented 2 months ago

@reason is available for the elements gap, secl, supplied, surplus, and unclear. Most of them are based on teidata.word; the first and the last, however, take the long way and use teidata.enumerated instead. Is there a reason for this? I do not see any, so maybe it would be good to have all @reason attributes modeled in the same way.

sydb commented 1 month ago

Good point: it is probably better if those five cases of @reason were defined consistently. On the other hand, the two that are defined as enumerations are, in fact, enumerations (for <gap> the list of enumerations is "cancelled", "deleted", "editorial", "illegible", "inaudible", "irrelevant", and "sampling"; for <unclear> it is "illegible", "inaudible", "faded", "background_noise", and "eccentric_ductus"; in both cases the list is of “suggested values”). So those two should remain defined as teidata.enumerated. Seems to me the other three probably should be teidata.enumerated, too, both to match and because enumerations for these make sense (to me). Enumerations can also provide for tighter validation and thus better encoding.

It is worth reviewing (my interpretation of) the semantics of these two datatypes. The teidata.word datatype was originally intended as nothing more than a way to provide a “string without funny characters that are likely to be problematic when parsing” sort of datatype, roughly analogous to (but with not quite the same restrictions as) the Nmtoken of XML 4th edition. Because the words “string”, “text”, and “token” were already taken, it was named (poorly, in retrospect) “word”. However, within a few years the meaning morphed into more of a “single token that has its own semantics” sorta thing. In any case, its meaning is quite distinct from teidata.enumerated, which has exactly the same syntax, but which means “there is (or should be) a controlled vocabulary for this”. The controlled vocabularies are provided with the <valList> element, and come in three flavors:

closed (“legal values are”) — this is the list of possible values, thou shalt not use any others
semi (“suggested values include”) — this is a list of applicable values. If your case matches one of these cases, you should use the suggested value. If your case does not match any of these cases, you should make up your own value in the same vein.
open (“sample values include”) — this is a list of sample values. You might want to use these, you might not.
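
To make the three flavors concrete, here is a rough sketch of what an attribute definition with a “semi” list looks like in ODD, using the <gap> values quoted above. The <desc> wording is a placeholder of mine, not the text from the Guidelines, and other details of the real definition may differ.

```xml
<!-- Sketch only: an attribute declared as teidata.enumerated
     with a list of suggested values. -->
<attDef ident="reason" usage="opt">
  <!-- placeholder description, not the wording in the Guidelines -->
  <desc>gives the reason for the omission</desc>
  <datatype>
    <dataRef key="teidata.enumerated"/>
  </datatype>
  <!-- type is one of "closed", "semi", or "open" -->
  <valList type="semi">
    <valItem ident="cancelled"/>
    <valItem ident="deleted"/>
    <valItem ident="editorial"/>
    <valItem ident="illegible"/>
    <valItem ident="inaudible"/>
    <valItem ident="irrelevant"/>
    <valItem ident="sampling"/>
  </valList>
</attDef>
```

Changing type="semi" to type="closed" would turn the same list into the only legal values; type="open" would reduce it to a set of samples.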

In some cases the Guidelines do not actually provide a controlled vocabulary at all. I think the semantics of these cases is “you, the customizer writing the ODD customization for a TEI project, should provide a controlled vocabulary for this (but we’re not going to provide any helpful suggestions)”.
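
For illustration, a project customization that supplies its own vocabulary might look roughly like the following. The three values are made up for the example (they are not a proposed list), and I am assuming that the attribute has no <valList> to begin with, hence mode="add" on the list.

```xml
<!-- Sketch of an ODD customization fragment; the values are hypothetical. -->
<elementSpec ident="supplied" mode="change">
  <attList>
    <attDef ident="reason" mode="change">
      <!-- mode="add" assumes no valList exists yet for this attribute;
           use mode="replace" if one does -->
      <valList type="closed" mode="add">
        <valItem ident="overbinding"/>
        <valItem ident="faded-ink"/>
        <valItem ident="lost-folio"/>
      </valList>
    </attDef>
  </attList>
</elementSpec>
```

With a closed list like this, a schema generated from the customization would reject any other value of @reason on <supplied>.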

Of the 169 cases of teidata.enumerated in the Guidelines,

135 have a controlled vocabulary (48 closed, 45 open, and 41 semi), and
34 do not.

Given that the @reason of <gap> and <unclear> are already enumerations, and that the descriptions of @reason of <secl>, <supplied>, and <surplus> each include at least one sample value (and clearly are not intended to be plain text), I am (pretty strongly) of the opinion they should all be teidata.enumerated. And, in general, it seems to me to make sense for the Guidelines to provide vocabulary lists wherever possible. (I am not personally qualified to come up with a list for <secl>, but could probably handle the other two.)

But the other question to ponder is whether or not @reason should be in an attribute class of which these 5 elements would be members, each providing its own <valList>.
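
A rough sketch of how that could look, in case it helps the discussion. The class name att.reasoned is invented for this example and does not exist in the Guidelines; the <desc> texts are placeholders, and the exact mode values needed for the per-element override are an assumption on my part.

```xml
<!-- Hypothetical attribute class; "att.reasoned" is a made-up name. -->
<classSpec ident="att.reasoned" type="atts">
  <desc>provides an attribute giving the reason for an editorial intervention</desc>
  <attList>
    <attDef ident="reason" usage="opt">
      <desc>indicates the reason (placeholder wording)</desc>
      <datatype>
        <dataRef key="teidata.enumerated"/>
      </datatype>
    </attDef>
  </attList>
</classSpec>

<!-- One of the five member elements, supplying its own list of values. -->
<elementSpec ident="unclear">
  <classes>
    <memberOf key="att.reasoned"/>
  </classes>
  <attList>
    <attDef ident="reason" mode="change">
      <valList type="semi">
        <valItem ident="illegible"/>
        <valItem ident="inaudible"/>
        <valItem ident="faded"/>
        <valItem ident="background_noise"/>
        <valItem ident="eccentric_ductus"/>
      </valList>
    </attDef>
  </attList>
</elementSpec>
```

The datatype would then be stated once, in the class, and each of the five elements would differ only in its <valList>.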

rettinghaus commented 1 month ago

@sydb Thanks for sharing your thoughts. Based on your explanation I agree that teidata.enumerated is the better datatype here.

TEIC / TEI

Harmonize reason attributes #2580