TEIC / TEI

The Text Encoding Initiative Guidelines
https://www.tei-c.org

teidata.pointer equivalent to move/@where #1769

Closed joeytakeda closed 4 years ago

joeytakeda commented 6 years ago

The Map of Early Modern London is continuing to encode mayoral pageants, which took place across a number of places throughout London. These documents detail both the performances at particular sites as well as how the entire show moved from one place to another. We would like to be able to encode these movements using the <move> element and link these to our database of places, but currently move/@where is defined as teidata.word. Is there an equivalent to denote a particular place defined by an entity? For example:

<move where="locations.xml#place1"/>

Can there be a way to denote a place via @where that is a pointer (note that event/@where is a teidata.pointer and requires pointing to a <place> element)? I understand the utility of move/@where being teidata.word, so could there be a new attribute--maybe something like @wherePtr --that allowed a <move> to point to a defined place?

sydb commented 4 years ago

Indeed. I think we could also bank on any URI that does not have a ‘#’ and does not have a prefix matched by a <prefixDef> as not really being a URI. But how to word these messages and what to make red in the Guidelines, etc., will require some thought & work.

ebeshero commented 4 years ago

Okay @sydb and @martindholmes , I'd like to try the Schematron warning solution...I know we're entering the refrigeration zone tomorrow, but I think we should try to resolve this ticket for this release--I'll try something later today.

ebeshero commented 4 years ago

All--I am now not at all sure of what we should do on this ticket before this next release. I see three options:

1) We proceed with the new code in my pull request, and add to it some gentle Schematron warnings, mapping from previous values as @martindholmes suggested, and perhaps also checking for some standard sorts of things we expect in xsd:anyURI. The more closely I look at xsd:anyURI, the more confused I am about how to express warnings about this, since it really does not have to begin with a #, and I'm not sure why it necessarily would need to begin with something defined in a <prefixDef> as @sydb suggests. I was thinking of trying to write a Schematron test for the phenomena that @sydb found are not consistent with xsd:anyURI (like a % not followed by two hexadecimal digits, or the wrong regex before the very first instance of a colon).

2) Or, we scrap my pull request or defer it until after a deprecation warning period. But we have to write a deprecation warning, with some Schematron code attached, as above. I think that Schematron code should try to catch those regex conditions that Syd identified.

3) A third option: revise the pull request to keep the new URI examples for @where but refrain from introducing the new attribute class, att.locatable since it could create a sudden break with people's code. But indicate that this change is coming after the standard deprecation period and to expect it, and reinforce it with Schematron warnings on certain problematic values permitted by teidata.word that will eventually break when constrained as xsd:anyURI.

Anyway, I don't know how to proceed here. Help?

ebeshero commented 4 years ago

Modifying my previous post to include a third option...and looking at this handy description of xsd:anyURI: http://www.datypic.com/sc/xsd/t-xsd_anyURI.html to help figure out a Schematron test.

ebeshero commented 4 years ago

@sydb Schematron rules I've been testing to make sure a value complies with xsd:anyURI: These three rules seem to work on some simple test code when I make simple XML examples with @where attributes that contain :, %, and/or # in various combinations that break the rules of xsd:anyURI: [revised]

<sch:pattern id="testing-if-anyURI">
    <sch:rule context="*[@where]">
        <!-- Variable declarations come first, ahead of the assertions. -->
        <!-- Each whitespace-separated value of @where -->
        <sch:let name="values" value="tokenize(@where/string(), '\s+')"/>
        <!-- Only those values that contain a colon -->
        <sch:let name="cvalues" value="$values[contains(., ':')]"/>
        <!-- The text that follows each percent sign -->
        <sch:let name="ptokens" value="tokenize(@where, '%')[position() gt 1]"/>
        <sch:assert test="every $value in $values satisfies count(tokenize($value, '#')) le 2" role="warning">The value of the where attribute will need to conform to the xsd:anyURI datatype. The # character must only appear once.</sch:assert>
        <sch:assert test="every $cvalue in $cvalues satisfies matches(tokenize($cvalue, ':')[1], '^[A-Za-z0-9+.-]+$')" role="warning">The value of the where attribute will need to conform to the xsd:anyURI datatype. If one or more colons are used, the first colon must only be preceded by one or more simple alphanumeric characters, the plus, minus, or period symbols: [A-Za-z0-9+.-]+.</sch:assert>
        <sch:assert test="every $ptoken in $ptokens satisfies matches($ptoken, '^[0-9a-fA-F]{2}')" role="warning">The value of the where attribute will need to conform to the xsd:anyURI datatype. If a percent sign is used it may only be followed by two hexadecimal characters (representing a codepoint).</sch:assert>
    </sch:rule>
</sch:pattern>

ebeshero commented 4 years ago

Revised / simplified the Schematron so it works on multiple attribute values of @where separated by white space.

sydb commented 4 years ago

@ebeshero: My point was that if we find a @where whose value starts with a string that matches a prefix defined in a <prefixDef>, it is a pretty safe bet that it was intentional and that said value is a URI, so we don’t have to give a “you should use a URI” warning. The same goes if the value has a ‘#’: it is probably really a URI, and we need not issue a warning. But if the value a) does not start with a defined prefix (via <prefixDef> or known scheme like “http:”), and b) does not have a ‘#’ in it; OR c) does not match xs:anyURI, THEN we should issue a “this looks like a value, not a URI” warning. (Or some logic like that.) Note that all of the values in the suggested values list would thus get caught by the warning, because they all meet criterion (b). The point I am relying on here (and others may disagree with me on this) is that IF you are using a URI, THEN the value will almost certainly have a ‘#’, because you are probably not pointing to an entire document, but rather to one XML element within a document.
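
[Editorial sketch] The a/b/c logic described above could be expressed roughly as follows. This is only an illustration, not code from the thread or the pull request: the pattern id, the variable names, and the short list of "known" schemes ('http', 'https') are assumptions, and it only looks at a single-valued @where on <move>.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <sch:ns prefix="xs" uri="http://www.w3.org/2001/XMLSchema"/>
  <!-- Hypothetical pattern: warn when a value of @where does not look like a URI -->
  <sch:pattern id="where-looks-like-a-word">
    <sch:rule context="tei:move[@where]">
      <!-- Prefixes declared with prefixDef anywhere in this document -->
      <sch:let name="prefixes" value="//tei:prefixDef/@ident/string()"/>
      <!-- Text before the first colon (empty string when there is no colon) -->
      <sch:let name="scheme" value="substring-before(@where, ':')"/>
      <!-- (a) starts with a declared prefix or an assumed well-known scheme -->
      <sch:let name="hasKnownPrefix" value="$scheme = ($prefixes, 'http', 'https')"/>
      <!-- (b) contains a fragment identifier -->
      <sch:let name="hasFragment" value="contains(@where, '#')"/>
      <!-- (c) is syntactically acceptable as xs:anyURI to the processor in use -->
      <sch:let name="isURI" value="@where castable as xs:anyURI"/>
      <sch:report test="(not($hasKnownPrefix) and not($hasFragment)) or not($isURI)"
                  role="warning">The value "<sch:value-of select="@where"/>" looks like a value, not a URI.</sch:report>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

How strictly `castable as xs:anyURI` checks a value depends on the XPath processor, so criterion (c) may catch more or less than the hand-written regex tests earlier in the thread.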

sydb commented 4 years ago

If I understand it correctly, I think I lean towards @ebeshero’s option (3).

sydb commented 4 years ago

@ebeshero: The Schematron heuristics you posted seem to favor detailed assistance to the user over simplicity. Why not just <sch:assert test="every $value in $values castable as xsd:anyURI">One or more of the whitespace-separated values of @where is not a URI</>? Surely most modern users know how to use URIs.

ebeshero commented 4 years ago

@sydb Are you saying I have gone and done too much to fend off possible values that wouldn’t conform to xsd:anyURI? This is meant really for anyone who is following the old way and didn’t realize their data format might be incompatible...

ebeshero commented 4 years ago

...and as you yourself pointed out, URIs are a little surprising in what they don’t permit.

sydb commented 4 years ago

True, but what I find really worrisome are not the values that are invalid as URIs (e.g., "down%left", which both your Schematron and the simple castable as xs:anyURI will catch), but rather the values that remain valid, but no longer mean what the user thinks they mean. (E.g., "down-left", which would no longer mean “look up the value down-left in the ODD file and there padawan you will learn the meaning thereof” but rather “look into a file called “down-left” that is in the same directory as this file, and read it; somewhere therein you will find the meaning you seek”.)

But I have just realized that the definition of @where on <move> is not teidata.word, rather it is teidata.enumerated. The former is just a syntactic blob with no semantics, and no mechanism for attaching semantics. The latter has a very clearly defined semantic that we are all very used to — look in the ODD file.

Thus I am feeling much worse about this change overall — unlike most of y’all, I think teidata.enumerated is a better way to define encoding semantics than a URI. But simultaneously, if this step backwards in clarity must happen, I am feeling much better about how it could happen.

Although I don’t see any chance of doing this for the upcoming release, I see at least two possible ways forward. Both rely on the (perhaps crazy) idea that these two datatypes are not incompatible. We could either say “if the value is defined in the <valList>, use the semantics therein; if not, follow it as a URI and use the semantics you find at the end of the pointer” (this is “combined”) or say “during deprecation, like any teidata.enumerated these values have to be defined in your ODD; but they also have to conform to the syntactic constraints of a URI; after deprecation they can no longer be defined in your ODD, the semantics are defined by the URI” (this is “sequential”).

Does that make sense?

ebeshero commented 4 years ago

@sydb It doesn't entirely make sense to me, because the definition of teidata.enumerated simply is teidata.word, with an added note: "Attributes using this datatype must contain a single ‘word’ which contains only letters, digits, punctuation characters, or symbols: thus it cannot include whitespace. Typically, the list of documented possibilities will be provided (or exemplified) by a value list in the associated attribute specification, expressed with a valList element." What we say is that this is "typically" handled with a value list, but doesn't have to be.

Also, when you say this doesn't have a chance of being resolved before the upcoming release, I beg to differ, having done the bulk of the work, which involved creating att.locatable and putting @where into it, and writing up a Schematron test. If we don't finish this now, this branch is going to rot and nothing will change, and I don't want to see that happen, because this has a release milestone on it. I am sure we can resolve this with this release and make a plan for deprecation. I don't want to just let this one drop, because we are three quarters of the way to resolving it.

ebeshero commented 4 years ago

@sydb Okay--the not-so-crazy idea is that teidata.enumerated isn't really all that much different from teidata.pointer, and we now have either my elaborate Schematron test or your simple one to watch for problematic values. Do we actually need a deprecation at all, or just a gentle Schematron prod? We're not removing the attribute, just altering it a very little.

ebeshero commented 4 years ago

@sydb Still thinking about this, there's a fuzziness to teidata.pointer (xsd:anyURI). It could be defined anywhere, including in the same file (as with most who point with just a simple #value). We're not really saying where this has to be defined, but also didn't stipulate with teidata.enumerated that it had to be defined only in the ODD file either. There are lots of ways of defining and pointing to canonical values in our universe.

lb42 commented 4 years ago

FWIW, my recollection is that the purpose of introducing teidata.enumerated was precisely to say that the possible values for this attribute should be defined by a valList, i.e. that they were not just any teidata.word, but should be taken from an explicit enumeration of such values. At no point was it intended that this should be (yet another) variant on teidata.pointer, else why have both? As Syd points out, a bare word value has a different semantics depending on which datatype has been specified. That is correct, and it simply muddies the waters to attempt to special case situations where you want a value declared as one to behave as if it were declared as the other.

ebeshero commented 4 years ago

@lb42 We are in an awkward space just now with this ticket due to this troubling shift in semantics, but the shift is motivated by attempting to define an attribute consistently. Now we have a special use of @where on <move> as teidata.enumerated. But it is defined differently on <event> as teidata.pointer. Council made a decision on this to move the attribute to a new class and give it the teidata.pointer consistently (rather than change the use on <event>). The ticket here has helped us to think a little more broadly about movement positions as indeed something definable in a wider range of ways than perhaps initially imagined for <move>. But really, since Council made this decision and we are ready to implement it, we just need to decide before this next release whether the change is significant enough to warrant a deprecation period, and what kinds of guidance we need to provide our community about it. I am just concerned not to let this drop—I need a decision on whether to deprecate or just make the change with very clear explanation (plus some Schematron guidance already developed).

lb42 commented 4 years ago

@ebeshero I think I am (sort of) aware of how council got into this mess, but it's still a mess. I now think the right solution is to adopt joey's original suggestion of introducing @whereRef, for the reasons given early on in this thread, chiefly for consistency with lots of other cases. And I apologise for mistakenly saying that teidata.pointer values formed a subset of teidata.word values, which they don't. I think changing the datatype of an existing attribute to suit a not entirely canonical usage of it is not something the Guidelines should ever do, unless the existing datatype is clearly wrong, which is not the case for move/@where as originally defined.

ebeshero commented 4 years ago

@lb42 But Council decided this and I don’t think we were wrong. I don’t at all agree with this assessment of what we are doing. That is, we are not “changing the datatype of an existing attribute to suit a not entirely canonical usage of it”. I don’t understand this to be what we are doing at all. We are simply redefining the conditions for canonical values of @where in the interests of consistency in its usage across the Guidelines. We do not think introducing a new attribute is necessary, particularly since there is already precedent for defining @where on <event> as teidata.pointer. We don’t need at this 11th hour to be trading one source of semantic confusion for another.

ebeshero commented 4 years ago

Okay, this discussion probably indicates what we need to do with the documentation of the change. We want to encourage people to alter the use of this attribute so it is now to be a data pointer, for real, and not just conformant to the format of one. So it sounds like we should implement option 3, with deprecation notice and guidance on the significance of this change, and refrain from redefining @where in its new att.locatable class until after a deprecation period.

lb42 commented 4 years ago

@ebeshero I beg to differ. @where is currently defined (on <move>) as teidata.enumerated because its intended use is to indicate an area of the stage. It is also defined (on <event>) as teidata.pointer because its intended usage there is more general. It is worrying perhaps that attributes with the same name should have different datatypes, but that's an argument for introducing a different name (whereRef for example) not for "redefining the conditions". In my humble opinion.

ebeshero commented 4 years ago

@lb42 We would still have the problem that move/@where and event/@where are defined differently. That's a serious limitation in the current model, and I think we have seen good reason here to change the way we think about stage directions, which need not be so simple as we imagine them in "conventional" usage.

lb42 commented 4 years ago

@ebeshero possibly, but the argument could go the other way: why are we not changing event/@where ? since event is a relatively new introduction that would risk breaking fewer documents. And if you want to use pointers on your <move>, use @whereRef to make it explicit.

martindholmes commented 4 years ago

I'm with @lb42 here; I think the correct analogy is with att.edition, where (IIRC) there was originally @ed, which was teidata.enumerated, then @edRef was introduced alongside it. Joey lists some other examples of the same thing. This is a common pattern in TEI, and I don't see why it wouldn't be used this time to avoid breaking backwards compatibility for anyone.

ebeshero commented 4 years ago

Okay, all of you, this is completely baffling coming just before a release. But let's see if we can figure out what to do right now.

Do we want @where to be the same attribute on both <event> and <move>? Or do we want it to behave as it currently does, with a different datatype on each element? How high a priority is it for Council to try to streamline these definitions and define something like an att.locatable, which I attempted to do as directed on this ticket?

Perhaps we need to discuss that little can of worms some more and we take NO action on this ticket at all for this release. We change the status of this ticket to "Needs Discussion" and bring it up at the next meeting. If we can't be settled on an appropriate datatype for @where I don't think we can take any action at all now.

ebeshero commented 4 years ago

Even so, we could take some action right now that doesn't break anything, yes, but introducing @whereRef raises another question. What are we doing about event/@where, if anything at all? Do we even want a new attribute class? Or are we reverting to some old way where @where just has to have two different datatypes because that's a higher priority for us?

lb42 commented 4 years ago

Or shall I put in a ticket which says that event/@where is a mistake (because @where is already taken and has a different datatype) and should be renamed e.g. to @whereRef?

martindholmes commented 4 years ago

I think the @where on <event> is a red herring; it's intended to be a pointer to a <place> element. So even if you created att.locatable, you'd have to subclass it anyway. I think Joey's original suggestion of @whereRef alongside @where, by analogy with all the other similar pairs that already exist, is the right thing to do.

sydb commented 4 years ago

At the moment (a moment of weakness or clarity, I’m not sure which), I am inclined to:

  1. create new att.locatable class (as @ebeshero has already)
  2. give it two attributes: @where (teidata.enumerated) and @whereRef (teidata.pointer)
  3. make both <move> and <event> members of new class
  4. do some sort of deprecation, warning users that the values of event/@where should be enumerated; if you want to use a URI, switch to @whereRef (over the next 2 years or whatever)
  5. Spend a few hours (well, at least minutes) arguing over whether @where and @whereRef should be mutually exclusive or not
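
[Editorial sketch] Steps 1 to 3 of the list above might look something like this in an ODD customization. It is a sketch of the proposal as listed, not of what Council eventually implemented; the schemaSpec identifier and the wording of each <desc> are invented for illustration, and any existing local definition of @where on <move> or <event> would still need to be dealt with separately.

```xml
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0" ident="locatable-sketch">
  <!-- Step 1: a new attribute class (the name att.locatable comes from the thread) -->
  <classSpec ident="att.locatable" type="atts" mode="add">
    <desc>provides attributes for indicating where something takes place.</desc>
    <attList>
      <!-- Step 2a: @where as an enumerated value, documented in a valList -->
      <attDef ident="where" usage="opt">
        <desc>indicates the location by means of a word defined in the ODD.</desc>
        <datatype><dataRef key="teidata.enumerated"/></datatype>
      </attDef>
      <!-- Step 2b: @whereRef as one or more pointers -->
      <attDef ident="whereRef" usage="opt">
        <desc>points to one or more definitions of the location.</desc>
        <datatype minOccurs="1" maxOccurs="unbounded">
          <dataRef key="teidata.pointer"/>
        </datatype>
      </attDef>
    </attList>
  </classSpec>
  <!-- Step 3: make both elements members of the new class -->
  <elementSpec ident="move" mode="change">
    <classes mode="change"><memberOf key="att.locatable"/></classes>
  </elementSpec>
  <elementSpec ident="event" mode="change">
    <classes mode="change"><memberOf key="att.locatable"/></classes>
  </elementSpec>
</schemaSpec>
```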

I realize that the deprecation business may not be easy. (But it would not be easy if we went ahead with Council’s current plan, either.)

ebeshero commented 4 years ago

Okay, having talked with @sydb , @martinascholger , and @raffazizzi about this during the Stylesheets meeting today, I think we have a good plan here (in what @sydb has just outlined this morning), but we shouldn't act on it for this release because we had better have another discussion about it with the full Council. I've changed the release milestone for the moment (unless we decide at next Tuesday's Council meeting to try to implement at least a first part of it, to put deprecation warnings in place). Let's figure that part out now.

ebeshero commented 4 years ago

So the question is, is it best to take zero action on this until we've discussed with full Council? @martinascholger , let's put that on the agenda for next week's meeting.

hcayless commented 4 years ago

In Council on 2020-02-11 we discussed the possibility of solving this by creating a new datatype that would essentially be an alternation of teidata.enumerated or teidata.pointer. This seemed initially to me to be a hack, but, as I expressed then, the stakes are really low here, so I didn't care.

Upon further reflection though, it seems to me that this is actually a useful thing: a datatype that has as its value some entry in an authority list, which might be locally defined as a value list, or by some internal or external taxonomy or gazetteer. This makes perfect sense to me, so my proposal is that I emend my pull request from creating a teidata.enumeratedOrPointer to creating a teidata.authority (which will do the same thing in practical terms), defined as a value that derives either from a local value list or a resource from an item in a taxonomy or other authority list.
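
[Editorial sketch] As described here, the proposed datatype is an alternation of the two existing ones. A minimal ODD rendering of that idea might be the following; the <desc> wording paraphrases the comment above, and the definition actually merged in the pull request may differ in detail.

```xml
<dataSpec xmlns="http://www.tei-c.org/ns/1.0" ident="teidata.authority" mode="add">
  <desc>defines values that derive from an authority list: either a locally defined
    value list, or an item in a taxonomy, gazetteer, or other external resource.</desc>
  <content>
    <!-- Either a single enumerated word or a pointer (URI) -->
    <alternate>
      <dataRef key="teidata.enumerated"/>
      <dataRef key="teidata.pointer"/>
    </alternate>
  </content>
</dataSpec>
```

Under this sketch both `<move where="left"/>` and `<move where="locations.xml#place1"/>` would be accepted by the schema. The choice of interpretation is left to whatever processes the document.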

ebeshero commented 4 years ago

Council thinks HC's suggestion would work, and he should emend the pull request accordingly. There may be a significant potential problem, @sydb notes, with processing ambiguity when it is not clear whether this is a token or a URI (keeping in mind that a URI can simply take the form of a text string). This should be something clarified in the ODD schema valList.

hcayless commented 4 years ago

Ok. I have emended the PR. For the record, I know @sydb disagrees with this way of resolving the ambiguity, but I really think it's going to be fine. A downstream processor, confronted with a teidata.authority value can attempt to resolve it, if it wants, or not, if it doesn't.

sydb commented 4 years ago

Majority of Council VF2F (extension) meeting has agreed definitively with #1974 (although prefer @hcayless’ new name teidata.authority). This ticket is now green for that (modified) implementation.

I dissent, joined (I believe) in part or in whole by @npcole.

This approach makes perfect sense for any given project. The project decides to use @where on <move>, and then decides on a set of values which might be enumerated (good idea IMHO) or might be URIs (not bad if they point to widely agreed upon canonical concepts). The project knows full well which it chose, and can even document this appropriately in their ODD either with just prose or by changing the datatype of @where to whichever they are actually using.

But for a generic processor (think TAPAS, TEI Boilerplate, CETEIcean, or a generic TEI data-sanity-checker), this approach at the very least generates unnecessary headaches, and at the worst makes certain tasks impossible. This problem occurs because any URL could also serve as a value in an enumerated list. (The reverse is not true: there have always been lots of characters that are allowed in enumerated lists that are not allowed in URLs, and as of release 4.0.0 an additional 2500 or so were added, the characters in the Unicode “Marks” category.)

Thus looking at an instance of this attribute, a processor will now have to figure out which kind it is, and in some cases there will be no syntactic clues in the instance file. The processor will have to read the schema (either the ODD or the RelaxNG would do — no need to read the Schematron), if it is available. If not, you may just be out of luck.

Imagine, for example, that you are trying to write a generic TEI link checker that works on any vanilla TEI P5 4.1.0 instance document. One of its jobs is to report what percentage of linking attributes can be resolved, and what percentage cannot be found. It may have no idea whether to include the occurrences of @where on <move> in the denominator. (That is, should they be checked? And I say “may have” because if every value has a character sequence that is not allowed in a URI, like ‘§’ or “%M€”, the processor might safely guess they are all enumerated values, rather than all have typos.) More importantly, when it sees a value of @where on <move> that could be either a URI or a teidata.word (that is, a value enumerated elsewhere) it does not know whether it should test the value to see if it is a resolvable pointer or not.

Imagine now the (contrived and bizarre) circumstance of a TEI instance document that has an occurrence of <move where="left"> that is being processed in an environment in which there exists both a "left" in an enumerated list of values and a file ./left, that is a file named “left” in the same directory. (Stupid name, especially if it is a PNG of a left-pointing arrow. :-) The semantic implication of the attribute now depends on whether the processor a) looks for the enumerated value first, stopping when it finds there is one, b) looks for the file first, stopping when it finds one, c) chooses the file if both are found, d) chooses the enumerated value if both are found, or e) generates an error if both are found (and furthermore whether or not the processor looks inside the targeted ./left file to find what the semantics are, but that’s true for any pointer to a file).

The same problem exists if the resources are not available, but is much less of a problem because all it means is a less helpful error message. Imagine the same TEI instance as above (with <move where="left">) but now being processed without access to its ODD (or RelaxNG schema) and displaced from its home directory. Now when the processor sees "left", it does not even have a mechanism for determining what kind of error this is (unresolvable URI or no enumeration of values). Again, not a big problem, just a weird and less helpful error message like “ambiguous value "left" cannot be found or checked” or whatever.

And remember that the processor may not be able to flag a value as invalid even when there is an enumerated list available in the ODD (or RelaxNG schema) and said value is not in the list, because the list may be "open" or "semi", allowing other values. In such a case the TEI has given no guidance as to whether or not the processor should look for a file, and either way it (again) can only generate a relatively imprecise error message.

I concede this is not the first place in TEI where we introduce ambiguity for processors. For example, is the value "isbn:19621020" a URN using the well-known "isbn" scheme, a URN using the private scheme "isbn" defined in the <prefixDef>, or perhaps a URL that refers to a file in the same directory that is named isbn:19621020? (Note for MacOS users, ‘:’ is a perfectly legal character in many filesystems. Note for all: in truth, "isbn:19621020" is not valid as a relative path URI because of this very issue — the colon would make it ambiguous — thus RFC3986 requires "./isbn:19621020" if you really want to point to a local file.).

I also concede that any project can get around the main bulk of the problems caused by this ambiguity (and the one exemplified by "isbn", above) by using "file:///left" (for an absolute path) or "./left" (for a relative path) when the desired use of @where is a URL, not an enumerated value.
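
[Editorial sketch] The workaround in the previous paragraph, written out as a TEI fragment. The wrapping <div> and the comments are only there to make the fragment well-formed and readable; the three values are the ones discussed above.

```xml
<div xmlns="http://www.tei-c.org/ns/1.0">
  <!-- Bare word: could be read as an enumerated value or as a relative URI -->
  <move where="left"/>
  <!-- Explicit relative path: clearly intended as a pointer to a local resource -->
  <move where="./left"/>
  <!-- Explicit absolute file URI: also clearly intended as a pointer -->
  <move where="file:///left"/>
</div>
```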

Lastly and perhaps most importantly, I concede the point that this will only come up very rarely if ever for most projects, and is probably a relatively small problem for most authors of generic processing software.

So I am not worried about this becoming a problem frequently. I am terrified of having to write the apology, though, to the user that is bitten by this. What could we possibly say to defend deliberately introducing unresolvable ambiguity? We cannot say “there was no other way”, because there are several other solutions to the problem OP has (including the solution OP suggested, although it does have problems of its own, they are not insurmountable). We cannot even say “we tried to make it as reasonable as possible” because where we could at least give clean semantics to this ambiguity (e.g., as suggested “look at the enumerated value list first, then if the value is not in that list use it as a pointer iff it is an xs:anyURI” or some such — only solves some of the problems, but at least it helps) we have not.

P.S. It is worth noting that while I have at times (somewhat hesitatingly) supported variations of this idea (heck, I think I am the one who suggested it first), I have supported those variations that tried to make clear whether a given value was a URI or an enumerated value at any given time, not this “it’s your problem, bub” distillation of it.
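
[Editorial sketch] The "enumerated list first, then pointer" rule mentioned above could be sketched as an XSLT 2.0 function. The function name, its namespace, and the three return strings are all hypothetical. The caller is assumed to have collected the enumerated values already (for example from a <valList> in the ODD), which is exactly the information a generic processor may not have.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="urn:example:where"
    exclude-result-prefixes="xs ex">

  <!-- Classify one value of @where, given the known enumerated values -->
  <xsl:function name="ex:classify-where" as="xs:string">
    <xsl:param name="value" as="xs:string"/>
    <xsl:param name="enumerated" as="xs:string*"/>
    <xsl:sequence select="
      if ($value = $enumerated) then 'enumerated'
      else if ($value castable as xs:anyURI) then 'pointer'
      else 'unresolved'"/>
  </xsl:function>

</xsl:stylesheet>
```

With this ordering, `ex:classify-where('left', ('left', 'right'))` yields 'enumerated' whether or not a file named "left" also exists. That settles only one of the cases described above. Open or semi-open value lists remain ambiguous.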

martinascholger commented 4 years ago

This issue has been addressed with the introduction of the new datatype teidata.authority in PR https://github.com/TEIC/TEI/pull/1974

sydb commented 4 years ago

Except that it hasn’t, has it? Seems to me the new (awful) datatype exists, but the @where attribute of <move> is still teidata.pointer.

martinascholger commented 4 years ago

The @where attribute of <move> is teidata.authority

sydb commented 4 years ago

In what branch? I did a git pull just before re-opening this, and it was not. But now it is. Bad timing, I guess. Thanks @martinascholger !