Open hcayless opened 2 years ago
It is the behavior I expect, but I can’t speak for others. This is why (IMHO) it is always good defensive programming to process normalize-space() of an attribute value (with a few exceptions).
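A minimal XSLT 1.0 sketch of that defensive step. It assumes a stylesheet that otherwise copies the document unchanged, and it picks the type attribute of <ab> purely for illustration; which attributes to normalize (and which to exempt) is a decision for each processor.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <!-- Identity template: copy every node and attribute as-is by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Illustrative override: strip leading and trailing whitespace (and
       collapse internal runs) in @type on tei:ab before using the value -->
  <xsl:template match="tei:ab/@type">
    <xsl:attribute name="type">
      <xsl:value-of select="normalize-space(.)"/>
    </xsl:attribute>
  </xsl:template>

</xsl:stylesheet>
```

With this in place, a value such as type=" foo " would be written out as type="foo", so later comparisons against "foo" behave as expected.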
I'm sorry, that seems nonsensical to me. The note on teidata.word says:
Attributes using this datatype must contain a single ‘word’ which contains only letters, digits, punctuation characters, or symbols: thus it cannot include whitespace.
If we're basing it on token, then that's clearly wrong and we should either fix the datatype or the note. It can contain whitespace (albeit only at the beginning and end).
@hcayless Council is thinking we should simply revise the remarks on teidata.word to resolve this with the least disruption to the community. And we wanted to ask you if that will suffice?
I'd be more in favor of fixing the definition of teidata.word to:
<content>
<dataRef name="string"
restriction="[^\p{C}\p{Z}]+"/>
</content>
But failing that, yes, the note must be fixed.
I am still quite uncomfortable with this. (I self-assigned this? What was I thinking?)
The problem, as I see it, is that if we add prose in the <remarks> of teidata.word that says “BTW, there can be space before & after the token itself, there just cannot be space inside the token” or whatever, then wouldn’t we feel compelled to add that prose to lots of other attributes, too? After all, most of the attributes in the TEI system share this same quality (he said, avoiding the term “feature” or “bug”).
Take, for example, the following TEI document. It is valid even though there is leading & trailing whitespace in values of at least teidata.name, teidata.count, teidata.language, xs:nonNegativeInteger, teidata.enumerated, teidata.numeric, teidata.certainty, teidata.duration.iso, and teidata.duration.w3c.
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<fileDesc>
<titleStmt>
<title>A little test file</title>
<title type="purpose">to demonstrate valid leading &amp; trailing spaces in attr
values</title>
</titleStmt>
<publicationStmt>
<p>Publication Information with
a <graphic url="./duck.jpg" width=" 36mm " height=" 24mm "/>
stuck in the middle for no reason.</p>
</publicationStmt>
<sourceDesc>
<p>Information <said direct=" false ">about</said> the source as of <date
dur-iso=" PT86400S "
dur=" PT86400.0S "
precision=" high "
unit=" s "
quantity=" 86400 "
scope=" notMuch ">today</date></p>
</sourceDesc>
</fileDesc>
<profileDesc>
<langUsage>
<language ident=" en-US " usage=" 99 "/>
<language ident=" " usage=" 01 "/>
</langUsage>
</profileDesc>
<encodingDesc>
<tagsDecl>
<namespace name="https://example.bauman.zapto.org/ns">
<tagUsage gi=" elName " occurs=" 7 " withId=" 005 "/>
</namespace>
</tagsDecl>
</encodingDesc>
</teiHeader>
<text>
<body>
<p>The vast majority of the attrs have whitespace, but this document is valid against
tei_all.</p>
</body>
</text>
</TEI>
Seems to me the place this kind of information belongs is in SG. (And I have checked, there is no mention of it there.) Probably in #SG-att, to be precise.
I continue to maintain that basing the datatype on token was an error that we should fix. Tokens don't work the way we'd want them to. This is bad and basically leaves a trap for the writers of TEI processing software to fall into.
Exhibit A: teidata.word, which has a restriction disallowing all control characters and separators and a note saying it can't contain whitespace.
Exhibit B: the following valid (according to Jing) TEI file (note the value of the type attribute on the <ab>):
RelaxNG tokens are expected to have their space normalized before their values are checked, thus whitespace before and/or after the attribute value is ignored for the purposes of validation. This is certainly not what we want for teidata.word according to the remarks, and possibly not for many of the token types we use. I think teidata.word at least should probably be changed to a string with the same restriction it has now.
Or is this expected behavior? It's been this way for a long time...
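To make the token/string difference concrete, here is a small stand-alone RELAX NG sketch. It is not the TEI schema: the un-namespaced <ab> element and its subtype attribute are hypothetical, chosen only so that the same pattern can be tried under both XSD datatypes side by side.

```xml
<!-- Hypothetical test grammar, not part of TEI -->
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="ab">
      <!-- token: whitespace is collapsed before the pattern is checked,
           so type=" foo " is accepted -->
      <attribute name="type">
        <data type="token">
          <param name="pattern">[^\p{C}\p{Z}]+</param>
        </data>
      </attribute>
      <!-- string: whitespace is preserved, so the pattern sees the spaces;
           subtype=" foo " is rejected while subtype="foo" is accepted -->
      <attribute name="subtype">
        <data type="string">
          <param name="pattern">[^\p{C}\p{Z}]+</param>
        </data>
      </attribute>
      <text/>
    </element>
  </start>
</grammar>
```

Against this grammar, <ab type=" foo " subtype="foo">x</ab> should validate, whereas changing subtype to " foo " should produce an error. That is the behavior the remarks on teidata.word describe, and it only holds for the string-based declaration.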