TEIC / TEI

The Text Encoding Initiative Guidelines
https://www.tei-c.org

Datatypes using 'token' permit whitespace even when they probably shouldn't #2370

Open hcayless opened 1 year ago

hcayless commented 1 year ago

Exhibit A: teidata.word, which has a restriction disallowing all control characters and separators and a note saying it can't contain whitespace.

Exhibit B: the following valid (according to Jing) TEI file (note the value of the type attribute on the <ab>):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml"
    schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" version="4.5.0 ">
  <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Title</title>
         </titleStmt>
         <publicationStmt>
            <p>Publication Information</p>
         </publicationStmt>
         <sourceDesc>
            <p>Information about the source</p>
         </sourceDesc>
      </fileDesc>
  </teiHeader>
  <text>
      <body>
         <ab type=" foo ">Some text here.</ab>
      </body>
  </text>
</TEI>

RelaxNG tokens are expected to have their space normalized before their values are checked, thus whitespace before and/or after the attribute value is ignored for the purposes of validation. This is certainly not what we want for teidata.word according to the remarks, and possibly not for many of the token types we use. I think teidata.word at least should probably be changed to a string with the same restriction it has now.
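
A rough Python sketch of that difference, for illustration only. It emulates by hand the whitespace handling of the XSD token and string datatypes (collapse versus preserve) and is not how Jing is implemented. The function names are made up for this example, and the character test approximates the teidata.word restriction (no control characters, no separators) using Unicode general categories.

```python
import re
import unicodedata


def collapse(value: str) -> str:
    # Emulates whitespace "collapse" as applied to token-based datatypes:
    # runs of space, tab, CR and LF become one space, then the ends are trimmed.
    return re.sub(r"[ \t\r\n]+", " ", value).strip(" ")


def is_word(value: str) -> bool:
    # Approximates the restriction [^\p{C}\p{Z}]+ : at least one character,
    # none of them in a Unicode "C" (other) or "Z" (separator) category.
    return len(value) > 0 and all(
        unicodedata.category(ch)[0] not in "CZ" for ch in value
    )


def valid_as_token(raw: str) -> bool:
    # Token-based datatype: whitespace is collapsed before the check.
    return is_word(collapse(raw))


def valid_as_string(raw: str) -> bool:
    # String-based datatype: the raw value is checked as-is.
    return is_word(raw)


for raw in [" foo ", "foo", "foo bar"]:
    print(repr(raw), valid_as_token(raw), valid_as_string(raw))
```

Under these assumptions " foo " passes the token-based check but fails the string-based one, while "foo bar" fails both because the internal space survives collapsing.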

Or is this expected behavior? It's been this way for a long time...

sydb commented 1 year ago

It is the behavior I expect, but I can’t speak for others. This is why (IMHO) it is always good defensive programming to process normalize-space() of an attribute value (with a few exceptions).
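
For processing code outside XSLT, the same defensive habit might look like the following minimal sketch. It assumes Python's standard xml.etree.ElementTree; normalize_space is a hand-written stand-in for the XPath function, not a library call, and the sample element is cut down from the <ab> in the file above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical one-element sample; only the type attribute matters here.
SAMPLE = '<ab xmlns="http://www.tei-c.org/ns/1.0" type=" foo ">Some text here.</ab>'


def normalize_space(value: str) -> str:
    # Stand-in for XPath normalize-space(): collapse runs of space, tab,
    # CR and LF to a single space and trim both ends.
    return re.sub(r"[ \t\r\n]+", " ", value).strip(" ")


ab = ET.fromstring(SAMPLE)
raw_type = ab.get("type")               # the parser hands back the padded value
clean_type = normalize_space(raw_type)  # the value the schema actually checked

print(repr(raw_type), repr(clean_type))
print(raw_type == "foo", clean_type == "foo")
```

A naive comparison of the raw attribute against "foo" fails even though the document is valid; comparing the normalized value succeeds.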

hcayless commented 1 year ago

I'm sorry, that seems nonsensical to me. The note on teidata.word says:

Attributes using this datatype must contain a single ‘word’ which contains only letters, digits, punctuation characters, or symbols: thus it cannot include whitespace.

If we're basing it on token, then that's clearly wrong and we should either fix the datatype or the note. It can contain whitespace (albeit only at the beginning and end).

ebeshero commented 1 year ago

@hcayless Council is thinking we should simply revise the remarks on teidata.word to resolve this with the least disruption to the community. And we wanted to ask you if that will suffice?

hcayless commented 1 year ago

I'd be more in favor of fixing the definition of teidata.word to:

<content>
 <dataRef name="string"
  restriction="[^\p{C}\p{Z}]+"/>
</content>

But failing that, yes, the note must be fixed.

sydb commented 10 months ago

I am still quite uncomfortable with this. (I self-assigned this? What was I thinking?)

The problem, as I see it, is that if we add prose in the <remarks> of teidata.word that says “BTW, there can be space before & after the token itself, there just cannot be space inside the token” or whatever, then wouldn’t we feel compelled to add that prose to lots of other attributes, too? After all, most of the attributes in the TEI system share this same quality (he said, avoiding the term “feature” or “bug”).

Take, for example, the following TEI document. It is valid even though there is leading & trailing whitespace in values of at least teidata.name, teidata.count, teidata.language, xs:nonNegativeInteger, teidata.enumerated, teidata.numeric, teidata.certainty, teidata.duration.iso, and teidata.duration.w3c.

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A little test file</title>
        <title type="purpose">to demonstrate valid leading &amp; trailing spaces in attr
          values</title>
      </titleStmt>
      <publicationStmt>
        <p>Publication Information with
          a <graphic url="./duck.jpg" width=" 36mm " height=" 24mm "/>
          stuck in the middle for no reason.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Information <said direct="  false ">about</said> the source as of <date
            dur-iso=" PT86400S "
            dur=" PT86400.0S "
            precision=" high "
            unit=" s "
            quantity=" 86400 "
            scope=" notMuch ">today</date></p>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <langUsage>
        <language ident=" en-US " usage=" 99 "/>
        <language ident="       " usage=" 01 "/>
      </langUsage>
    </profileDesc>
    <encodingDesc>
      <tagsDecl>
        <namespace name="https://example.bauman.zapto.org/ns">
          <tagUsage gi=" elName " occurs=" 7 " withId="  005 "/>
        </namespace>
      </tagsDecl>
    </encodingDesc>
  </teiHeader>
  <text>
    <body>
      <p>The vast majority of the attrs have whitespace, but this document is valid against
        tei_all.</p>
    </body>
  </text>
</TEI>
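
Since validation will not flag values like these, a project that cares could look for them separately. What follows is a small sketch of such a check, assuming Python's standard xml.etree.ElementTree. The function name padded_attributes is invented for this example, and the sample is a single element borrowed from the document above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one element taken from the test document above.
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">Publication Information with a '
    '<graphic url="./duck.jpg" width=" 36mm " height=" 24mm "/></p>'
)


def padded_attributes(xml_text: str):
    # Return (element tag, attribute name, raw value) for every attribute
    # whose value starts or ends with space, tab, CR or LF.
    found = []
    for element in ET.fromstring(xml_text).iter():
        for name, value in element.attrib.items():
            if value != value.strip(" \t\r\n"):
                found.append((element.tag, name, value))
    return found


for tag, name, value in padded_attributes(SAMPLE):
    print(tag, name, repr(value))
```

Here width and height are reported, while url is not because its value has no surrounding whitespace.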

Seems to me the place this kind of information belongs is in SG. (And I have checked, there is no mention of it there.) Probably in #SG-att, to be precise.

hcayless commented 10 months ago

I continue to maintain that basing the datatype on token was an error that we should fix. Tokens don't work the way we'd want them to. This is bad and basically leaves a trap for the writers of TEI processing software to fall into.