TEIC / TEI

The Text Encoding Initiative Guidelines
https://www.tei-c.org

Inconsistency in character representation #1966

Open sydb opened 4 years ago

sydb commented 4 years ago

We are inconsistent about how we refer to non-printed characters. I have not delved deep enough yet to know if the difference is appropriate or inappropriate.

We refer to non-printed characters using entity reference notation a dozen times,[1] and using Unicode notation 21 times.[2]


[1]

$ egrep -c 'amp;#x?[0-9a-fA-F]' *.html | notzero 
CH.html:1
examples-char.html:1
examples-charName.html:1
examples-charProp.html:1
examples-localName.html:1
examples-mapping.html:1
examples-unicodeName.html:1
examples-value.html:1
SG.html:2
WD.html:2

[2]

$ egrep -c 'U\+[0-9a-fA-F]' *.html | notzero
CH.html:1
CO.html:1
examples-charDecl.html:1
examples-char.html:1
examples-charName.html:1
examples-charProp.html:1
examples-glyph.html:1
examples-glyphName.html:1
examples-localName.html:1
examples-mapping.html:1
examples-value.html:1
HD.html:1
PH.html:1
ref-teidata.language.html:1
SG.html:1
ST.html:1
WD.html:5
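
The two commands above depend on a `notzero` helper that is not shown in this thread. A minimal sketch of the same count, assuming `notzero` simply drops files with a zero count (the function names `notzero` and `count_lines` below, and the sample files, are illustrative assumptions, not the author's actual tooling). Note that `grep -c` counts matching lines rather than individual occurrences, so the totals quoted above are line counts.

```shell
#!/bin/sh
# Hypothetical stand-in for the `notzero` helper used in the thread:
# drop every "file:0" line from `grep -c` output.
notzero() {
    grep -v ':0$'
}

# Count matching lines per file for an extended regular expression.
# `grep -c` reports the number of matching LINES, not occurrences.
count_lines() {
    pattern=$1
    shift
    grep -E -c -- "$pattern" "$@" | notzero
}

# Demo on two throwaway files (invented sample content).
demo_dir=$(mktemp -d)
printf '%s\n' 'see U+00E9 and U+0301' 'or &amp;#xE9;' > "$demo_dir/a.html"
printf '%s\n' 'nothing relevant here' > "$demo_dir/b.html"

cd "$demo_dir" || exit 1
count_lines 'U\+[0-9a-fA-F]' *.html      # prints a.html:1
count_lines 'amp;#x?[0-9a-fA-F]' *.html  # prints a.html:1
```

The first demo line contains two U+ labels but is counted once, which is why a per-occurrence tally would need `grep -o` piped to `wc -l` instead.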

PFSchaffner commented 4 years ago

  1. I suppose that, technically, the entity invokes the character, whereas the U+ notation references the code point. That might be significant in some contexts, e.g. one in which you are specifying how to represent a character (entity) versus one in which you are discussing the fine distinction between different code points (U+).
  2. Assuming that, once one has applied the test in (1), inconsistency remains: does such minor inconsistency matter?
  3. If either will do, I'd plump for the U+ notation as less likely to cause display problems.
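
To make the distinction in (1) concrete: what the thread calls entity notation is, in XML terms, a numeric character reference such as `&#x0301;`, which a parser replaces with the character itself, while `U+0301` is ordinary text that merely names the code point. Both carry the same hexadecimal digits, so converting between them is mechanical. A rough sketch, assuming 4 to 6 hex digits after `U+`; the function names are hypothetical and nothing here is part of the TEI toolchain.

```shell
#!/bin/sh
# Rewrite each U+XXXX code point label (4 to 6 hex digits) as an XML
# hexadecimal numeric character reference, &#xXXXX;.
uplus_to_ncr() {
    sed -E 's/U\+([0-9A-Fa-f]{4,6})/\&#x\1;/g'
}

# Reverse direction: &#xXXXX; back to the U+XXXX label.
ncr_to_uplus() {
    sed -E 's/&#x([0-9A-Fa-f]+);/U+\1/g'
}

printf '%s\n' 'combining acute is U+0301' | uplus_to_ncr
# prints: combining acute is &#x0301;
printf '%s\n' 'combining acute is &#x0301;' | ncr_to_uplus
# prints: combining acute is U+0301
```

Only the string form is converted here; neither function resolves a reference to the actual character, which is the step an XML parser performs.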

hcayless commented 4 years ago

Going by the definition of mapping, it seems to me that the U+xx notation would be incorrect as content, and I see at least one instance of that.

duncdrum commented 4 years ago

There are subtle differences between entity notation and Unicode notation when it comes to XML processing via path (expand) or XSLT/XQuery (retain). I'd say leave it up to implementers to pick the right one for their use case. <mapping> is on its way out, iirc.

martinascholger commented 3 years ago

VF2F agreed that @npcole will work on prose explaining when U+hex will be interpreted as a code point and when any other text will be interpreted literally.