Closed Conal-Tuohy closed 4 years ago
My recommendation would be to shun external character entities in favour of either literal Unicode characters (e.g. æ), which only requires encoders to have a good font set up in their editor, or <g> elements, which are still fairly concise. There are some practical advantages to having "standalone" XML files (i.e. with no dependency on an external DTD).
But named character entities can still be used in P5, if that's your preference.
Note that the migration from P4 to P5 necessarily loses the character references, because the migration is performed in XSLT. The XML documents are already parsed when they arrive as input to the XSLT, with the character references replaced by the appropriate Unicode characters, so if character entity references are desired in the P5, we would need to regenerate them.
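The same effect can be seen with any XML parser, not only an XSLT processor. Here is a minimal sketch using Python's standard-library `xml.etree.ElementTree`; the sample document and its `aelig` entity declaration are invented for illustration and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one named entity declared in an internal DTD subset,
# plus one numeric character reference.
SAMPLE = '<!DOCTYPE p [<!ENTITY aelig "&#xE6;">]><p>&aelig; and &#x16A;</p>'

root = ET.fromstring(SAMPLE)

# By the time the application sees the text, both references have already
# been replaced by plain Unicode characters.
parsed_text = root.text

# Serialising to ASCII regenerates *numeric* references only; the original
# entity name (&aelig;) cannot be recovered from the parsed tree.
reserialised = ET.tostring(root, encoding="us-ascii").decode("ascii")

print(parsed_text)
print(reserialised)
```

Regenerating named references after the transform would therefore need an explicit character-to-name mapping applied at serialisation time.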
I have also noticed a number of "special" characters which have been encoded using codepoints from the Private Use Area, which are in fact already defined in Unicode.
Can someone be tasked with going through the list of PUA codes used and checking each one to ensure that it's strictly necessary?
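As a starting point for that audit, the PUA code points actually used could be listed automatically. This is a rough sketch that assumes the text has already been extracted from the parsed XML as a Python string; `find_pua` and the sample string are hypothetical names and data, not part of the project.

```python
import unicodedata
from collections import Counter


def find_pua(text: str) -> Counter:
    """Count characters whose Unicode general category is Co (private use)."""
    return Counter(ch for ch in text if unicodedata.category(ch) == "Co")


# Hypothetical sample: U+E000 is a PUA code point, U+016A is not.
sample = "abc\ue000 \u016a \ue000"
pua_counts = find_pua(sample)

# Print each PUA code point with the number of times it occurs.
for ch, n in sorted(pua_counts.items()):
    print(f"U+{ord(ch):04X}: {n}")
```

Each code point in the resulting list would still need a human to decide whether a standard Unicode character or combining sequence could replace it.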
The U MACRON char is one redundantly defined char, for example. I believe it ought to be encoded as "Ū" (U+016A LATIN CAPITAL LETTER U WITH MACRON). Additionally, there are characters which don't have a single Unicode codepoint, but which could be encoded using a combining character, e.g. the various consonant letters with macrons, such as H MACRON and P MACRON, which could be encoded using a combination of "H" or "P" with U+0304 COMBINING MACRON to give H̄ and P̄.
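The difference between the precomposed and the combining cases can be checked with Python's standard `unicodedata` module. This is only an illustrative sketch; the `compose` helper is a made-up name.

```python
import unicodedata

COMBINING_MACRON = "\u0304"


def compose(base: str) -> str:
    """Append a combining macron and apply NFC normalisation."""
    return unicodedata.normalize("NFC", base + COMBINING_MACRON)


u_macron = compose("U")
h_macron = compose("H")
p_macron = compose("P")

# "U" + macron collapses to the single precomposed code point U+016A.
print(len(u_macron), unicodedata.name(u_macron))

# "H" and "P" have no precomposed form, so they stay as two code points.
print(len(h_macron), len(p_macron))
```

Either way the result is standard Unicode, so no PUA code point is needed for these letters.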
The Latin special characters in the PUA have overlines, not macrons; overlines mark omission rather than vowel length or stress. The thing about macrons from a typography point of view is that they are drawn relative to the height of the character in the cell, so if an h had a macron (and they don't) its macron would not align with a u-macron, but would ride higher in the cell.
The thing about the overlines is that they are a single stroke and so the lines are all drawn at the same height regardless of the character. Compare the u-macron and the u-overline.
This was an issue several years ago, and those PUA characters should remain.
I'm not sure I fully understand your comment, to be honest, Wally. Are you saying that the character entity reference &u-macron; used in the corpus actually refers to a "u" with an overline, and not a macron? Could we use a U+0305 COMBINING OVERLINE to compose these graphemes? Is that invalid from a semantic point of view?
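For reference, Unicode does treat the two marks as distinct characters, which a short check with Python's `unicodedata` module illustrates. This sketch says nothing about which mark is semantically right for the corpus.

```python
import unicodedata

# "u" followed by COMBINING MACRON (U+0304) and COMBINING OVERLINE (U+0305).
macron_u = unicodedata.normalize("NFC", "u\u0304")
overline_u = unicodedata.normalize("NFC", "u\u0305")

# The macron sequence composes to one code point; the overline sequence
# has no precomposed equivalent and stays as base letter + combining mark.
print([unicodedata.name(c) for c in macron_u])
print([unicodedata.name(c) for c in overline_u])
```

Because the overline sequence never normalises to the macron form, the two encodings would remain distinguishable in the data.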
This is taking me back to a time in the distant past when I wrote some typographical software applications, including some for modifying various kinds of fonts to introduce macrons over vowels, to support publishing in the Māori language.
Not sure where we are with this issue, @wehooper?
Again, not sure where we are. I am tagging this issue as future so that @wehooper can clarify and we can address it later.
[Under P5 we're supposed to prefer Unicode over named entities, and, if I recall, named entities present security hazards. But named entities can be defined in P5 schemas (I think) and the medievalists (see article, section 11 ff, and the Conclusion) make a pretty good case for using them. -W]
[If they are legal, we get a lot of good use out of them. Our transcriptions are more legible with the entities than with the fully written TEI, and we have control over the meaning, so encoders can work without those concerns. -W]
[If we can keep the entities, maybe Con should leave them in situ, and take advantage of the control they afford, and transform the otherwise visible P4 around them to P5. -W]