IUBLibTech / newton_chymistry

New version of 'The Chymistry of Isaac Newton', using XProc pipelines to generate a website based on TEI XML encodings of Newton's alchemical manuscripts, and Apache Solr as a search engine.

How to encode special characters? #20

Closed Conal-Tuohy closed 4 years ago

Conal-Tuohy commented 5 years ago

[Under P5 we're supposed to prefer Unicode over named entities, and, if I recall, named entities present security hazards. But named entities can be defined in P5 schemas (I think) and the medievalists (see article, section 11 ff, and the Conclusion) make a pretty good case for using them. -W] [If they are legal, we get a lot of good use out of them. Our transcriptions are more legible with the entities than with the fully written TEI, and we have control over the meaning, so encoders can work without those concerns. -W] [If we can keep the entities, maybe Con should leave them in situ, and take advantage of the control they afford, and transform the otherwise visible P4 around them to P5. -W]

Conal-Tuohy commented 5 years ago

My recommendation would be to shun external character entities in favour of

There are some practical advantages to having "standalone" XML files (i.e. with no dependency on an external DTD).

But named character entities can still be used in P5, if that's your preference.

Note that the migration from P4 to P5 necessarily loses the character entity references, because the migration is performed in XSLT. The XML documents are already parsed when they arrive as input to the XSLT, and each reference has already been replaced with the appropriate Unicode character. If character entity references are desired in the P5, we would need to regenerate them.
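To illustrate the point about parsing, here is a minimal Python sketch (not the project's XSLT pipeline). It assumes a hypothetical entity `Umacr` declared in an internal DTD subset and a hypothetical character-to-name mapping. The sketch shows that the parser hands over the expanded character, and that the references have to be put back as a separate step on the serialized output.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the entity name and its declaration are illustrative only.
SOURCE = '<!DOCTYPE p [<!ENTITY Umacr "&#x16A;">]><p>&Umacr;num</p>'

# Hypothetical mapping from Unicode character back to entity name.
ENTITY_NAMES = {"\u016a": "Umacr"}


def regenerate_entities(serialized: str, names: dict) -> str:
    """Replace each mapped character in already-serialized XML with &name;."""
    for char, name in names.items():
        serialized = serialized.replace(char, f"&{name};")
    return serialized


root = ET.fromstring(SOURCE)
parsed_text = root.text  # the parser has already expanded the entity reference
serialized = ET.tostring(root, encoding="unicode")  # no entity reference survives
restored = regenerate_entities(serialized, ENTITY_NAMES)
print(restored)
```

The regenerated output is only well-formed if the matching entity declarations are supplied again, for example by a DTD, which is the dependency that standalone files avoid.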

Conal-Tuohy commented 5 years ago

I have also noticed a number of "special" characters which have been encoded using codepoints from the Private Use Area, which are in fact already defined in Unicode.

Can someone be tasked with going through the list of PUA codes used and checking each one to ensure that it's strictly necessary?
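As a starting point for that review, a rough Python sketch along these lines could list which private-use code points actually occur in a transcription and how often. The sample string and its code points are placeholders, not values taken from the corpus.

```python
import unicodedata
from collections import Counter


def pua_code_points(text: str) -> Counter:
    """Count characters whose Unicode general category is Co (private use)."""
    return Counter(
        f"U+{ord(ch):04X}" for ch in text if unicodedata.category(ch) == "Co"
    )


# Placeholder sample; these PUA code points are not the project's own.
sample = "a\ue000b\ue000c\uf8ff"
report = pua_code_points(sample)
for code_point, count in sorted(report.items()):
    print(code_point, count)
```

Each code point in the resulting list would still need a human decision about whether standard Unicode already covers it.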

The U MACRON char is one redundantly defined char, for example. I believe it ought to be encoded as "Ū"; U+016A LATIN CAPITAL LETTER U WITH MACRON. Additionally, there are characters which don't have a single Unicode codepoint, but which could be encoded using a combining character, e.g. the various consonant letters with macrons, such as H MACRON and P MACRON, which could be encoded using a combination of "H" or "P" with U+0304 COMBINING MACRON to give H̄ and P̄.
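The difference between the two cases can be checked with Python's standard `unicodedata` module. In this small sketch the variable names are mine. U with macron has a precomposed code point, while H and P with macron exist only as base letter plus U+0304.

```python
import unicodedata

u_macron = "\u016a"    # precomposed LATIN CAPITAL LETTER U WITH MACRON
h_macron = "H\u0304"   # H followed by COMBINING MACRON
p_macron = "P\u0304"   # P followed by COMBINING MACRON

name = unicodedata.name(u_macron)
decomposed = unicodedata.normalize("NFD", u_macron)    # splits into U + U+0304
recomposed = unicodedata.normalize("NFC", "U\u0304")   # composes back to U+016A

# NFC cannot compose these, because no precomposed code point exists.
h_nfc = unicodedata.normalize("NFC", h_macron)
p_nfc = unicodedata.normalize("NFC", p_macron)
print(name, len(decomposed), len(recomposed), len(h_nfc), len(p_nfc))
```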

wehooper commented 5 years ago

The Latin special characters in the PUA have overlines, not macrons; an overline marks omission rather than vowel length or stress. From a typographic point of view, macrons are drawn relative to the height of the character in the cell, so if an h had a macron (and they don't), its macron would not align with a u-macron but would ride higher in the cell.

Overlines, by contrast, are a single stroke, so they are all drawn at the same height regardless of the character. Compare the u-macron and the u-overline.

This was an issue several years ago, and those PUA characters should remain.

Conal-Tuohy commented 5 years ago

I'm not sure I fully understand your comment, to be honest, Wally. Are you saying that the character entity reference &u-macron; used in the corpus actually refers to a "u" with an overline, and not a macron? Could we use a U+0305 COMBINING OVERLINE to compose these graphemes? Is that invalid from a semantic point of view?
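For concreteness, a small Python sketch of the composition I have in mind follows (the helper name `with_overline` is made up for illustration). It only shows that a letter plus U+0305 stays distinct from the macron form under normalization; whether that encoding is semantically acceptable is the open question.

```python
import unicodedata

COMBINING_MACRON = "\u0304"
COMBINING_OVERLINE = "\u0305"


def with_overline(base: str) -> str:
    """Append COMBINING OVERLINE to a base letter (illustrative helper)."""
    return base + COMBINING_OVERLINE


u_overline = with_overline("u")
u_macron = unicodedata.normalize("NFC", "u" + COMBINING_MACRON)

# The overline sequence has no precomposed form, so NFC leaves it unchanged,
# and it never collapses into the macron character.
u_overline_nfc = unicodedata.normalize("NFC", u_overline)
print(unicodedata.name(COMBINING_OVERLINE), u_overline_nfc != u_macron)
```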

This is taking me back to a time in the distant past when I wrote some typographical software applications, including tools for modifying various kinds of fonts to add macrons over vowels, to support publishing in the Māori language.

mdalmau commented 4 years ago

Not sure where we are with this issue, @wehooper?

mdalmau commented 4 years ago

Again, not sure where we are. I am tagging this issue as future so that @wehooper can clarify and we can address it later.