strip accents from Unicode characters if they are not wrapped in corresponding <hi rend> elements

andreabernini commented 9 years ago

P.Oxy. 67.4554 back (TM 68627) ln. 2 https://github.com/DCLP/idp.data/blob/master/DCLP/69/68627.xml#L106 (back, line 2) The word has two accents, but only the first one was written by the scribe. However, both are displayed in the apparatus.

Leiden+: Λαπ̣[ ί(´)]θ̣[αισ]ί̣. Text: Λαπ̣[ί(*)]θ̣[αισ]ί̣ App.: π̣ίθ̣[αισ]ί̣ papyrus

Expected in the apparatus: π̣ίθ̣[αισ]ι̣ papyrus (i.e., no acute accent on the final iota of this word)

paregorios commented 7 years ago

The corresponding XML is:

Λα<unclear>π</unclear><supplied reason="lost"><hi 
    rend="acute">ί</hi></supplied><unclear>θ</unclear><supplied 
    reason="lost">αισ</supplied><unclear>ί</unclear>

paregorios commented 7 years ago

Current transform gives us:

[πείθομαι καὶ σὺ]ν̣ Λαπ̣[ί(*)]θ̣[αισ]ί̣ σ̣ε̣ Κ̣ε̣ν̣ etc.

with apparatus:

back.2. π̣ίθ̣[αισ]ί̣ papyrus

In other words, it looks to me like this is still broken as described above.

paregorios commented 7 years ago

The first task, it seems to me, is to investigate whether apparatus creation is presently smart enough to strip accents from Unicode characters if they are not wrapped in corresponding <hi rend> elements.

wsalesky commented 7 years ago

@paregorios In the above comment does 'apparatus creation' refer to the XSLT transformation? If so I do think the XSLT can be tweaked to correctly strip accents. Currently this output is handled by epidoc-xslt/teiunclear.xsl in the following template <xsl:template match="t:unclear">

This template uses normalize-unicode($text-content) and currently does not specify a Unicode normalization form; if none is specified, 'NFC' is used as the default. I think we may be better off with 'NFKD', although you may have more experience with the normalization forms than I do. Choices available to normalize-unicode() are: NFC, NFD, NFKC, NFKD, and FULLY-NORMALIZED.

Thoughts?

rla2118 commented 7 years ago

Question about the XML here, <supplied reason="lost"><hi rend="acute">ί</hi></supplied> Hi rend isn't likely to be supplied. How do we know that it was lost? What TM number is this?

wsalesky commented 7 years ago

I think this is the line in the data: https://github.com/DCLP/idp.data/blob/master/DCLP/69/68627.xml#L106

jcowey commented 7 years ago

If you @rla2118 look on http://163.1.169.40/gsdl/collect/POxy/index/assoc/HASHd676/fccbe4ae.dir/POxy.v0067.n4554.a.across.hires.jpg you will see an acute accent visible over a supplemented letter (which is not visible). Not sure whether the XML is brilliant for this. But it works in Leiden+ and is validated by the editor.

paregorios commented 7 years ago

Ignoring for the moment the XML question raised by @rla2118, which is tangential to the focus of this ticket ...

Regarding the question above from @wsalesky I'd like to pull in @hcayless : what is the assumed normalization form for papyri.info content?

hcayless commented 7 years ago

The assumed normalization is NFC. That may no longer be necessary—fonts and font rendering have gotten a lot more sophisticated since we first made that decision. So NFD might be fine. I believe we were afraid NFK(C|D) might erase distinctions that the papyrologists care about.
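To make the trade-off between the forms concrete, here is a minimal sketch in Python (not the XSLT used by the stylesheets; `strip_marks` is a name invented for this illustration). It assumes only the standard `unicodedata` module. It shows that NFD separates a letter from its accent, that decomposition alone does not remove anything (a separate filtering step is needed), and that the compatibility forms fold lunate sigma into final sigma, which is the kind of lost distinction mentioned above.

```python
import unicodedata

# U+03AF: GREEK SMALL LETTER IOTA WITH TONOS (precomposed, as in NFC text).
iota_acute = "\u03af"

# NFD splits the letter from its accent; NFC recombines them.
nfd = unicodedata.normalize("NFD", iota_acute)
code_points = [hex(ord(c)) for c in nfd]  # base iota followed by combining acute

def strip_marks(text: str) -> str:
    """Hypothetical helper: decompose, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Compatibility normalization goes further than canonical normalization:
# lunate sigma (U+03F2) is left alone by NFD but replaced under NFKD.
lunate = "\u03f2"
nfd_lunate = unicodedata.normalize("NFD", lunate)
nfkd_lunate = unicodedata.normalize("NFKD", lunate)
```

Whatever form is chosen, stripping accents still needs an explicit step after decomposition; the normalization call by itself only changes how the accent is encoded.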

paregorios commented 7 years ago

@wsalesky by "apparatus" above I meant to call out a particular section of the HTML output created by the XSLT, specifically <div id="apparatus">.

I think that what's triggering generation of an apparatus entry in the case of this particular word (πίθαισι) is the presence of <hi rend="acute">. That's interesting to the papyrologists because, although the Greek accent marks were invented in the Hellenistic period, they were not ever (as I understand it) applied always or consistently thereafter. But it has been conventional in modern scholarly editions to impose the regular accentuation system on all texts. So, when an accent actually occurs in an ancient textual witness, the papyrologist wants to call attention to it. We do this in the XML with <hi rend="acute"> and in print representation with an apparatus entry (post-fixed with the word "papyrus" to signal we're now strictly transcribing said witness) that prints only those accents appearing originally.

@andreabernini's concern is that the accents included in the main text by the papyrologist/encoder are being printed in the corresponding apparatus entry. They need to be stripped out, and accents rendered only when there's a corresponding <hi rend> element.
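The intended rule can be sketched as follows. This is an illustration in Python, not the EpiDoc XSLT: the function names and the `REND_TO_MARK` table are invented here, the fragment is parsed without the TEI namespace, each `<hi>` is assumed to wrap a single character, and the brackets and underdots that the real apparatus adds for `<supplied>` and `<unclear>` are deliberately left out.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Assumed mapping from @rend values to combining marks (illustrative only).
REND_TO_MARK = {
    "acute": "\u0301", "grave": "\u0300", "circumflex": "\u0342",
    "asper": "\u0314", "lenis": "\u0313", "diaeresis": "\u0308",
}

def strip_marks(text: str) -> str:
    """Lowercase the text and remove every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def apparatus_text(elem: ET.Element) -> str:
    """Strip all editorial diacritics; re-add only the mark named by <hi rend>."""
    mark = REND_TO_MARK.get(elem.get("rend", "")) if elem.tag == "hi" else None
    parts = []
    if elem.text:
        bare = strip_marks(elem.text)
        # Assumes <hi> wraps one character, so the mark lands on that character.
        parts.append(bare + mark if mark else bare)
    for child in elem:
        parts.append(apparatus_text(child))
        if child.tail:
            parts.append(strip_marks(child.tail))  # tail text is outside the child
    return unicodedata.normalize("NFC", "".join(parts))

# The fragment from this ticket, wrapped in a root element for parsing.
FRAGMENT = (
    '<ab>Λα<unclear>π</unclear><supplied reason="lost"><hi rend="acute">ί</hi>'
    '</supplied><unclear>θ</unclear><supplied reason="lost">αισ</supplied>'
    '<unclear>ί</unclear></ab>'
)
rendered = apparatus_text(ET.fromstring(FRAGMENT))
```

For the triggering fragment, the first iota keeps its acute because it sits inside `<hi rend="acute">`, while the final iota loses the accent that exists only in the Unicode character.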

wsalesky commented 7 years ago

Can I get some other problematic records to test this on?

paregorios commented 7 years ago

So, here is a raft of examples that address edge cases that might be affected by our fixes. I'm pulling together examples of our problem now.

I find only two examples in DCLP of this particular behavior involving the acute accent:

TM68627 (the one that triggered this ticket)
TM62564, where POxy5039 fragment 2 has:

<lb n="2"/><supplied reason="lost">δ’ ἐν δρόμωι κάρυξ ἀνέειπ<hi rend="acute">έ</hi> </supplied> μιν ἀ<supplied reason="lost">γ</supplied>

Transformed text: [δ’ ἐν δρόμωι κάρυξ ἀνέειπέ(*) ] μιν ἀ[γ-] Apparatus entry: ^ POxy5039.2.2. ανεειπέ papyrus

There are no examples involving grave or circumflex accents (there are a few further such <hi rend> elements inside <supplied reason="lost">, but they have <gap> inside them instead of an accented character). I note also that DDB has a couple of examples of <supplied reason="lost">//<hi rend> involving accents in which the character contained within is not itself accented (T.Vindol. 2 194, side B, line 1 and O.Claud. 4 845, line 3); I find the omission of any indication of the editorial restoration in the latter example odd (and possibly wrong).

It occurs to me that we should look at the Greek "rough" and "smooth" breathings (spiritus "asper" and spiritus "lenis") in addition to the accents. Here are some straightforward "asper" examples drawn from TM370041:

<div n="1+2+5+6+7" subtype="fragment" type="textpart">
⋮
    <lb n="13"/><supplied reason="lost"><hi rend="asper">ὁ</hi></supplied>  ...
⋮
    <lb n="21"/><q><supplied reason="lost">καθὼς</supplied> αυ<unclear>τ</unclear>ο<gap reason="illegible" quantity="2" unit="character"/><supplied reason="lost"><hi rend="asper">ὁ</hi>

Transformed text:

[ὁ(*)] π̣(ατὴ)ρ ἡ̣[μ]ῶν ((mese-stigme)) καὶ ἐ(*)π̣ὶ̣[- ca.9 -]  ̣[- ca.10 -]
⋮
'[καθὼς] αυτ̣ο  ̣  ̣[ὁ(*) κ(ύριο)ς ἡμῶ]ν̣ κ̣αὶ θ(εὸ)ς Ἰ(ησοῦ)ς ὁ Χ(ριστὸ)ς [ -5-6- ]

Apparatus entries:

^ 1+2+5+6+7.13. ὁ papyrus
⋮
^ 1+2+5+6+7.21. [ημω]ὁ papyrus

But the (meagre) examples of <hi rend="lenis"> I find inside <supplied reason="lost"> seem weird to me. Perhaps @jcowey or @HolgerEssler or @rla2118 can explain what's going on in this one. Based on the examples we're considering in this ticket, I'd have thought the Unicode diacritical mark would be here in addition to the <hi rend> (but see comment on DDB encoding examples immediately above):

TM63423

<lb n="3"/>Χαιρήμων <num value="1609"><hi rend="lenis">Α</hi>χθ</num>
<lb n="4"/>ὁ εὐγενέστατος <num value="1609"><hi rend="lenis">Α</hi>χθ</num>

Transformed text:

Χαιρήμων Α(*)χθ
ὁ εὐγενέστατος Α(*)χθ

Apparatus entries:

^ 3. ἀχθ papyrus
^ 4. ἀχθ papyrus

It occurs to me to throw diaeresis into the mix here as well, especially when we find some examples of its indication in the context of a character with which accent or breathing is Unicode-encoded. E.g.:

TM119313, folio 8b, column i, line 19:

<lb n="19" break="no"/><supplied reason="lost">τοὺς ἔσυρον <hi rend="diaeresis">Ἰ</hi>άσ</supplied>ο

Transformed text: [τοὺς ἔσυρον Ἰ(*)άσ]ο- Apparatus entry: ^ 8b.i.19. ϊασ papyrus

TM59332, folio Er, lines 3 and 13:

<lb n="3" break="no"/>ώσας ἀνάλαβ<supplied reason="lost">ε <app type="alternative"><lem><choice><reg>ποιεῖ</reg><orig>ποι<hi rend="diaeresis">ῖ</hi></orig></choice> δὲ</lem><rdg>χρῶ δὲ</rdg><rdg>καὶ χρῶ</rdg></app> καὶ</supplied>
⋮
<lb n="13"/><supplied reason="lost">ἰ</supplied><unclear>ο</unclear>ῦ ξυστοῦ <expan><ex>οὐγκ </ex></expan> <gap reason="lost" quantity="1" unit="character"/> <supplied reason="lost"><choice><reg>ποιεῖ</reg><orig>ποι<hi rend="diaeresis">ῖ</hi></orig></choice> δὲ καὶ</supplied>

Transformed text:

ώσας ἀνάλαβ[ε ποιῖ(*) δὲ(*) καὶ]
⋮
[ἰ]ο̣ῦ ξυστοῦ (οὐγκ ) [  ̣ ποιῖ(*)(*) δὲ καὶ]

Apparatus entries:

^ E.r.3. l. ποιεῖ, or χρῶ δὲ, or καὶ χρῶ : ποιϊ papyrus
⋮
^ E.r.13. l. ποιεῖ : ποιϊ papyrus

Compare folio Br, line 3 in which we encounter <hi rend="diaeresis"> wrapping a character without an accent.

TM60191, side B, fragment b, line 4:

<lb n="4" rend="ekthesis"/><app type="editorial"><lem>Δ<unclear>α</unclear></lem><rdg resp="ed.pr."><unclear>Δ</unclear></rdg></app><unclear>βα</unclear><supplied reason="lost">ρυηκο<hi rend="diaeresis">ί</hi>αι. τὰς κυούσας φαρμακεύ</supplied>

Transformed text: Δα̣(*)β̣α̣[ρυηκοί(*)αι. τὰς κυούσας φαρμακεύ-] Apparatus entry: ^ B.b.4. ϊαι. papyrus

TM60193, fragment a, side A, line 13:

<lb n="13"/><unclear>ὁκ</unclear><supplied reason="lost">όσ</supplied><unclear>α</unclear><supplied reason="lost">ι</supplied> καθύγρους <supplied reason="lost">ἔχου</supplied>σιν τὰς μ<supplied reason="lost">ήτρας οὐ κυ<hi rend="diaeresis">ί</hi>σκουσιν.</supplied>

Transformed text: ὁ̣κ̣[όσ]α̣[ι] καθύγρους [ἔχου]σιν τὰς μ[ήτρας οὐ κυί(*)σκουσιν.] Apparatus entry: ^ a.A.13. κυϊσκουσιν. papyrus

wsalesky commented 7 years ago

Restating the problem

In apparatus creation, strip accents from Unicode characters if they are not wrapped in corresponding <hi rend> elements.

XML Changes:

epidoc-xslt/tpl-apparatus.xsl
- in template: recurse_forward
- added a call to <xsl:call-template name="trans-string"> to handle all non-specified elements.

NOTE: We may also need to add it to 'recurse_back' if this code does seem to act as expected.

Some results: From the triggering record: https://github.com/DCLP/idp.data/blob/master/DCLP/69/68627.xml#L106 back.2. π̣ίθ̣[αισ]ι̣ papyrus

From git diff:

diff --git a/output/dclp/113/112358.html b/output/dclp/113/112358.html
index 42a5f57a3..b4b420a7a 100644
--- a/output/dclp/113/112358.html
+++ b/output/dclp/113/112358.html
@@ -131,7 +131,7 @@

- <span>ϊωά̣ν̣[νου]
+ <span>ϊωα̣ν̣[νου]

diff --git a/output/dclp/113/112359.html b/output/dclp/113/112359.html
index c445eacbc..00f040009 100644
--- a/output/dclp/113/112359.html
+++ b/output/dclp/113/112359.html
@@ -131,7 +131,7 @@

- <span>ϋμ̣ῶ̣ν
+ <span>ϋμ̣ω̣ν

- <span>βηθ’σαϊδ̣[ά-
+ <span>βηθ’σαϊδ̣[α-

Some possible issues:

DCLP/114/113264.xml

- <span>πολέα[ς]
+ <span>πολέα[σ]

relevant xml: <hi rend="acute">έ</hi>α<supplied reason="lost">ς</supplied>

It looks like this may be a problem with the translate string. (epidoc-xslt/global-varsandparams.xsl)

And

diff --git a/output/dclp/129/128953.html b/output/dclp/129/128953.html
index 5261c6fd2..7a16caa5e 100644
--- a/output/dclp/129/128953.html
+++ b/output/dclp/129/128953.html
@@ -121,10 +121,10 @@

added ‘-‘ for 
- κληρονομ[ί]<br
+ κληρονομ[ί-]<br

Relevant xml:

<lb n="2"/>σου<g type="mese-stigme"/> <expan>κ<ex>αὶ</ex></expan> εὐλόγησ<supplied reason="lost">ον</supplied>
<lb n="3"/>τὴν κληρονομ<supplied reason="lost">ί</supplied>

@paregorios ready for review, and possibly further discussion.

paregorios commented 7 years ago

@wsalesky yeah, well spotted. Something's changing a final sigma (ς) in the XML into a medial sigma (σ) and that's not going to fly with the papyrologists. More soon.

wsalesky commented 7 years ago

It is here: epidoc-xslt/global-varsandparams.xsl

<xsl:variable name="all-grc">
<xsl:text>abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZἀἁἂἃἄἅἆἇἈἉἊἋἌἍἎἏΑὰάᾀᾁᾂᾃᾄᾅᾆᾇᾈᾉᾊᾋᾌᾍᾎᾏᾲᾳᾴᾶᾷάέΕἐἑἒἓἔἕἘἙἚἛἜἝὲέΗήἠἡἢἣἤἥἦἧἨἩἪἫἬἭἮἯᾐᾑᾒᾓᾔᾕᾖᾗᾘᾙᾚᾛᾜᾝᾞᾟῂῃῄῆῇὴήΙίϊἰἱἲἳἴἵἶἷἸἹἺἻἼἽἾἿὶίῒΐΐῖῗΟόὀὁὂὃὄὅὈὉὊὋὌὍὸό΅ύὐὑὒὓὔὕὖὗὙὛὝὟὺύῢΰΰῦῧϋΩώὠὡὢὣὤὥὦὧὨὩὪὫὬὭὮὯὼώᾠᾡᾢᾣᾤᾥᾦᾧᾨᾩᾪᾫᾬᾭᾮᾯῲῳῴῶῷςῤῥαβγδεζηθικλμνξοπρστυφχψωῬΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ</xsl:text>  
</xsl:variable>

Replaced by (using translate()):

<xsl:variable name="grc-lower-strip">
<xsl:text>abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzαααααααααααααααααααααααααααααααααααααααααεεεεεεεεεεεεεεεεηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηιιιιιιιιιιιιιιιιιιιιιιιιιιοοοοοοοοοοοοοοοουυυυυυυυυυυυυυυυυυυυυυωωωωωωωωωωωωωωωωωωωωωωωωωωωωωωωωωωωωωωωωωσρραβγδεζηθικλμνξοπρστυφχψωραβγδεζηθικλμνξοπρστυφχψω</xsl:text>
</xsl:variable>

Position 290 (when opened in oXygen) in the first string is 'ς' and the corresponding place in the second, replacement string is 'σ'. I don't know if changing this here will cause mayhem elsewhere, as this does seem to be a deliberate substitution.
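The position-by-position behaviour of translate() can be imitated in Python with `str.maketrans`. The sketch below uses a short, made-up excerpt of the two strings (the real variables are far longer), and `trans_string` here is only a stand-in named after the XSLT template. It shows why a final sigma in the first string paired with a medial sigma in the second changes πολέας into πολεασ.

```python
import unicodedata

# Hypothetical excerpt of the two lookup strings, for illustration only.
SOURCE = unicodedata.normalize("NFC", "άέήίόύώς")
TARGET = "αεηιουωσ"
assert len(SOURCE) == len(TARGET)  # each index in SOURCE pairs with the same index in TARGET

TABLE = str.maketrans(SOURCE, TARGET)

def trans_string(text: str) -> str:
    """Replace each character found in SOURCE with its counterpart in TARGET."""
    # Input is normalized to NFC so precomposed characters match the table.
    return unicodedata.normalize("NFC", text).translate(TABLE)

example = trans_string("πολέας")
```

Characters that do not occur in the first string pass through unchanged, so the sigma substitution happens only because final sigma is listed explicitly.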

paregorios commented 7 years ago

@wsalesky can you conduct a code census to see where else in the XSLT the named "trans-string" template is called? That would give us the visibility necessary to figure out (hopefully) whether this is just a universal bug or if making the change we're contemplating would break something else.

wsalesky commented 7 years ago

@paregorios Sure. I ran a search across all the stylesheets for the use of the "$all-grc" variable. This variable is used exclusively in the translate() function either via the "trans-string" template or as part of the text processing of other templates. Here is where these variables are used:

XSLT epidoc-xsl/tpl-apparatus.xsl Makes heavy use of this via Line 1400:

 <xsl:template name="trans-string">
      <!-- transforms context of <hi> into lowercase unaccented for rendering in app -->
      <xsl:param name="trans-text" select="."/>
      <xsl:value-of select="translate($trans-text, $all-grc, $grc-lower-strip)"/>
   </xsl:template>

This named template is called 21 times in this stylesheet. This is where the diacritics are stripped out for apparatus display (i.e., what we are targeting in this ticket).

epidoc-xsl/teiunclear.xsl Called in: <xsl:template match="t:unclear">

epidoc-xsl/teiorig.xsl Called in: <xsl:template match="t:orig[not(parent::t:choice)]//text()" priority="1">

epidoc-xsl/teichoice.xsl Called in: <xsl:template match="t:choice">

epidoc-xsl/top-text.xsl Used on line 13; however, I do not think this template is used by our data, nor do I think it would be affected by the change.
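A census like the one above could be repeated with a small script. This is a rough sketch, not the search that was actually run: the `census` function and the `NEEDLES` tuple are invented for illustration, and the directory passed in is whatever local checkout holds the stylesheets.

```python
from pathlib import Path

# Strings to look for: the variable itself and calls to the named template.
NEEDLES = ("$all-grc", 'name="trans-string"')

def census(root):
    """Return {relative path: [(line number, needle), ...]} for .xsl files under root."""
    hits = {}
    for path in sorted(Path(root).rglob("*.xsl")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            for needle in NEEDLES:
                if needle in line:
                    key = str(path.relative_to(root))
                    hits.setdefault(key, []).append((number, needle))
    return hits
```

A plain substring search will also report the template definition itself and any commented-out uses, so the results still need a manual read-through.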

paregorios commented 7 years ago

I'm taking this to do a code review.

paregorios commented 7 years ago

I talked to @hcayless about this and he thinks that the sigma substitution may reflect intentional practice. He suggests we consult papyrologists, which I will do via email.

paregorios commented 7 years ago

I've written to @rla2118 @rogerbagnall @jcowey and @jds15 as follows:

Is there any sort of established practice whereby a final sigma rendered in an edition might be rendered as a medial sigma in apparatus?

This question arose in the course of testing a fix to a transformation problem involving accents actually written by the scribe. Winona discovered that a conversion template was not being invoked in the particular edge case Andrea Bernini was testing (apparently a long-standing logical oversight in the IDP-vintage EpiDoc stylesheets that is readily fixed). But while inspecting this conversion template, she discovered that (in generating the apparatus) the template not only strips unwanted accents, but also converts final sigma to medial sigma. So the question is: is this a mistake or is it by design?

rogerbagnall commented 7 years ago

I agree with Josh; but it is a good argument for lunate sigma in apparatus.

paregorios commented 7 years ago

@jds15 responded via email as follows (to which @rogerbagnall refers above):

by design, i am quite sure

Estd practice, i dont know. But dictating logic, sure.

Scribe writes τασ and corrects to τησ ; we print τῆς in text and indicate corr ex τασ in app (we decided that it would be less true to say “corr. ex τας”); IOW terminal sigma in that case is like a diacritical.

paregorios commented 7 years ago

To which I replied:

So, if the scribe had actually used terminal sigma, would XML markup have been used to signal that fact?

jds15 commented 7 years ago

no.

paregorios commented 7 years ago

@wsalesky so the upshot is that this sigma substitution is by design and therefore we should not alter the content of the variable grc-lower-strip. Given that's the case, is the relevant issue branch ready for review?

Separately, I'll undertake to keep the discussion moving (under a separate ticket) about whether we should be switching to lunate sigma etc. as suggested by @rogerbagnall above.

paregorios commented 7 years ago

For continuation of the lunate sigma discussion, see now #280.

wsalesky commented 7 years ago

Yes, the branch is ready for review.

paregorios commented 7 years ago

@wsalesky thanks

paregorios commented 7 years ago

this is resolved

DCLP / dclpxsltbox

strip accents from Unicode characters if they are not wrapped in corresponding <hi rend> elements #163