FrankensteinVariorum / TAN-2021

TAN version 2021: customization experiment for the Frankenstein Variorum Project

Generate app and rdgGrp structure from c and u #3

Open ebeshero opened 2 years ago

ebeshero commented 2 years ago

Find where the <c> and <u> collation structure is generated. Can this be modified in the TAN collate source to rename <u> as <rdgGrp>, and (crucially) bundle the moments of divergence in <app>? Without this, the <rdgGrp>s are just sibling elements, without an indication of their delta relationship (or the delta is just implied), and they're harder to process. Having to bundle them up after the collation will be more challenging than capturing them as they're formed. Why challenging? Because we cannot always expect just one set of related <u> elements in between each <c>.

@Arithmeticus This is probably the most important question I have now for applying tan collate() in our workflow. Can you help?

Example (this isn't the best example, because the <u> siblings are members of the same divergent group, but imagine if there were three or four sets of <u>s generated between two <c>s):

TAN collation output

<u>
         <txt>&lt;p/&gt; &lt;p/&gt;Mr</txt>
         <wit ref="1818_fullFlat_C27" pos="3533"/>
         <wit ref="Thomas_fullFlat_C27" pos="3545"/>
         <wit ref="1823_fullFlat_C27" pos="3531"/>
         <wit ref="1831_fullFlat_C27" pos="3535"/>
      </u>
      <u>
         <txt> Mr</txt>
         <wit ref="msColl_C27" pos="3004"/>
      </u>
      <c>
         <txt>. </txt>
         <wit ref="1818_fullFlat_C27" pos="3544"/>
         <wit ref="Thomas_fullFlat_C27" pos="3556"/>
         <wit ref="1823_fullFlat_C27" pos="3542"/>
         <wit ref="1831_fullFlat_C27" pos="3546"/>
         <wit ref="msColl_C27" pos="3007"/>
      </c>
      <u>
         <txt>Kirwin,</txt>
         <wit ref="1818_fullFlat_C27" pos="3546"/>
         <wit ref="Thomas_fullFlat_C27" pos="3558"/>
         <wit ref="1823_fullFlat_C27" pos="3544"/>
         <wit ref="1831_fullFlat_C27" pos="3548"/>
      </u>
      <u>
         <txt>Kirwin</txt>
         <wit ref="msColl_C27" pos="3009"/>
      </u>
      <c>
         <txt> on hearing this </txt>
         <wit ref="1818_fullFlat_C27" pos="3553"/>
         <wit ref="Thomas_fullFlat_C27" pos="3565"/>
         <wit ref="1823_fullFlat_C27" pos="3551"/>
         <wit ref="1831_fullFlat_C27" pos="3555"/>
         <wit ref="msColl_C27" pos="3015"/>
      </c>

Desired output (same collation data, bundled in <app> elements):

<app>
   <rdgGrp>
         <txt>&lt;p/&gt; &lt;p/&gt;Mr</txt>
         <wit ref="1818_fullFlat_C27" pos="3533"/>
         <wit ref="Thomas_fullFlat_C27" pos="3545"/>
         <wit ref="1823_fullFlat_C27" pos="3531"/>
         <wit ref="1831_fullFlat_C27" pos="3535"/>
      </rdgGrp>
      <rdgGrp>
         <txt> Mr</txt>
         <wit ref="msColl_C27" pos="3004"/>
      </rdgGrp>
</app>
<app>
      <rdgGrp>
         <txt>. </txt>
         <wit ref="1818_fullFlat_C27" pos="3544"/>
         <wit ref="Thomas_fullFlat_C27" pos="3556"/>
         <wit ref="1823_fullFlat_C27" pos="3542"/>
         <wit ref="1831_fullFlat_C27" pos="3546"/>
         <wit ref="msColl_C27" pos="3007"/>
      </rdgGrp>
</app>
<app>
      <rdgGrp>
         <txt>Kirwin,</txt>
         <wit ref="1818_fullFlat_C27" pos="3546"/>
         <wit ref="Thomas_fullFlat_C27" pos="3558"/>
         <wit ref="1823_fullFlat_C27" pos="3544"/>
         <wit ref="1831_fullFlat_C27" pos="3548"/>
      </rdgGrp>
      <rdgGrp>
         <txt>Kirwin</txt>
         <wit ref="msColl_C27" pos="3009"/>
      </rdgGrp>
</app>
<app>
      <rdgGrp>
         <txt> on hearing this </txt>
         <wit ref="1818_fullFlat_C27" pos="3553"/>
         <wit ref="Thomas_fullFlat_C27" pos="3565"/>
         <wit ref="1823_fullFlat_C27" pos="3551"/>
         <wit ref="1831_fullFlat_C27" pos="3555"/>
         <wit ref="msColl_C27" pos="3015"/>
      </rdgGrp>
</app>

ebeshero commented 2 years ago

Looking for a place to intervene in TAN-fn-strings-collate-standard.xsl. What about setting an <app> element down as the first event in the for-each-group on lines 418-419?

Arithmeticus commented 2 years ago

The output of tan:collate() can be redirected in any number of ways, not just TEI but docx, HTML, etc. I'm inclined to keep the diff/collate output generic, but propose a postprocessing function. Let me throw together an idea or two.

jk


-- Joel Kalvesmaki kalvesmaki.com

Arithmeticus commented 2 years ago

Looking at your desired output, it is half TEI, half non-TEI. Is that intentional, or are you going to do something different downstream that is un-TEI-like?

I'm wondering why, using "Kirwin," versus "Kirwin" as my example, you wouldn't prefer something like:

<app>
   <lem wit="#1818_fullFlat_C27 #1823_fullFlat_C27 
      #1831_fullFlat_C27 #Thomas_fullFlat_C27">Kirwin,</lem>
   <rdg wit="#msColl_C27">Kirwin</rdg>
</app>

Isn't that what most TEI users would want to do if they tried to convert tan:collate() output to TEI?

Also I'm uncertain why <c>s are getting <app> treatment, as if they are subject to variants (that's where every witness agrees). Perhaps this gets back to my initial question about any un-TEI-like goals you have. If that's the case, then a very simple custom <xsl:transform> should make the portmanteau you need. Something like this:

<xsl:mode name="collate-to-my-semi-tei" on-no-match="shallow-skip"/>
<xsl:template match="tan:collation" mode="collate-to-my-semi-tei">
   <xsl:for-each-group select="tan:c | tan:u" group-adjacent="name(.)">
      <app xmlns="http://www.tei-c.org/ns/1.0">
         <xsl:apply-templates select="current-group()" mode="#current"/>
      </app>
   </xsl:for-each-group> 
</xsl:template>
<xsl:template match="tan:c | tan:u" mode="collate-to-my-semi-tei">
   <rdgGrp xmlns="http://www.tei-c.org/ns/1.0">
      <xsl:copy-of select="node()"/>
   </rdgGrp>
</xsl:template>

ebeshero commented 2 years ago

@Arithmeticus Okay, looking back at this, my example wasn't fully "fleshed out". I was just concentrating on converting the <c>s and <u>s into bundled <rdgGrp>s within <app>. And the bundling of those into <app> elements struck me as something we might want to happen at the moment they're collated because it's organizing related passages, easy when the witnesses run in unison, but more questionable when they do not. How do you know when a series of <u> elements output by tan:collate() are related to one another as an instance of divergence--a fork? So when I hastily plotted my example, I didn't flesh it out from the tan:collate() output--I was really just wrapping my head around the <app> question. And on that question--when is it desirable to get the information of a related moment of divergence out of tan:collate()? (Surely not in post-processing?) It's the most important part of the alignment.

So, let me show you what the full TEI output looks like, with a few more comments...

ebeshero commented 2 years ago

@Arithmeticus Here's what we're really doing in the Frankenstein Variorum with TEI critical apparatus. We are following the method of Parallel Segmentation as described in the TEI Guidelines Chapter 12.

Indeed, we want to see the text of the original witnesses output in the <rdg wit="X"> elements, but we also want to preserve a view of the normalized tokens at each point for diagnostic reasons.

We definitely don't favor <lem> because there is no favored witness or authoritative agreed-on standard text of Frankenstein. Our edition does not assert a lemma, so we don't use that <lem> element. TEI certainly does permit all <rdg> elements and no <lem>, as <lem> is just an option in TEI critical apparatus markup.

The structure of the edition is driven by the simple hierarchy of <rdgGrp> element(s) nested within <app>. We are keeping with a pretty simple, shallow structure. Having all witnesses running in unison is asserted by there being a count of one <rdgGrp> within an <app>.

So here's what we're building out of collation software, and this is mostly standard TEI with just one unusual dimension: an explicit view of the normalized tokens for comparison. Here's a view of how CollateX generates the collation output in its TEI critical apparatus form. There's just one modification to their standard TEI critical apparatus output as I recall, and that's the @n attribute on the <rdgGrp> element, which is holding the normalized tokens shared by the witnesses inside.

  <app>
            <rdgGrp n="['i', 'saw', 'the', 'dull', 'yellow', 'eye', 'of', 'the', 'creature']">
                  <rdg wit="f1818">I saw the dull yellow eye of the creature </rdg>
                  <rdg wit="f1823">I saw the dull yellow eye of the creature </rdg>
                  <rdg wit="fThomas">I saw the dull yellow eye of the creature </rdg>
                  <rdg wit="f1831">I saw the dull yellow eye of the creature </rdg>
                  <rdg wit="fMS">I saw the dull yellow eye of &lt;lb n="c56-0045__main__14"/&gt; the creature </rdg>
            </rdgGrp>
      </app>
      <app>
            <rdgGrp n="['open.—it']">
                  <rdg wit="fMS">open.—It </rdg>
            </rdgGrp>
            <rdgGrp n="['open;', 'it']">
                  <rdg wit="f1818">open; it </rdg>
                  <rdg wit="f1823">open; it </rdg>
                  <rdg wit="fThomas">open; it </rdg>
                  <rdg wit="f1831">open; it </rdg>
            </rdgGrp>
      </app>

We need to start there--that's actually the structure of the output we need from the collation process. Afterwards, we post-process this in two ways:

  1. We use it to reconstitute the original editions and plant <seg> elements in them with @xml:ids holding moments of variance (and these form the basis of the "heatmap" reading view of each edition). So it will become like this in the TEI of the Thomas edition:
I saw the dull yellow eye of the creature <seg xml:id="C10_app29-fThomas">open; it </seg>

(Where there is a single <rdgGrp> in an <app>, which is the condition of your <c> element, we don't output a <seg>. Where there's a count of more than one <rdgGrp> within an <app>, we plant a <seg> element in the text.)

  2. We convert the critical apparatus into standoff markup, replacing the text in the <rdg> elements with pointers to the <seg> elements. This is also fine by the TEI, just another way of expressing a critical apparatus. Here's what this will become as a form of TEI standoff annotation in our post-processing. In the full file, you'll see it gets its own TEI header at this point.
 <app xml:id="C10_app29" n="2">
               <rdgGrp xml:id="C10_app29_rg1" n="['open.—it']">
                  <rdg wit="#fMS">
                     <ptr target="https://raw.githubusercontent.com/PghFrankenstein/fv-data/master/variorum-chunks/fMS_C10.xml#string-range(//tei:surface[@xml:id='ox-ms_abinger_c56-0045']/tei:zone[@type='main']//tei:line[14],14,23)"/>
                  </rdg>
               </rdgGrp>
               <rdgGrp xml:id="C10_app29_rg2" n="['open;', 'it']">
                  <rdg wit="#f1818">
                     <ptr target="https://raw.githubusercontent.com/PghFrankenstein/fv-data/master/variorum-chunks/f1818_C10.xml#C10_app29-f1818"/>
                  </rdg>
                  <rdg wit="#f1823">
                     <ptr target="https://raw.githubusercontent.com/PghFrankenstein/fv-data/master/variorum-chunks/f1823_C10.xml#C10_app29-f1823"/>
                  </rdg>
                  <rdg wit="#fThomas">
                     <ptr target="https://raw.githubusercontent.com/PghFrankenstein/fv-data/master/variorum-chunks/fThomas_C10.xml#C10_app29-fThomas"/>
                  </rdg>
                  <rdg wit="#f1831">
                     <ptr target="https://raw.githubusercontent.com/PghFrankenstein/fv-data/master/variorum-chunks/f1831_C10.xml#C10_app29-f1831"/>
                  </rdg>
               </rdgGrp>
            </app>
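
The <seg> rule described in step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual post-processing code: the `witness_text` function, the `listApp` wrapper, and the `C10_app28` id are hypothetical, and the seg id pattern (app id, hyphen, witness) is inferred from the `C10_app29-fThomas` example above.

```python
import xml.etree.ElementTree as ET

# Namespaced name that ElementTree uses for the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, shortened apparatus: one unison <app>, one divergent <app>.
APPS = """<listApp>
  <app xml:id="C10_app28">
    <rdgGrp><rdg wit="f1818">I saw </rdg><rdg wit="fMS">I saw </rdg></rdgGrp>
  </app>
  <app xml:id="C10_app29">
    <rdgGrp><rdg wit="fMS">open.&#8212;It </rdg></rdgGrp>
    <rdgGrp><rdg wit="f1818">open; it </rdg></rdgGrp>
  </app>
</listApp>"""


def witness_text(list_app, wit):
    """Rebuild one witness, wrapping its reading in <seg> only where the
    <app> holds more than one <rdgGrp> (a moment of variance)."""
    parts = []
    for app in list_app.findall("app"):
        is_variant = len(app.findall("rdgGrp")) > 1
        for rdg in app.iter("rdg"):
            if rdg.get("wit") != wit:
                continue
            text = rdg.text or ""
            if is_variant:
                seg_id = f"{app.get(XML_ID)}-{wit}"
                parts.append(f'<seg xml:id="{seg_id}">{text}</seg>')
            else:
                parts.append(text)
    return "".join(parts)


print(witness_text(ET.fromstring(APPS), "f1818"))
```

The sketch builds the result as a plain string for brevity; a real pipeline would need to escape or re-parse the witness text, since the readings carry flattened markup.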

So about the "TEI-ness" of this: I was part of a TEI-Council-hosted panel on the TEI Critical Apparatus at the TEI 2019 conference in Graz, thinking about what the TEI critical apparatus can do, and yes, the Frankenstein Variorum is trying to say something about the critical apparatus as a way of moving interchangeably between differently-marked editions: here's my part of the slide deck which pretty much expresses what we are trying to do: https://slides.com/elisabeshero-bondar/app-crit#/2

Arithmeticus commented 2 years ago

There are a few issues you raise, and I propose we focus for now on what I think is the primary one: the method/technique of consolidating a group of adjacent <u>s to a <rdgGrp>.

If you are content treating all adjacent <u>s as a single block, then the XSLT template code I proposed earlier should work fine, with the proviso that instead of <xsl:copy-of select="node()"/> you may prefer to do <xsl:for-each> over each witness. It's up to you how you rebuild.
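
For readers who want to experiment outside XSLT, here is a rough Python analogue of the "adjacent block" approach, rebuilding one <rdg> per witness. It is a sketch under assumptions: the sample input drops the TAN namespace and shortens the witness names, the `bundle` function and the `listApp` wrapper are illustrative choices, and each <rdg> receives the shared <txt> string rather than the witness's original text.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Simplified, un-namespaced stand-in for tan:collate() output.
SAMPLE = """<collation>
  <u><txt>Kirwin,</txt><wit ref="1818" pos="3546"/><wit ref="1831" pos="3548"/></u>
  <u><txt>Kirwin</txt><wit ref="msColl" pos="3009"/></u>
  <c><txt> on hearing this </txt><wit ref="1818" pos="3553"/>
     <wit ref="1831" pos="3555"/><wit ref="msColl" pos="3015"/></c>
</collation>"""


def bundle(collation):
    """Wrap each run of same-named adjacent <c>/<u> siblings in one <app>."""
    root = ET.Element("listApp")
    units = [el for el in collation if el.tag in ("c", "u")]
    # groupby only merges neighbours, like group-adjacent="name(.)".
    for _, run in groupby(units, key=lambda el: el.tag):
        app = ET.SubElement(root, "app")
        for unit in run:
            grp = ET.SubElement(app, "rdgGrp")
            text = unit.findtext("txt", default="")
            # One <rdg> per witness instead of copying <txt>/<wit> as-is.
            for wit in unit.findall("wit"):
                rdg = ET.SubElement(grp, "rdg", wit="#" + wit.get("ref"))
                rdg.text = text
    return root


result = bundle(ET.fromstring(SAMPLE))
print(ET.tostring(result, encoding="unicode"))
```

With this input the two <u>s land in one <app> with two <rdgGrp>s, and the <c> becomes an <app> with a single <rdgGrp>.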

A more difficult question arises if you are not happy treating adjacent <u>s as a single block and you want to identify chunks of, as you put it, related moments of divergence (RMDs). The problem is that RMDs, like an Escher drawing, tessellate, overlap, and interleave, and two or more people could reasonably disagree on the principle for grouping those RMDs.

Consider the following:

XX12c34g78YY
XX12de567890YY
XXa3456h90YY
XXb23f78YY

Anyone wanting to cluster/group the differences between the common substrings XX and YY will need to make decisions about where one <rdgGrp> should end and another begin. Should "2c3" and "23" form an RMD? Answers to such questions require a biased criterion that prioritizes some RMDs or witnesses over others.

Out of the box, tan:collate() attempts to capture differences and commonality in a viable but highly granular sequence of <u>s that preserve the sequence of text in each version. Bias is given to the witnesses that have the highest average of pairwise commonality, locally within the block of adjacent <u>s. The example above has the following output:

<c>
   <txt>XX</txt>
   <wit ref="1" pos="1"/>
   <wit ref="2" pos="1"/>
   <wit ref="4" pos="1"/>
   <wit ref="3" pos="1"/>
</c>
<u>
   <txt>1</txt>
   <wit ref="1" pos="3"/>
   <wit ref="2" pos="3"/>
</u>
<u>
   <txt>b</txt>
   <wit ref="4" pos="3"/>
</u>
<u>
   <txt>2</txt>
   <wit ref="1" pos="4"/>
   <wit ref="2" pos="4"/>
   <wit ref="4" pos="4"/>
</u>
<u>
   <txt>c34g</txt>
   <wit ref="1" pos="5"/>
</u>
<u>
   <txt>de56</txt>
   <wit ref="2" pos="5"/>
</u>
<u>
   <txt>a</txt>
   <wit ref="3" pos="3"/>
</u>
<u>
   <txt>3</txt>
   <wit ref="4" pos="5"/>
   <wit ref="3" pos="4"/>
</u>
<u>
   <txt>f</txt>
   <wit ref="4" pos="6"/>
</u>
<u>
   <txt>78</txt>
   <wit ref="1" pos="9"/>
   <wit ref="2" pos="9"/>
   <wit ref="4" pos="7"/>
</u>
<u>
   <txt>90</txt>
   <wit ref="2" pos="11"/>
</u>
<u>
   <txt>456h90</txt>
   <wit ref="3" pos="5"/>
</u>
<c>
   <txt>YY</txt>
   <wit ref="1" pos="11"/>
   <wit ref="2" pos="13"/>
   <wit ref="4" pos="9"/>
   <wit ref="3" pos="11"/>
</c>

With so many RMDs, choices have to be made; that is why, for example, the RMD "34" (shared by witnesses 1 and 3) wasn't captured. Try capturing it: you must sacrifice some other RMD if you are committed to a granular approach that preserves the text order of each witness.
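
The claim that the granular sequence preserves each witness's text order can be checked mechanically. The Python sketch below transcribes the <c>/<u> output above as (text, {witness: pos}) pairs and rebuilds every witness by reading the units in document order. The `rebuild` function is a hypothetical helper, and the reading of @pos as a 1-based character offset is inferred from this one example rather than taken from TAN documentation.

```python
# The <c>/<u> sequence above, transcribed as (txt, {witness ref: pos}).
UNITS = [
    ("XX", {1: 1, 2: 1, 4: 1, 3: 1}),
    ("1", {1: 3, 2: 3}),
    ("b", {4: 3}),
    ("2", {1: 4, 2: 4, 4: 4}),
    ("c34g", {1: 5}),
    ("de56", {2: 5}),
    ("a", {3: 3}),
    ("3", {4: 5, 3: 4}),
    ("f", {4: 6}),
    ("78", {1: 9, 2: 9, 4: 7}),
    ("90", {2: 11}),
    ("456h90", {3: 5}),
    ("YY", {1: 11, 2: 13, 4: 9, 3: 11}),
]


def rebuild(units, ref):
    """Concatenate, in output order, every unit that lists witness `ref`."""
    text = ""
    for txt, wits in units:
        if ref in wits:
            # Assumption: pos is the 1-based offset of this unit in the witness.
            assert wits[ref] == len(text) + 1
            text += txt
    return text


for ref in (1, 2, 3, 4):
    print(ref, rebuild(UNITS, ref))
```

Each witness comes back as its original string, which is the constraint any alternative grouping (such as one that captures "34") would also have to satisfy.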

If granularity is not important, and you want to group RMDs in constellations, so to speak, you still have to choose some criterion for constellation formation. That faces the original problem: tessellation and overlap. The example above is actually rather simple. As the number of versions grows, so do the challenges behind the choices.

If you cannot articulate a criterion for the creation of RMD constellations, then you cannot code a solution. If you can articulate one, you might be able to. (And you have to be aware that other people may vehemently disagree with the principles you adopt.) Then you start. In my XSLT code in an answer above I proposed <xsl:for-each-group select="tan:c | tan:u" group-adjacent="name(.)">. Instead of name(.) you should define your own function that returns a value for distinct RMD constellations, based upon how you envision the ideal of constellation formation.
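
To make the role of the grouping key concrete, here is a small Python sketch in which the key function is the only thing that changes. The `by_majority` criterion is purely hypothetical, invented for illustration (it treats a <u> shared by more than half of the witnesses as a boundary of its own); it is not something tan:collate() provides, and it is exactly the kind of principle others might reject.

```python
from itertools import groupby

TOTAL_WITNESSES = 4

# (element name, txt, number of witnesses) for the XX...YY example above.
UNITS = [
    ("c", "XX", 4), ("u", "1", 2), ("u", "b", 1), ("u", "2", 3),
    ("u", "c34g", 1), ("u", "de56", 1), ("u", "a", 1), ("u", "3", 2),
    ("u", "f", 1), ("u", "78", 3), ("u", "90", 1), ("u", "456h90", 1),
    ("c", "YY", 4),
]


def by_name(unit):
    """Same key as group-adjacent="name(.)": one block per run of <u>s."""
    return unit[0]


def by_majority(unit):
    """Hypothetical criterion: a <u> held by most witnesses splits the block."""
    name, _, count = unit
    if name == "c":
        return "c"
    return "anchor" if count * 2 > TOTAL_WITNESSES else "rmd"


def group(units, key):
    """Group adjacent units that share a key value; return their texts."""
    return [[txt for _, txt, _ in run] for _, run in groupby(units, key=key)]


print(len(group(UNITS, by_name)))      # number of blocks with the name key
print(group(UNITS, by_majority))       # finer blocks under the sample criterion
```

Under the name key the eleven <u>s collapse into one block; under the sample criterion "2" and "78" each stand alone and split the rest into three smaller blocks.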

Many such constellation functions could be written. Few would be trivial to write. Few would be widely adopted.

If at the end you go back to simply one <rdgGrp> for each block of adjacent <u>s, keep in mind that some people may complain, because such blanket consolidation would obliterate any RMD constellations.