Controlling the tokenizing? / Order of replacements

ebeshero commented 2 years ago

@Arithmeticus Can you offer a little guidance on how to control the tokenization?

Context: I'm getting some problematic collation output when I'm collating with tags. Here's a sample around a simple name with some bumpy normalized tags in my manuscript witness. We begin with a <p/> marker in all but the MS witness, which has some highlighting going on.

 <u>
         <txt>&lt;p/&gt; &lt;p</txt>
         <wit ref="1818_fullFlat_C27" pos="3533"/>
         <wit ref="Thomas_fullFlat_C27" pos="3545"/>
         <wit ref="1823_fullFlat_C27" pos="3531"/>
         <wit ref="1831_fullFlat_C27" pos="3535"/>
      </u>
      <u>
         <txt> &amp;gt;M&lt;shi rend="sup"&gt;r&lt;</txt>
         <wit ref="msColl_C27" pos="3024"/>
      </u>
      <c>
         <txt>/</txt>
         <wit ref="1818_fullFlat_C27" pos="3540"/>
         <wit ref="Thomas_fullFlat_C27" pos="3552"/>
         <wit ref="1823_fullFlat_C27" pos="3538"/>
         <wit ref="1831_fullFlat_C27" pos="3542"/>
         <wit ref="msColl_C27" pos="3048"/>
      </c>
      <u>
         <txt>&gt;Mr</txt>
         <wit ref="1818_fullFlat_C27" pos="3541"/>
         <wit ref="Thomas_fullFlat_C27" pos="3553"/>
         <wit ref="1823_fullFlat_C27" pos="3539"/>
         <wit ref="1831_fullFlat_C27" pos="3543"/>
      </u>
      <u>
         <txt>shi&gt;</txt>
         <wit ref="msColl_C27" pos="3049"/>
      </u>
      <c>
         <txt>. </txt>
         <wit ref="1818_fullFlat_C27" pos="3544"/>
         <wit ref="Thomas_fullFlat_C27" pos="3556"/>
         <wit ref="1823_fullFlat_C27" pos="3542"/>
         <wit ref="1831_fullFlat_C27" pos="3546"/>
         <wit ref="msColl_C27" pos="3053"/>
      </c>
      <u>
         <txt>Kirwin,</txt>
         <wit ref="1818_fullFlat_C27" pos="3546"/>
         <wit ref="Thomas_fullFlat_C27" pos="3558"/>
         <wit ref="1823_fullFlat_C27" pos="3544"/>
         <wit ref="1831_fullFlat_C27" pos="3548"/>
      </u>
      <u>
         <txt>Kirwin</txt>
         <wit ref="msColl_C27" pos="3055"/>
      </u>

ebeshero commented 2 years ago

I'm reworking this because I realize I don't actually care about <shi> tags in the collation (they just mark superscripts and subscripts in the MS witness). Screening them out with a replacement pattern generated a new wrinkle, this time, I think, due to the order in which the replacements are made:

There's a lot being normalized away ("munched") even from the MS witness in the middle of this sequence:

<c>
         <txt>corpse.</txt>
         <wit ref="1818_fullFlat_C27" pos="3526"/>
         <wit ref="Thomas_fullFlat_C27" pos="3538"/>
         <wit ref="1823_fullFlat_C27" pos="3524"/>
         <wit ref="1831_fullFlat_C27" pos="3528"/>
         <wit ref="msColl_C27" pos="2972"/>
      </c>
      <u>
         <txt>&lt;p/&gt; &lt;p/&gt;Mr. Kirwin, on hearing this evidence, desired that I should be taken into the room where the body lay for interment, that it might be observed what effect the sight of it would produce</txt>
         <wit ref="1818_fullFlat_C27" pos="3533"/>
         <wit ref="Thomas_fullFlat_C27" pos="3545"/>
         <wit ref="1823_fullFlat_C27" pos="3531"/>
         <wit ref="1831_fullFlat_C27" pos="3535"/>
      </u>
      <u>
         <txt> &amp;gt;Mrd&lt;del/&gt;</txt>
         <wit ref="msColl_C27" pos="2979"/>
      </u>
      <c>
         <txt> upon me. This idea was probably suggested by the extreme agitation I had exhibited </txt>
         <wit ref="1818_fullFlat_C27" pos="3726"/>
         <wit ref="Thomas_fullFlat_C27" pos="3738"/>
         <wit ref="1823_fullFlat_C27" pos="3724"/>
         <wit ref="1831_fullFlat_C27" pos="3728"/>
         <wit ref="msColl_C27" pos="2993"/>
      </c>

Problem passage:

 <u>
         <txt> &amp;gt;Mrd&lt;del/&gt;</txt>
         <wit ref="msColl_C27" pos="2979"/>
      </u>

ebeshero commented 2 years ago

Trying to fix this by changing the order of the replacements, to do the little <shi> adjustment before I process <del>s...
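
To illustrate why the order can matter, here is a minimal Python sketch. Both rules are hypothetical stand-ins (the actual replacement patterns aren't shown in this thread), and the `normalize` helper is just for illustration. The point is that a rule written against the raw form of a tag can stop matching once an earlier rule has rewritten that tag.

```python
import re

# Hypothetical rules, for illustration only.
# DROP_SHI expects <shi> to still carry its rend attribute.
DROP_SHI = (r'<shi rend="[^"]*">|</shi>', "")
# STRIP_ATTRS reduces any start tag with attributes to a bare start tag.
STRIP_ATTRS = (r"<(\w+)\s[^>]*>", r"<\1>")


def normalize(text, rules):
    """Apply (pattern, replacement) pairs one after another, in list order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


sample = 'M<shi rend="sup">r</shi>. Kirwin'

# <shi> handled first: both of its tags are removed cleanly.
print(normalize(sample, [DROP_SHI, STRIP_ATTRS]))   # Mr. Kirwin

# Attributes stripped first: DROP_SHI no longer recognizes the start tag,
# so an orphaned <shi> is left behind.
print(normalize(sample, [STRIP_ATTRS, DROP_SHI]))   # M<shi>r. Kirwin
```

Running the narrow, tag-specific rules before the broad ones is one way to avoid this kind of leftover.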

ebeshero commented 2 years ago

...And discovered the unhappy cause in the source document, a stray right angle bracket, which can throw everything off. Fortunately, we can normalize it away as a pattern, early on.

<lb n="c57-0119__main__18"/> &gt;M<shi rend="sup">r</shi>. Kirwin
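
A rough Python sketch of that early normalization step, assuming the stray bracket reaches the collation as the escaped reference `&gt;` inside text content, as in the line above. The function name `drop_stray_gt` is made up for this example, and it removes every `&gt;`, so it only suits a source where none of them are meaningful.

```python
import re


def drop_stray_gt(serialized):
    """Remove escaped right angle brackets (&gt;) from a serialized line.

    Assumption: any &gt; in the witness is a transcription slip, not content.
    """
    return re.sub(r"&gt;", "", serialized)


line = '<lb n="c57-0119__main__18"/> &gt;M<shi rend="sup">r</shi>. Kirwin'
print(drop_stray_gt(line))
# <lb n="c57-0119__main__18"/> M<shi rend="sup">r</shi>. Kirwin
```

Doing this before any tag-handling replacements keeps the stray bracket from being mistaken for the end of a tag later on.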

ebeshero commented 2 years ago

Corrected! Good collation attained!

 <u>
         <txt>&lt;p/&gt; &lt;p/&gt;Mr</txt>
         <wit ref="1818_fullFlat_C27" pos="3533"/>
         <wit ref="Thomas_fullFlat_C27" pos="3545"/>
         <wit ref="1823_fullFlat_C27" pos="3531"/>
         <wit ref="1831_fullFlat_C27" pos="3535"/>
      </u>
      <u>
         <txt> Mr</txt>
         <wit ref="msColl_C27" pos="3004"/>
      </u>
      <c>
         <txt>. </txt>
         <wit ref="1818_fullFlat_C27" pos="3544"/>
         <wit ref="Thomas_fullFlat_C27" pos="3556"/>
         <wit ref="1823_fullFlat_C27" pos="3542"/>
         <wit ref="1831_fullFlat_C27" pos="3546"/>
         <wit ref="msColl_C27" pos="3007"/>
      </c>
      <u>
         <txt>Kirwin,</txt>
         <wit ref="1818_fullFlat_C27" pos="3546"/>
         <wit ref="Thomas_fullFlat_C27" pos="3558"/>
         <wit ref="1823_fullFlat_C27" pos="3544"/>
         <wit ref="1831_fullFlat_C27" pos="3548"/>
      </u>
      <u>
         <txt>Kirwin</txt>
         <wit ref="msColl_C27" pos="3009"/>
      </u>
      <c>
         <txt> on hearing this </txt>
         <wit ref="1818_fullFlat_C27" pos="3553"/>
         <wit ref="Thomas_fullFlat_C27" pos="3565"/>
         <wit ref="1823_fullFlat_C27" pos="3551"/>
         <wit ref="1831_fullFlat_C27" pos="3555"/>
         <wit ref="msColl_C27" pos="3015"/>
      </c>

ebeshero commented 2 years ago

@Arithmeticus Standing question: Can we (or should we be able to) define a tag, matched by something like </?.+?/?>, as an unbreakable token that is never divided up? I'm watching for split tags as a sign of trouble...

Arithmeticus commented 2 years ago

@ebeshero

I am experimenting with this basic token definition, which treats serialized tags as tokens on par with "standard" word tokens:

<token-definition pattern="&lt;/?\i\c*.*?>|[\w‍-[&lt;>]]+" flags=""/>

Is this the sort of basic raw output you're hoping for when you snap to word?

<diff xmlns="tag:textalign.net,2015:ns">
   <common>&lt;xml xml:lang="en"&gt;
   &lt;anchor type="collate" xml:id="C11"/&gt;
        </common>
   <a>&lt;milestone unit="chapter" type="start" n="5"/&gt;</a>
   <b>&lt;milestone unit="chapter" type="start" n="6"/&gt;</b>
   <common>
          </common>
   <a>&lt;head sID="novel1_letter4_chapter5_div4_div5_head1"/&gt;</a>
   <b>&lt;head sID="novel1_letter4_chapter6_div4_div6_head1"/&gt;</b>
   <common>CHAPTER </common>
   <a>V</a>
   <b>VI</b>
   <common>.</common>
   <a>&lt;head eID="novel1_letter4_chapter5_div4_div5_head1"/&gt;</a>
   <b>&lt;head eID="novel1_letter4_chapter6_div4_div6_head1"/&gt;</b>
   <common>
          </common>
   <a>&lt;p sID="novel1_letter4_chapter5_div4_div5_p1"/&gt;</a>
   <b>&lt;p sID="novel1_letter4_chapter6_div4_div6_p1"/&gt; </b>
   <common>C</common>
   <a>&lt;hi sID="novel1_letter4_chapter5_div4_div5_p1_hi1"/&gt;</a>
   <b>&lt;hi sID="novel1_letter4_chapter6_div4_div6_p1_hi1"/&gt;</b>
   <common>LERVAL</common>
   <a>&lt;hi eID="novel1_letter4_chapter5_div4_div5_p1_hi1"/&gt;</a>
   <!--Trimming next 603 nodes (deep skip)-->
</diff>

ebeshero commented 2 years ago

@Arithmeticus @Yuying-Jin Yes indeed, that's the output we were hoping for! (I am sorry I didn't reply to this when you posted, but we're returning to work on it now!)

ebeshero commented 2 years ago

@Yuying-Jin Welcome to the TAN DIFF XSLT experiment on Frankenstein! Let's see what we can do here... :-)

Arithmeticus commented 2 years ago

Just FYI, there is a global parameter, $tan:token-definition-default, in parameters/param-application.xsl that sets the default token definition. You can configure it as you like.

Because you're comparing serialized XML, that definition isn't cutting it. Strange thing is, I have run across the same need at work. I've done some experimentation, and I'm getting better results when I redefine the global parameter as follows:

<xsl:param name="tan:token-definition-default" as="element()">
      <token-definition pattern="&#x3c;[^&#x3e;]+&#x3e;|[\w&#xad;&#x200b;&#x200d;]+|[^\w&#xad;&#x200b;&#x200d;\s]" flags=""/>
</xsl:param>
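
For anyone who wants to see what that pattern does to a string, here is a small Python approximation. It is only a sketch: Python's `re` is not the XPath/XSD regex flavor that TAN uses, so `\w` may behave a little differently at the edges, and the helper names `tokenize` and `naive_tokenize` are invented for this example.

```python
import re

# Same three alternatives as the token-definition above, in Python syntax:
# 1. a whole serialized tag, 2. a run of word characters (plus soft hyphen,
# zero-width space, zero-width joiner), 3. any other single non-space character.
TAG_AWARE = re.compile(r"<[^>]+>|[\w\u00ad\u200b\u200d]+|[^\w\u00ad\u200b\u200d\s]")

# The same definition without the tag alternative, for comparison.
WORD_ONLY = re.compile(r"[\w\u00ad\u200b\u200d]+|[^\w\u00ad\u200b\u200d\s]")


def tokenize(text):
    """Return tokens, keeping each serialized tag as one unbreakable token."""
    return TAG_AWARE.findall(text)


def naive_tokenize(text):
    """Return tokens with no special handling of tags."""
    return WORD_ONLY.findall(text)


print(tokenize("<p/>Mr. Kirwin, on hearing"))
# ['<p/>', 'Mr', '.', 'Kirwin', ',', 'on', 'hearing']

print(naive_tokenize("<p/>"))
# ['<', 'p', '/', '>']
```

Because the tag alternative comes first, a tag is consumed whole before the word and punctuation alternatives get a chance to split it, which is what keeps fragments like `<p` and `/` out of the collation output.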

FrankensteinVariorum / TAN-2021, issue #2