FrankensteinVariorum / fv-postCollation

a repository for post-processing finalized collation files to prepare the Variorum edition.

Setting anchors and reading Left Margin Zone conditions #13

Closed ebeshero closed 1 year ago

ebeshero commented 5 years ago

@raffazizzi @zmbq @mdlincoln I've been working on improving our capacity to match pointers to S-GA left margin zones. In the past couple of days I have been working on an XSLT (using a lot of xsl:analyze-string) to locate, wherever possible, the missing <anchor xml:id="locatorValue"/> elements in the S-GA files where they signal a move to a left margin zone. The goal is to plant as many of these anchors in our collation output as possible, and position them precisely in front of the first <lb> element inside a left margin zone. I have been planting a lot of these anchors successfully.

I'm working on this in an early stage of my pipeline called P1, which holds the variant collation data (the full text of all five versions of Frankenstein, organized in <app> elements)--these are the P1 files that are later transformed into the XML pointers of our Spine files. I'm stopping on left_margin indicators in our P1 files and, wherever possible, looking up their <anchor> locator signals in the original S-GA files. I am now reliably finding them in their most clearly signaled condition: whenever at least one <lb> element from the main zone is co-present with an <lb> from a margin zone in an <app>. I have succeeded in setting <anchor> elements holding locational information just before the first indicator of a left margin in an <rdg> that holds some combination of main surface and margin text. I think that data may help with the XML pointers and string range calculations when you need to trace your way into the S-GA margins and/or back out again to the main zones.
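
A minimal sketch of the planting step described above, written in Python rather than the XSLT itself. It assumes the rdg content is handled as a serialized string in which the flattened S-GA markup is escaped (as in the P1 files), and that the anchor id has already been looked up in the S-GA file. The function name and the regular expressions are hypothetical, not taken from anchorage.xsl.

```python
import re

# Flattened S-GA markup sits in the rdg as escaped text,
# e.g. &lt;lb n="c56-0012__left_margin__1"/&gt;
MARGIN_LB = re.compile(r'&lt;lb n="[^"]*__left_margin__[^"]*"/&gt;')
MAIN_LB = re.compile(r'&lt;lb n="[^"]*__main__[^"]*"/&gt;')


def plant_anchor(rdg_content: str, anchor_id: str) -> str:
    """Insert <anchor xml:id="..."/> before the first left-margin lb marker.

    The content is returned unchanged unless a main-zone lb and a
    left-margin lb are both present (the clearly signaled condition).
    anchor_id is assumed to come from a separate S-GA lookup.
    """
    margin = MARGIN_LB.search(rdg_content)
    if margin is None or MAIN_LB.search(rdg_content) is None:
        return rdg_content
    anchor = f'<anchor xml:id="{anchor_id}"/>'
    return rdg_content[:margin.start()] + anchor + rdg_content[margin.start():]
```

This sketch checks co-presence within a single rdg string; checking across a whole <app> would need the parsed tree instead.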

To see what I've been able to find and the positioning of the anchor information, take a look at this directory in the fv-postCollation repo: https://github.com/PghFrankenstein/fv-postCollation/tree/master/postColl-workspace/anchorMgmt/P1-anchorOut

The files in this directory are peppered with comment tags and anchor elements holding locational signals to their host line element in the main surface, and the specific left-margin zone they point to. I think you may want to survey these files to get a sense of the state of our left margins. (It might help to pull all the <rdg> elements holding <anchor> elements to survey them--I can write up a quick XQuery file to do this if you like.)
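
One way such a survey could look, sketched in Python with the standard library instead of XQuery. It assumes the output files use un-namespaced <app>, <rdg> and <anchor> elements, as in the sample output in this thread; if the real files declare a namespace, the tag names would need qualifying. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the fixed XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def rdgs_with_anchors(xml_text: str):
    """List (app id, witness, [anchor ids]) for every rdg that holds an anchor."""
    root = ET.fromstring(xml_text)
    found = []
    for app in root.iter("app"):
        for rdg in app.iter("rdg"):
            ids = [a.get(XML_ID) for a in rdg.iter("anchor")]
            if ids:
                found.append((app.get(XML_ID), rdg.get("wit"), ids))
    return found
```

Calling this on the text of each file in the output directory would give a flat list of anchored readings to skim.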

Another way to survey the left margin data is by running my anchorage.xsl file (you can run it over any XML file you like b/c it processes a collection of files, but its source directory is P1 in the fv-postCollation workspace). When you run it, you see a long list of console messages that stop over every left margin indicator I've found in our batch of files with the MS witness present (C07 to C10).

This XSLT both plants some anchor elements and reliably surveys the lay of the left-margin landscape. However, it isn't finished yet! The next hurdle is to trace the instances when a short snip of left-margin text (and/or left-margin element) from the fMS witness is the only thing present in an app/rdg[@wit="fMS"]. In those cases, we don't have signals present from the main surface. I'm flagging these locations and stringing all available text (that isn't part of a flattened element) into a variable.

These special short-snip left-margin texts can be divided into two types: 1) those that have significant text strings that can be reliably matched to distinctive passages in S-GA files, and 2) those that do NOT have significant text strings: either they are empty, or they have a single not-very-special character like a punctuation mark.
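
The two types could be told apart roughly as follows. This is only a sketch: it assumes the rdg text has already been read out of the XML, so the flattened tags appear as literal <...> strings. The cutoff of four letters or digits for "significant" text is an arbitrary assumption, and the function name is hypothetical.

```python
import re

# Flattened markup as it appears in the parsed rdg text, e.g. <lb n="..."/>
FLATTENED_TAG = re.compile(r"<[^<>]*>")


def classify_snip(rdg_text: str, min_chars: int = 4) -> str:
    """Return 'case1' for snips with matchable text, otherwise 'case2'."""
    # Drop the flattened tags so only the transcribed text remains.
    text = FLATTENED_TAG.sub(" ", rdg_text)
    # Discard whitespace, punctuation and underscores; keep letters and digits.
    significant = re.sub(r"[\W_]+", "", text)
    return "case1" if len(significant) >= min_chars else "case2"
```

A snip holding only a left-margin lb marker plus a period falls into case 2; one holding "desire of" falls into case 1.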

In case 1, my next step is to query the S-GA files with my text strings, to search for the left margin zones containing that string (using the XPath string() function that looks deep inside descendant nodes). I'm confident I'll be able to locate the left margin zones in question and grab their @corresp attributes, hinged to those <anchor> elements we seek in the main surface. And we can plant an anchor mark ahead of the left margin zone there.
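
A rough Python sketch of that lookup, not the planned XSLT. It assumes the S-GA files are TEI (namespace http://www.tei-c.org/ns/1.0) and that left margin zones look like <zone type="left_margin" corresp="#...">; both details and the function names are assumptions for illustration. Joining itertext() plays the role of XPath string() here, since it gathers the text of all descendant nodes.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # assumed namespace of the S-GA files


def normalize(s: str) -> str:
    """Collapse runs of whitespace so line breaks do not block a match."""
    return " ".join(s.split())


def find_margin_zones(sga_xml: str, snippet: str):
    """Return corresp values of left-margin zones whose text contains snippet."""
    root = ET.fromstring(sga_xml)
    wanted = normalize(snippet)
    hits = []
    for zone in root.iter(TEI + "zone"):
        if zone.get("type") != "left_margin":
            continue
        # Full string value of the zone, including text in nested elements.
        zone_text = normalize("".join(zone.itertext()))
        if wanted and wanted in zone_text:
            hits.append(zone.get("corresp"))
    return hits
```

More than one hit would mean the snippet is not distinctive enough, which is a useful signal for moving it to case 2.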

Case 2 is more challenging. I think we'd need to reach into the first preceding <rdg> that holds a main surface <lb>, look it up in the S-GA and search for an <anchor> signal indicator to a left margin zone in its vicinity. Either that or we just import and unpack the string of the simple text and/or surrounding elements at that point--it's just a white space or a punctuation mark anyway, and maybe it's a deleted period or something tiny though significant.
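
The first option, reaching back to the nearest earlier reading with a main surface line, might be sketched like this in Python. It assumes un-namespaced P1 elements and flattened <lb n="...__main__..."/> markers inside the rdg text. Returning the last such marker in that reading (the one closest to the margin snip) is a design choice of this sketch, and the function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Flattened main-surface line marker as it appears in parsed rdg text.
MAIN_LB = re.compile(r'<lb n="([^"]*__main__[^"]*)"/>')


def preceding_main_lb(root, target_rdg, wit="fMS"):
    """Return the n value of the nearest main-surface lb before target_rdg.

    Walks backwards through earlier rdg elements of the same witness and
    returns the last main lb found in the first one that has any, or None.
    """
    rdgs = [r for r in root.iter("rdg") if r.get("wit") == wit]
    idx = rdgs.index(target_rdg)  # raises ValueError if the rdg is not in root
    for rdg in reversed(rdgs[:idx]):
        hits = MAIN_LB.findall("".join(rdg.itertext()))
        if hits:
            return hits[-1]
    return None
```

The returned line identifier could then be looked up in the S-GA file to search for an <anchor> near that line.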

One last point: I set <anchor> elements (with attributes) in place with the idea of sending some data "up the line" to our Spine files. I could format that information any way that is convenient to those working on constructing pointers--ultimately that's what this is for. As you take a look through my output, think about how you might want to see it or find it in the Spine files, and we can package this however is convenient to us. Hope this helps!

ebeshero commented 5 years ago

Here is an example of output with an anchor set in place before a left margin zone:

<app xml:id="C07e_app91">
   <rdgGrp xml:id="C07e_app91_rg1"
           n="['&lt;del&gt;she&lt;del&gt;by&lt;del&gt;was&lt;del&gt;thedesire', 'of&lt;del&gt;knew', 'how&lt;del&gt;']">
      <rdg wit="fMS">&lt;del sID="c56-0012__main__d2e502"
type="overwritten"/&gt;she&lt;del eID="c56-0012__main__d2e502"/&gt;by&lt;del next="#c56-0012.02"
rend="strikethrough" sID="c56-0012__main__d2e508"/&gt;was&lt;del
 eID="c56-0012__main__d2e508"/&gt;the
<!--ANCHOR MATCH ON PRECEDING AXIS--><anchor xml:id="c56-0012.03"/>&lt;lb n="c56-0012__left_margin__1"/&gt;desire of&lt;lb
n="c56-0012__main__11"/&gt;&lt;del rend="strikethrough"
sID="c56-0012__main__d2e527" xml:id="c56-0012.02"/&gt;knew how&lt;del eID="c56-0012__main__d2e527"/&gt; </rdg>
   </rdgGrp>
   <rdgGrp xml:id="C07e_app91_rg2" n="['through']">
      <rdg wit="f1818">through </rdg>
      <rdg wit="f1823">through </rdg>
   </rdgGrp>
</app>

zmbq commented 5 years ago

I have another idea. What if you add coordinates on the page (x and y) for each element? Then you can quite easily correlate side notes to text, because you know where they are on the page.

Storing page coordinates will also help rendering if you want to be particularly accurate. It may even let you overlay the text on the image of the page (something cool we're doing in a Dead Sea Scrolls project).

If you store these coordinates, all matters of correlation become pretty trivial for visualization. You can still store semantic references between notes and the main text. These will represent your interpretation of the relation between them. Page coordinates are objective.

Itay

On 28 June 2019, 6:55, Elisa Beshero-Bondar notifications@github.com wrote:

ebeshero commented 5 years ago

@zmbq That is cool, and I know @raffazizzi did work on this in other projects like http://research.cch.kcl.ac.uk/proust_prototype/. The S-GA itself expresses a lot of correlation between text and image as a diplomatic edition. But I don’t think it supplies us X and Y coordinates for each element, and preparing image overlays is way out of scope for us. It would also distract from our goal of visualizing not original documents but comparisons between them. In the meeting you missed on Tuesday, our talk of rendering addressed this goal: we are actively resisting a diplomatic view of these documents or even the illusion of one. Rather, the idea is to represent how the semantic flows of the texts compare with one another, not the frozen fixity of marks on surfaces. We are, if you will, “chunk-bound” rather than “surface-bound”. Finally, the coordinate data is simply unavailable. What we need here are the identifiers of specific left margin zones to complete an XPath pointer, and for that S-GA supplies us a series of @xml:ids.