FrankensteinVariorum / fv-collation

first-stage collation processing in the Frankenstein Variorum Project. For post processing and Variorum development, see our GitHub organization: https://github.com/FrankensteinVariorum
https://frankensteinvariorum.github.io/fv-collation/
GNU Affero General Public License v3.0

Standoff Pointers to All Editions #52

Closed by ebeshero 5 years ago

ebeshero commented 6 years ago

@raffazizzi I've been tinkering with XSLT on my way home from DHSI, but I figured I'd better raise some questions here now about what we want for the TEI output for our edition. Here's what I'm thinking:

I'm planning to study the TEI options typically in use for this kind of thing in the Guidelines after I get done with an afternoon meeting today, but if you have some ideas, please post here! I'll add more as I'm working on this. Outputting the individual editions won't be too difficult once I get through parsing the flattened tags, and I hope to have something up by tonight for us to tinker with.

ebeshero commented 6 years ago

Here's a basic idea for the first, super-easy "spinal column" TEI output, holding just collation unit 10 for the moment: https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/blob/Text_Processing/collateXPrep/C10_xmlOutput/teiUp-tester.xml

raffazizzi commented 6 years ago

Hello!

  1. Sounds good, though after we have a (curated) stand-off collation pointing to all the documents, we should be able to generate this TEI by resolving the pointers. (This could be a good test to prove that our approach works)

  2. In the past I have used "semantically weak" elements (though my supervisor wasn't too happy with this choice of words) such as <w> or <seg>. To me <seg> is the closest it gets to HTML's <span>. I don't think we need to leave any information about other sources in there: the collation provides those. Real life example from Early Modern Songscapes:

L638_1.xml

<lg type="stanza">
  <l><hi rend="italic" xml:id="test">THesus,</hi> O <hi rend="italic">Thesus,</hi> hark! but yet in vain;</l>
  <l rend="indent2">A-las <w xml:id="v1">de–-ser-ted</w> I complain;</l>
  <!-- etc. -->
</lg>

BL_53723.xml

<lg type="stanza">
  <l>Theseous! ô theseus! heark! but yet in vaine,</l>
  <l rend="indent2">alas <w xml:id="v1">forsaken</w> I Complaine;</l>
  <!-- etc. -->
</lg>

Standoff collation

<app>
  <rdg wit="#BL_53723">
    <ptr target="BL_53723.xml#v1"/>
  </rdg>
  <rdg wit="#L638_1">
    <ptr target="L638_1.xml#v1"/>
  </rdg>
</app>
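
A minimal Python sketch of what "resolving the pointers" could look like, using the Songscapes example above. Assumptions that are not from the project code: the two witness files are held as in-memory strings (trimmed to the lines that carry the pointed-to word), the elements carry no TEI namespace (as in the quoted snippets), and the names resolve_pointer and resolve_app are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical in-memory stand-ins for the two edition files quoted above.
WITNESSES = {
    "L638_1.xml": (
        '<lg type="stanza"><l rend="indent2">A-las '
        '<w xml:id="v1">de–-ser-ted</w> I complain;</l></lg>'
    ),
    "BL_53723.xml": (
        '<lg type="stanza"><l rend="indent2">alas '
        '<w xml:id="v1">forsaken</w> I Complaine;</l></lg>'
    ),
}

STANDOFF = """<app>
  <rdg wit="#BL_53723"><ptr target="BL_53723.xml#v1"/></rdg>
  <rdg wit="#L638_1"><ptr target="L638_1.xml#v1"/></rdg>
</app>"""


def resolve_pointer(target, documents):
    """Return the text of the element addressed by 'file.xml#id'."""
    filename, _, fragment = target.partition("#")
    root = ET.fromstring(documents[filename])
    for element in root.iter():
        if element.get(XML_ID) == fragment:
            # Join all descendant text so nested markup is flattened.
            return "".join(element.itertext())
    raise KeyError(target)


def resolve_app(app_xml, documents):
    """Map each witness reference in an <app> to the text its <ptr> targets."""
    app = ET.fromstring(app_xml)
    readings = {}
    for rdg in app.findall("rdg"):
        ptr = rdg.find("ptr")
        readings[rdg.get("wit")] = resolve_pointer(ptr.get("target"), documents)
    return readings
```

Calling resolve_app(STANDOFF, WITNESSES) returns a dictionary keyed by witness reference. Real TEI files would also need the TEI namespace handled in the findall and find lookups.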

ebeshero commented 6 years ago

@raffazizzi @Rikkm I've written a lot of XSLT code today, and I think I've got some good output for us to work with from this commit: https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/commit/0006ae6bdab9031cb7c46312201ebbe2d6a23eeb

There are a couple of new directories inside collateXPrep to look at now:

1) standoff-Spine: Look at this first. It's my output of standoff collation pointers into all the files. My output scoops up sga too, just because it's going through all my witnesses, but I figure you can transform this output, maybe by mapping your pointers into it with an ID transform? (I wonder if that's easier than integrating your functions into the XSLT that generated this?)

2) bridge-P2: new edition files with <seg> elements. Currently these elements are self-closed, but they're set around variant passages. https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/tree/Text_Processing/collateXPrep/bridge-P2 I need to take this through a new "phase 3" of the "Bridge Construction" process. My plan for phase 3 is to output a bridge-3 directory with edition files in which I've reconstructed the flattened tags into elements, and then re-set the <seg> elements. That's going to take a lot of play with <xsl:analyze-string>, and I need some sleep before I try it! I think I can get this out tomorrow, though...

My last plan is to run a new Python script over the collation output strings, probably the ones in the standoff-Spine directory, to compute the largest Levenshtein distance among the readings at each variant location and plant that number in an attribute on each <app> element. I'll then need to write some more XSLT to deliver that information to the edition files and their <seg> elements, and we can use it to vary the intensity of the highlighting on variant passages in the output edition files.
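
A rough sketch of how that Levenshtein pass might look in Python. Assumptions that are not from the project: each <app> holds its reading strings directly inside <rdg> elements (the actual standoff-Spine format is not shown in this thread), the attribute name n is only a placeholder, and the function names are hypothetical. The distance is the plain dynamic-programming edit distance, written out to avoid a third-party dependency.

```python
from itertools import combinations
import xml.etree.ElementTree as ET


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def max_pairwise_distance(readings):
    """Largest edit distance between any two readings (0 if fewer than two)."""
    return max(
        (levenshtein(x, y) for x, y in combinations(readings, 2)),
        default=0,
    )


def annotate_app(app, attribute="n"):
    """Store the largest pairwise distance among an <app>'s readings on it."""
    readings = ["".join(rdg.itertext()) for rdg in app.findall("rdg")]
    app.set(attribute, str(max_pairwise_distance(readings)))
    return app
```

For example, annotate_app(ET.fromstring('<app><rdg>alas</rdg><rdg>A-las</rdg></app>')) sets the placeholder attribute on that <app>. A later XSLT step could then read the value when styling the matching <seg> elements.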