FrankensteinVariorum / fv-collation

first-stage collation processing in the Frankenstein Variorum Project. For post processing and Variorum development, see our GitHub organization: https://github.com/FrankensteinVariorum
https://frankensteinvariorum.github.io/fv-collation/
GNU Affero General Public License v3.0

up-conversion to pre-TEI XML for collation #18

Closed ebeshero closed 6 years ago

ebeshero commented 7 years ago

Now that we've completed proof-checking of all three base texts (1818, 1823, and 1831), I'm opening this issue to rough out the steps to prepare the documents for collation. In preparing a test collation on 3 May 2017, working with Ch. 1 of 1818 and 1823 and Chs. 1-2 of 1831, @djbpitt and I determined that it's easier to process the documents with CollateX if they have been "XML-ified", that is, converted from text format to XML. This is largely because our system of pseudo-markup was becoming challenging to parse with regular expressions: we need to screen certain patterns from the collation (page breaks and XML comments we've added, among others), and we also want to preserve certain tags in the output XML from the collation. It turned out on experimentation that it's simply easiest to write legible Python code to process this if we're applying XML parsing rather than increasingly complex regular expression patterns.
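
A minimal sketch of the kind of screening meant here, assuming Python's standard `xml.etree.ElementTree` parser rather than whatever the project's own scripts use. The sample paragraph, its markup, and the function name `collation_text` are invented for illustration. The default parser drops XML comments on its own, so only `pb` needs explicit handling:

```python
import xml.etree.ElementTree as ET

# Invented sample: one paragraph with a page break and an editorial comment.
SAMPLE = (
    '<p>I am by birth a <hi rend="italic">Genevese</hi>;'
    '<pb xml:id="p1" n="2"/> and my family<!--ebb: check spelling-->'
    ' is one of the most distinguished.</p>'
)

def collation_text(element):
    """Return the text inside `element`, skipping pb elements.

    XML comments never reach this function: ElementTree's default
    parser discards them while building the tree.
    """
    parts = []
    if element.tag != "pb" and element.text:
        parts.append(element.text)
    for child in element:
        parts.append(collation_text(child))
        # Text that follows a child (its "tail") belongs to the parent.
        if child.tail:
            parts.append(child.tail)
    return "".join(parts)

print(collation_text(ET.fromstring(SAMPLE)))
# I am by birth a Genevese; and my family is one of the most distinguished.
```

The same screening with regular expressions would need separate patterns for self-closing tags, attributes in any order, and comments, which is the complexity the XML route avoids.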

Basically, we're converting our text files to a simple form of XML that will serve the purposes of collation. The elements in use are:

xml
header
ref @target=<URL> : in header only
resp : in header only
list @type="numbered" : in header only
item : in header only
title @level="m" : in header only
edition : in header only

text
div @type="editIntro" | "novel" | "preface" | "introduction" | "letter" | "chapter" | "frontmatter" 
head
p
ab @rend="center"
hi @rend="italic" | "smallcaps"
pb @xml:id=<xml:id> @n=<integer>
epigraph
cit
quote
lg
l @rend="i2" | "i5"
bibl
note
milestone @xml:id=<xml:id>: this element is to mark aligned units of collation across the three texts.

Attributes:

@level
@type
@rend
@target
@xml:id
@n
ebeshero commented 7 years ago

Updated header: to be adjusted slightly for each of the three texts. Because these separate preliminary XML files may serve as editions in their own right, the header is part of each file. However, it should be ignored for the purposes of collation.

  <header>
    <title level="m">FRANKENSTEIN; OR, THE MODERN PROMETHEUS</title>

    <edition>The Pittsburgh Bicentennial Edition</edition>

    <div type="editIntro">
      <head>INTRODUCTORY NOTE ON THE TEXT</head>

      <p>This is a digital edition prepared for the Frankenstein Bicentennial project, commemorating the 200th anniversary of the first published edition of <hi rend="italic">Frankenstein; or, the
          Modern Prometheus</hi> in
        1818.</p>

      <p><title level="m">Frankenstein; or, the Modern Prometheus: Pittsburgh Bicentennial
          Edition</title> is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike
        4.0 International License.</p>

      <p>Date this text was produced: 2017-05-20</p>

      <p>This edition of the<!--ebb: IDENTIFY WHICH: [1818 | 1823 | 1831] --> text is part of the Pittsburgh research team’s contribution to the Bicentennial
        Frankenstein Project, and is prepared by Elisa Beshero-Bondar of the University of
        Pittsburgh at Greensburg and Rikk Mulligan of Carnegie Mellon University. We are grateful
        for consultation from Wendell Piez, David J. Birnbaum, and Raffaele Viglianti, as well as
        Neil Fraistat and Dave Rettenmaier. This edition’s stages of development are stored and
        documented in the <ref target="https://github.com/ebeshero/Pittsburgh_Frankenstein/"
          >Pittsburgh_Frankenstein GitHub repository</ref>.</p>

      <resp>We have produced this XML edition for two purposes:</resp>
      <list type="numbered">
        <item>To prepare for automated collation of the 1818, 1823, and 1831 editions of <hi
            rend="italic">Frankenstein</hi> using CollateX, in order to generate a TEI XML document
          that stores the variations of these texts.</item>

        <item>To provide a reliable digital base text of each edition tractable for collation, as a
          stage in preparing a TEI P5 edition, and as a basis for related projects.</item>
      </list>

<!--ebb: REMOVE THE NEXT PARAGRAPH FOR THE 1823 TEXT-->
      <p>This edition is one of two that share the same electronic source, representing the 1818 and
        1831 editions of the novel. This pair of editions is based on the Pennsylvania Electronic
        Edition of <hi rend="italic">Frankenstein; or, the Modern Prometheus</hi> by Mary Shelley,
        edited by Stuart Curran and assisted by Jack Lynch, located at <ref
          target="http://knarf.english.upenn.edu/">http://knarf.english.upenn.edu/</ref> and
        hereafter referred to as PA EE. Elisa Beshero-Bondar and Rikk Mulligan have corrected these
        texts against photo facsimiles of the 1818 and 1831 publications.</p>

      <!--<p>Our plain text edition preserves the rendering of italics, square brackets, and centered
        text from the PA EE HTML texts.</p>
-->
      <p>We have added page breaks using the TEI self-closing pb element. We will not be preserving
        the running headers but will retain page number in order to link to page images in future
        iterations of the text.</p>

      <!--<p>To track pages from our source images we will be using the TEI self-closing milestone
        element with "unNum" as the marker.</p>-->

      <list>
        <!-- <item>In the PA EE there is no distinction between italics for titles and italics for
          emphasized words. Italics have been converted to hi elements with @rend="italic"</item>-->

        <item>We preserved italics and small caps.</item>

        <item>We have not normalized the spelling, and where the PA EE did so, we have silently
          restored the spelling to its original state as observed in the photo facsimile
          nineteenth-century editions. To prevent errors in collation, we have commented out editor
          notes written by us recording our observations of errors in the source text. We have also
          commented out the one instance in the 1831 PA EE HTML in which square brackets were used
          to hold a normalized variant of a word, to suppress that from the output.</item>

        <!--<item>When centered text is the heading of a chapter or book, we render it inside head
          elements. Otherwise we use hi @rend="center".</item>-->
      </list>
      <!--<note resp="#ebb">Note for later processing: In the PA EE of this text, there are 1337 encoded links, each pointing to an editorial annotation.</note> -->
    </div>
  </header>
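
One way to leave the header out of collation, sketched with Python's standard `xml.etree.ElementTree`. The abbreviated sample file and the function name `text_for_collation` are assumptions for illustration; only the `xml`, `header`, and `text` element names come from the list above:

```python
import xml.etree.ElementTree as ET

# Abbreviated, invented sample in the shape of the files described here.
SAMPLE = """<xml>
  <header>
    <title level="m">FRANKENSTEIN; OR, THE MODERN PROMETHEUS</title>
    <edition>The Pittsburgh Bicentennial Edition</edition>
  </header>
  <text>
    <head>CHAPTER I.</head>
    <p>I am by birth a Genevese.</p>
  </text>
</xml>"""

def text_for_collation(xml_string):
    """Return whitespace-normalized text from <text> only, ignoring <header>."""
    root = ET.fromstring(xml_string)
    body = root.find("text")
    if body is None:
        raise ValueError("no <text> element found")
    # itertext() walks every text node under <text>; split/join collapses whitespace.
    return " ".join("".join(body.itertext()).split())

print(text_for_collation(SAMPLE))
# CHAPTER I. I am by birth a Genevese.
```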
ebeshero commented 7 years ago

Decisions:

Example:

<anchor type="collate" xml:id="C1"/>
ebeshero commented 7 years ago

Decisions:

ebeshero commented 7 years ago

Consultation with @djbpitt on 22 May: We decided we need to flatten the XML hierarchy so that we can process the files in collation "chunks", such that each chunk can be extracted by itself as a well-formed XML file.

The problem we have is that units of text to be collated represent pieces of larger elements (e.g. the divs that hold vols. 1 and 2 in 1818 close and open inside a collation unit, or a div doesn't open in the unit but closes there).

The solution is to flatten the hierarchy by using self-closing elements. For this I'm applying the <milestone/> element with a @unit indicating letter, chapter, volume, preface, etc., and an @n to indicate its count in sequence. The file structure is much simpler, the information about text structure is preserved, and we are now using fewer elements. We are also, by necessity, only creating collation units inside the text portion of the full file, which simplifies the collation process further.
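
A rough sketch of the flattening step, assuming the earlier nested `div` markup as input and Python's standard `xml.etree.ElementTree`. The sample, the function names `flatten` and `flatten_text`, and the choice to number each @unit with a single running counter across the file are all assumptions made for illustration. The sketch also ignores any text sitting directly inside a `div`:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample: a volume div holding two chapter divs.
SAMPLE = (
    '<text><div type="volume">'
    '<div type="chapter"><head>CHAPTER I.</head><p>First.</p></div>'
    '<div type="chapter"><head>CHAPTER II.</head><p>Second.</p></div>'
    '</div></text>'
)

def flatten(parent, counts=None):
    """Return parent's children with each div replaced by a milestone marker."""
    counts = Counter() if counts is None else counts
    flat = []
    for child in list(parent):
        if child.tag == "div":
            unit = child.get("type", "div")
            counts[unit] += 1  # assumption: one running count per unit type
            flat.append(ET.Element("milestone", {"unit": unit, "n": str(counts[unit])}))
            flat.extend(flatten(child, counts))  # hoist the div's contents
        else:
            flat.append(child)
    return flat

def flatten_text(xml_string):
    """Parse a nested file and serialize a flat version of it."""
    root = ET.fromstring(xml_string)
    flat_root = ET.Element(root.tag, root.attrib)
    flat_root.extend(flatten(root))
    return ET.tostring(flat_root, encoding="unicode")

print(flatten_text(SAMPLE))
```

Because every structural marker is now an empty element, a chunk cut between any two siblings of the flat file remains well-formed once it is given a wrapper element.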

The list of elements to be processed with collation is now:

pb  (ignore in collation but include intact in output)
comment (ignore in collation but include intact in output)

anchor
milestone
include
head
p
hi
ab
cit
quote
lg
l
note
bibl

Except for pb and comment, which are marked above as ignored, these elements are to be included in the collation process.
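
A quick check that a collation chunk uses only the elements listed above might look like the following sketch. It assumes Python's standard `xml.etree.ElementTree`; the wrapper element `chunk`, the sample content, and the function name `unexpected_elements` are hypothetical. Comments need no entry in either set because the default parser drops them:

```python
import xml.etree.ElementTree as ET

# Element names taken from the list above.
IGNORED_IN_COLLATION = {"pb"}
COLLATED = {
    "anchor", "milestone", "include", "head", "p", "hi", "ab",
    "cit", "quote", "lg", "l", "note", "bibl",
}

# Invented sample chunk with a leftover div that should have been flattened.
SAMPLE = (
    '<chunk><anchor type="collate" xml:id="C1"/>'
    '<milestone unit="chapter" n="1"/><head>CHAPTER I.</head>'
    '<!--ebb: sample comment-->'
    '<p>Text <pb xml:id="p2" n="2"/>more <div>oops</div></p></chunk>'
)

def unexpected_elements(xml_string):
    """Return sorted names of descendant elements not in the agreed list."""
    root = ET.fromstring(xml_string)
    allowed = IGNORED_IN_COLLATION | COLLATED
    return sorted(
        {el.tag for el in root.iter() if el is not root and el.tag not in allowed}
    )

print(unexpected_elements(SAMPLE))
# ['div']
```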