FrankensteinVariorum / fv-postCollation

a repository for post-processing finalized collation files to prepare the Variorum edition.

teiHeader encodingDesc for the TEI Spine Files #34

Open ebeshero opened 1 year ago

ebeshero commented 1 year ago

@raffazizzi @Yuying-Jin @djbpitt As we're finalizing our Spine files, I'm working on updating our very rough <teiHeader> elements to properly document what we're doing in the project. I've been working on an <encodingDesc> to describe some of our pipeline process and to document the way our links from the Variorum to the Shelley-Godwin Archive are being signaled in our new and improved Spines. This is a rough draft. Can you help me modify it? @djbpitt, we are acknowledging your assistance in modifying the collateX algorithm here, so please let me know what you think of how we're representing you in our spine!

 <encodingDesc>
         <editorialDecl>
            <p>This is a <q>spine</q> file representing a standoff critical apparatus for one of 33
               aligned collation units or <q>chunks</q> that start and end in parallel across five
               versions of the novel <title level="m">Frankenstein</title>. It is produced from a
               machine-assisted process working with the collation software collateX, but our output
               is not standard. CollateX calculates alignments and variations by dividing
               documents into tokens (words or characters), and it allows its users to
               develop a normalization algorithm indicating that some strings, like <q>&amp;</q>,
               should be read as <q>and</q> by the software. The normalization mechanism afforded by
               collateX permits us to compare markup of chapter and paragraph boundaries in simpler
               forms. For example, it allows for TEI surface-and-zone markup of paragraphs in
                  <gi>milestone</gi> elements to be normalized as identical with <gi>p</gi> elements
               used in other editions. We have developed a very complex, lengthy series of
               normalizations
               <!--ebb: documented on our website or Jupyter notebook?: provide link here -->, and
               we want to expose them in our TEI representation of the comparison data in our
               project.</p>
            <p>Our output of this spine file is not standard for collateX because we are
               purposefully sharing an array of normalized tokens in an <att>n</att> attribute on
               each <gi>rdgGrp</gi> element. This allows us to indicate the basis on which the
               witnesses align when they are frequently not identical character-by-character. We are
               grateful to David J. Birnbaum in his role as a member of the collateX team for
               assisting us by locally editing the collateX 1.7.1 Python library for our project to
               expose the normalized tokens in our TEI Variorum Spine files.</p>
            <p>Additionally, we developed a post-collation processing pipeline with XSLT and Python
               to calculate Levenshtein (or edit-distance) values for each pair-wise comparison
               possible at each moment of variation represented in an <gi>app</gi>, and we output the
               maximum of these values in the <att>n</att> attribute on each <gi>app</gi>
               element.</p>
            <p>Produced from collation output prepared in batch file processing on 2023-05-18
               07:39:49.913777.</p>
            <p>Edited to correct alignments and prepared for the Frankenstein Variorum spine on
               2023-07-14T02:43:29.592648-04:00.</p>
         </editorialDecl>
         <appInfo>
            <application ident="collateX" version="1.7.1">
               <label>collateX</label>
            </application>
         </appInfo>
         <listPrefixDef>
            <desc>When the manuscript notebook witness shows content, the <gi>witDetail</gi> element
               includes a <att>target</att> attribute that can be used to construct links to the
               Shelley-Godwin Archive website display of the particular page on which this appears.
               We also provide links directly to the XML documents together with XPath and
               string-range data that should point to the specific location in the Shelley-Godwin
               Archive TEI for this page. The prefixDef below indicates how to construct the links: </desc>
            <prefixDef ident="s-ga" matchPattern="(c\d+/#/p\d+)"
               replacementPattern="https://shelleygodwinarchive.org/sc/oxford/ms_abinger/$1">
               <p>For example, s-ga:c57/#/p73 resolves to the URL <ref
                     target="https://shelleygodwinarchive.org/sc/oxford/ms_abinger/c57/#/p73"
                     >https://shelleygodwinarchive.org/sc/oxford/ms_abinger/c57/#/p73</ref> linking
                  to the webpage that represents page 73 in box 57 of the Oxford Abinger
                  notebooks.</p>
            </prefixDef>
         </listPrefixDef>
      </encodingDesc>
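
To make the `<prefixDef>` easier to review, here is a minimal Python sketch of how a processor might expand an `s-ga:` pointer into a full URL. The `matchPattern` and `replacementPattern` values are copied from the draft above; the function name `resolve_pointer` and its error handling are hypothetical and are not part of the project pipeline.

```python
import re

# Values copied from the <prefixDef> in the draft encodingDesc.
IDENT = "s-ga"
MATCH_PATTERN = r"(c\d+/#/p\d+)"
REPLACEMENT_PATTERN = "https://shelleygodwinarchive.org/sc/oxford/ms_abinger/$1"


def resolve_pointer(pointer: str) -> str:
    """Expand a prefixed pointer such as 's-ga:c57/#/p73' into a full URL.

    Hypothetical helper for illustration only.
    """
    prefix, sep, local = pointer.partition(":")
    if not sep or prefix != IDENT:
        raise ValueError(f"pointer does not use the {IDENT} prefix: {pointer!r}")
    match = re.fullmatch(MATCH_PATTERN, local)
    if match is None:
        raise ValueError(f"local part does not match matchPattern: {local!r}")
    # TEI replacement patterns write capture groups as $1, $2, ...;
    # substitute each one with the text captured by matchPattern.
    return re.sub(
        r"\$(\d)",
        lambda m: match.group(int(m.group(1))),
        REPLACEMENT_PATTERN,
    )


print(resolve_pointer("s-ga:c57/#/p73"))
# prints https://shelleygodwinarchive.org/sc/oxford/ms_abinger/c57/#/p73
```

The printed URL matches the worked example inside the `<prefixDef>`, which is a quick way to confirm that the pattern and the prose example agree.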

ebeshero commented 1 year ago

^^^^ I've edited this a bit since I posted, so please read here on GitHub rather than in your e-mail. :-) Thanks, all!

Also, we have more work to do on our TEI headers for edition files and spine files, as well as in preparing a proper TEI ODD for the project, but I think this will come easily for us as the code is "settling down." I want to make sure we credit all our student assistants, and also represent everyone involved in the multiple stages of this project from the beginning to now. This is just a starting point.
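
For reviewers who want to check the third paragraph of the draft `<editorialDecl>`, the value described there (the maximum Levenshtein distance over every pair of readings in one `<app>`) can be sketched in Python as follows. This is a rough illustration, not the project's actual post-collation code: the function names are hypothetical, and it assumes each reading has already been reduced to a single plain string.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def max_pairwise_distance(readings: list[str]) -> int:
    """Largest edit distance over all pairs of readings (0 if fewer than two)."""
    return max(
        (levenshtein(a, b) for a, b in combinations(readings, 2)),
        default=0,
    )


# Hypothetical readings for one moment of variation.
print(max_pairwise_distance(["and", "&", "and"]))  # prints 3
```

With five witnesses there are at most ten pairs per `<app>`, so the straightforward all-pairs comparison is cheap; the resulting maximum is what the draft says is written to the `n` attribute on `<app>`.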