FrankensteinVariorum / fv-collation

First-stage collation processing in the Frankenstein Variorum Project. For post-processing and Variorum development, see our GitHub organization: https://github.com/FrankensteinVariorum
https://frankensteinvariorum.github.io/fv-collation/
GNU Affero General Public License v3.0

TSV Files for Analysis #32

ebeshero closed this issue 5 years ago

ebeshero commented 6 years ago

Hi @scottbot and @mjlavin80! I've finally output the TSV files you requested earlier this month. Take a look in this directory on the master branch of this repo: https://github.com/ebeshero/Pittsburgh_Frankenstein/tree/master/analysis

NOTE: These files aren't exactly what you asked for, but they're close. You asked for:

  1. Comparisons of additions/removals:
     - 6 plaintext files. Each new line contains a chunk of text that was added between the earlier and later edition. These files only show text that was added, excluding all text that remained the same.
       - Manuscript vs 1stEd
       - Manuscript vs 2ndEd
       - Manuscript vs 3rdEd
       - 1stEd vs 2ndEd
       - 1stEd vs 3rdEd
       - 2ndEd vs 3rdEd
     - 6 plaintext files. Each new line contains a chunk of text that was removed between the earlier and later edition. These files only show text that was removed, excluding all text that remained the same.
       - Manuscript vs 1stEd
       - Manuscript vs 2ndEd
       - Manuscript vs 3rdEd
       - 1stEd vs 2ndEd
       - 1stEd vs 3rdEd
       - 2ndEd vs 3rdEd
  2. Percy's vs Mary's hand (based on all Shelley-Godwin manuscript XML from their GitHub page). On the Shelley-Godwin Archive page & GitHub, there are a bunch of manuscripts/notes in XML files that mark Percy's vs Mary's hand. We need:
     - A big file with everything written in Percy's hand
     - A big file with everything written in Mary's hand

Working backwards: 1) You've got a nice file of everything in Percy's hand: it's a TSV in which I output the tagname, then a tab, and then the text. Sometimes there were nested tags inside, so I just kept outputting the tagnames and text on separate lines in case the element names are useful (usually indicating insertions and deletions of various kinds).
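For anyone wanting to reproduce something like the Percy's-hand TSV, here is a minimal Python sketch of the output shape described above: one row per run of text, with the tagname, a tab, and the text, and nested tags emitted as their own rows. It is not the process actually used for the files in the analysis directory. The `hand="#pbs"` attribute value, the toy XML snippet, and the function names `hand_rows` and `to_tsv` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Toy snippet invented for this example; not taken from the manuscript files.
SAMPLE = '<line>a <add hand="#pbs">new <del>old</del> word</add> here</line>'


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def rows_for(elem):
    """Yield (tagname, text) rows for elem and its descendants in document order."""
    if elem.text and elem.text.strip():
        yield local_name(elem.tag), elem.text.strip()
    for child in elem:
        yield from rows_for(child)
        # Text after a nested tag belongs to the enclosing element again.
        if child.tail and child.tail.strip():
            yield local_name(elem.tag), child.tail.strip()


def hand_rows(root, hand="#pbs"):
    """Collect rows for every outermost element whose @hand matches (assumed markup)."""
    rows = []

    def walk(elem):
        if elem.get("hand") == hand:
            rows.extend(rows_for(elem))
        else:
            for child in elem:
                walk(child)

    walk(root)
    return rows


def to_tsv(rows):
    """Join rows as 'tagname<TAB>text', one per line."""
    return "\n".join(f"{tag}\t{text}" for tag, text in rows)


print(to_tsv(hand_rows(ET.fromstring(SAMPLE))))
```

On the toy snippet this prints three rows (`add`, `del`, `add`), which is the "tagnames and text on separate lines" pattern. Real files that signal a change of hand some other way would need different selection logic.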

2) For Mary, I tried spitting out a file of everything that isn't in Percy's hand, and it's messier because there's a lot more of it in the ms notebook files. I've only output a file representing the ms notebooks here. Note: For future consideration, a small complicating factor in hand identification is that in later editions (i.e. the 1823), Mary's father, William Godwin, made some small edits--we don't have his hand distinguished from Mary's in this project. (That'd require a trip to the archives...) From what I'm seeing of the collation output from the 1823 and 1818 texts, there really don't seem to be many interesting differences between these two editions.

3) You wanted 12 TSV files: 6 to represent additions and 6 to represent removals. I tried wrapping my head around this, but settled on 6 output files that in some respects show both. In practice I'm not sure I can locate additions AND removals separately. It's complicated because a) the tagging in the ms notebook XML is lumpy, so we see add and delete elements sometimes in sequence, and b) when we collate with the ms notebooks, the critical apparatus becomes super small--often around a word or two, and those words might be inside a deletion or insertion that's much longer. Also c) @Rikkm and I discovered there are some long passages of apparently deleted text (i.e. text with a big line drawn down the middle of it) that show up in the 1818 edition as if they weren't deleted at all.

That said, I think we could parse the 6 TSV files, each of which compares one edition to one other, to identify where things are missing in one edition but present in the other. I think you may want to inspect what I've output first, and we should look this over together when we next meet.
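As a rough idea of what that parsing might look like, here is a hedged Python sketch. It assumes a two-column layout (earlier reading, then later reading, with an empty cell meaning the text is absent from that edition) and a header row. That layout is a guess for illustration, not the documented format of the files in the analysis directory, and `classify` and the sample rows are hypothetical.

```python
import csv
import io

# Invented rows in an ASSUMED layout: earlier reading <TAB> later reading.
SAMPLE_TSV = (
    "earlier\tlater\n"
    "same words\tsame words\n"
    "\tonly in later\n"
    "only in earlier\t\n"
    "old reading\tnew reading\n"
)


def classify(tsv_text):
    """Sort comparison rows into added, removed, and changed readings."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the (assumed) header row
    result = {"added": [], "removed": [], "changed": []}
    for row in reader:
        # Pad short rows so a missing cell is treated as empty.
        earlier, later = (cell.strip() for cell in (row + ["", ""])[:2])
        if earlier == later:
            continue  # unchanged text, or a blank line
        if not earlier:
            result["added"].append(later)
        elif not later:
            result["removed"].append(earlier)
        else:
            result["changed"].append((earlier, later))
    return result


for kind, readings in classify(SAMPLE_TSV).items():
    print(kind, readings)
```

Rows where both cells are filled but differ go into a third "changed" bucket, since a substitution is neither a pure addition nor a pure removal. That is one way to sidestep the ambiguity described in point 3.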

I created the group of 6 "comparison TSVs" from distinct collations--that is, I ran CollateX over just two files at a time to process the output. I think you'll see why I did that as you look at the output: the collation algorithm chunks the comparison segments into tiny atomized pieces when the ms notebook is involved, but into longer batches when the print editions are compared with one another. I think these TSVs are pretty illuminating of which editions most closely resemble one another!
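To make the "two files at a time" setup concrete, here is a small Python sketch that enumerates the six witness pairs and compares each pair token by token. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for the collation step, so it is not CollateX and will not reproduce its chunking behavior. The witness texts are placeholder strings, and `compare` is a hypothetical helper.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Placeholder texts standing in for the four witnesses; not real readings.
WITNESSES = {
    "Manuscript": "one two three four",
    "1stEd": "one two three four five",
    "2ndEd": "one two three four five",
    "3rdEd": "one 2 three four five six",
}

# Every unordered pair of four witnesses: 4 choose 2 = 6 comparisons.
PAIRS = list(combinations(WITNESSES, 2))


def compare(earlier, later):
    """Return (earlier_reading, later_reading) for each span where the texts differ."""
    a, b = earlier.split(), later.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    rows = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            # An empty string on one side means the text is absent there.
            rows.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return rows


for left, right in PAIRS:
    print(f"{left} vs {right}: {compare(WITNESSES[left], WITNESSES[right])}")
```

The pairs come out in the same order as the requested list (Manuscript vs 1stEd through 2ndEd vs 3rdEd). A pair with an empty difference list is the toy analogue of two editions that closely resemble one another.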