FrankensteinVariorum / fv-collation

First-stage collation processing in the Frankenstein Variorum Project. For post-processing and Variorum development, see our GitHub organization: https://github.com/FrankensteinVariorum

https://frankensteinvariorum.github.io/fv-collation/

GNU Affero General Public License v3.0

Adding word-boundary markup to Pgh-SGA Notebook XML #27

Closed: ebeshero closed 6 years ago

ebeshero commented 7 years ago

@Rikkm I've now prepared our workspace in the Text_Processing branch for adding word-boundary markup to those words broken at the ends of lines. And I've prepared and applied a schema to help guide our markup and prevent mistakes.

First, the files for us to work on are these (in the Text_Processing branch):

1) That first full notebook witness of the entire novel (c56 and c57): https://github.com/ebeshero/Pittsburgh_Frankenstein/blob/Text_Processing/collateXPrep/sga_Notebooks/msCollPrep_c56_c57.xml

2) That second small notebook witness with an alternate copy of the novel's ending (c58): https://github.com/ebeshero/Pittsburgh_Frankenstein/blob/Text_Processing/collateXPrep/sga_Notebooks/msCollPrep_c58.xml

We'll use a self-closing (empty) w element, and use it as a flag whenever we see a word broken around a line boundary. Place one right in front of the start of the broken word (no white space between the tag and the word), and give it an @ana="start". And place one at the end of the broken word (again no white space) and give it an @ana="end".

<line>preceding words <w ana="start"/>Natu</line>
<line>ral<w ana="end"/> following words</line>

Here's a sample of how to mark the words when they're broken around line elements:

<line><del rend="strikethrough" xml:id="c56-0005.03">may appear my fate had been</del>
      <del rend="strikethrough">Chemist</del><anchor xml:id="c56-0005.01"/> <w ana="start"/>Natu</line>
    <line>ral<w ana="end"/> philosophy <del rend="strikethrough">has</del> is the genius that has</line>
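A quick cross-check of the flags can be sketched in Python, alongside the schema rather than instead of it. This is only an illustration under stated assumptions: the function name `check_w_markers` and the `<zone>` wrapper in the sample are hypothetical, the input is assumed to carry no XML namespace (the real files may declare one, in which case the element name would need qualifying), and only the white space after a start flag is tested.

```python
import xml.etree.ElementTree as ET


def check_w_markers(xml_text):
    """Return a list of problems in the sequence of <w ana="..."/> flags."""
    root = ET.fromstring(xml_text)
    problems = []
    open_start = False  # True while a start flag is waiting for its end flag
    # iter() walks the tree in document order, so flags are seen as written.
    for position, w in enumerate(root.iter("w"), start=1):
        ana = w.get("ana")
        if len(w) or (w.text or "").strip():
            problems.append(f"w #{position}: flag should be an empty element")
        if ana == "start":
            if open_start:
                problems.append(f"w #{position}: start flag before the previous word ended")
            if (w.tail or "")[:1].isspace():
                problems.append(f"w #{position}: white space after start flag")
            open_start = True
        elif ana == "end":
            if not open_start:
                problems.append(f"w #{position}: end flag without a start flag")
            open_start = False
        else:
            problems.append(f"w #{position}: unexpected ana value {ana!r}")
    if open_start:
        problems.append("last start flag is never ended")
    return problems


# Hypothetical wrapper element around the two sample lines from this issue.
sample = """<zone>
<line>preceding words <w ana="start"/>Natu</line>
<line>ral<w ana="end"/> following words</line>
</zone>"""

print(check_w_markers(sample))
```

An empty list means every start flag is matched by a following end flag; anything else names the flag by its position in the file.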

As we see other stuff to try to mark, ask questions here and we'll figure out a way!

@raffazizzi (copying you here so you're in the loop. I've added a couple of rules to the ODD over here in the Pittsburgh repo--not intended for the SGA repo--just for us to add word boundary markup.)

ebeshero commented 6 years ago

@Rikkm and @ebeshero will (decided 2017-09-11) work on marking word boundaries and, as we go, compile a list in Issue 28 of weirdly spelled words to normalize. @Rikkm takes the first "half" (c56) and @ebeshero the second (c57).

ebeshero commented 6 years ago

@Rikkm I've just split the msCollPrep_c56_c57.xml file into two separate files (msCollPrep_c56 and msCollPrep_c57). That should make it easier for the two of us to add in the word boundary markup we've been discussing (see explanation/example above in this GitHub issue), since we can work in separate files, hopefully without much issue with conflicts. As I recall from our meeting today, you'd be taking on c56 and I'll take on c57. (Reminder to us both: we're working on the Text_Processing branch for this.)

It would be wonderful if we can get that word boundary markup in by early October ahead of our next meeting, or at least have some large blocks of it done so we can experiment with collating some pieces in October. (That said, we're both busy and this could easily get away from us. We'll try--let's see how far we can get through this.)

As we're working, we can be adding words to normalize to our list in Issue 28.

ebeshero commented 6 years ago

@Rikkm Just a quick note to say I'm working on a round of this markup over the next few hours today. Will report if I spot anything strange!

ebeshero commented 6 years ago

@Rikkm Houston, we have a problem! You're VERY likely to see this, too. I'll open a new issue about it: it's to do with the way margin notes are added--out of sequence with the flow of the text. I think we can simply move stuff to its proper location. I'll open a new issue and show some code.

ebeshero commented 6 years ago

Note to self (@ebeshero) : I'm leaving this issue open until I've finished my part of the word boundary markup.

ebeshero commented 6 years ago

To complete documentation on this issue: The problem I raised here on 2017-09-24 about margin zones out of sequence with the semantic flow of the text was resolved as described in https://github.com/ebeshero/Pittsburgh_Frankenstein/issues/29 . We have a sequence of XSLT transformations to run to "migrate" the zones to their associated insertion points:

1) https://github.com/ebeshero/Pittsburgh_Frankenstein/blob/master/collateXPrep/sga_Notebooks/Id_Trans-sga-MarginZonesP1.xsl

2) https://github.com/ebeshero/Pittsburgh_Frankenstein/blob/master/collateXPrep/sga_Notebooks/Id_Trans-sga-MarginZonesP2.xsl

3) https://github.com/ebeshero/Pittsburgh_Frankenstein/blob/master/collateXPrep/sga_Notebooks/Id_Trans-sga-MarginZonesP3.xsl

ebeshero commented 6 years ago

Here is a summary of changes we are making in our "Pittsburgh bridge" edition of the S-GA notebook files for the purposes of collation. All such work is taking place in the collateXPrep --> sga_Notebooks directory.

1) We are marking words broken at ends of lines using <w> elements as "milestone markers" with <w ana="start"/> and <w ana="end"/>. When there are hyphens marking broken words, we are deleting these (as pseudomarkup) so they do not interfere with collation. For the purposes of collation the construction <w ana="start"/>emu-<lb/>late<w ana="end"/> is identical to <w ana="start"/>emu<lb/>late<w ana="end"/> and emulate.

2) Where necessary to disentangle positional transcription (as on c57-0031.10 or http://shelleygodwinarchive.org/sc/oxford/ms_abinger/c57/#/p31 ), we are manually moving interlinear additions broken and disrupted by lineation into sequence to prioritize the semantic flow of the text. (These are alterations not marked in margin zones, for which see 3 below.) All such manual alterations are documented on location in dated XML comment tags in the files msCollPrep_c56.xml, msCollPrep_c57.xml, and msCollPrep_c58.xml.

3) With an XSLT pipeline we are migrating the markup transcribing margin zones into position at their marked insertion points. (This is facilitated by S-GA's location indicators.)

4) With another XSLT we are "flattening" <line>....</line> elements to <lb/> elements, to preserve these for information about their location. We also run XSLT to "flag" the specific location of each line on its page to facilitate stand-off intersection with S-GA files.
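To make points 1 and 4 concrete, here is a rough Python sketch of the two string-level effects: flattening <line> containers to <lb/> milestones, and reading a flagged broken word as one token for comparison. It is not the project's XSLT. The function names are hypothetical, the regular expressions assume the flags are written exactly as `<w ana="start"/>` and `<w ana="end"/>`, and placing each <lb/> at the start of its line is an assumption.

```python
import re

# Matches the text between a start flag and the next end flag.
BROKEN_WORD = re.compile(r'<w ana="start"/>(.*?)<w ana="end"/>', re.DOTALL)


def flatten_lines(xml_text):
    """Replace each <line ...> start tag with <lb/> and drop </line>."""
    text = re.sub(r"<line(\s[^>]*)?>", "<lb/>", xml_text)
    return text.replace("</line>", "")


def join_broken_words(xml_text):
    """Collapse each flagged broken word into a single plain-text token."""

    def join(match):
        inner = match.group(1)
        # Tolerate a leftover line-end hyphen directly before the break.
        inner = re.sub(r"-\s*(?=<lb\s*/>)", "", inner)
        # Drop any tags inside the word, then any white space left behind.
        inner = re.sub(r"<[^>]+>", "", inner)
        return re.sub(r"\s+", "", inner)

    return BROKEN_WORD.sub(join, xml_text)


hyphenated = '<w ana="start"/>emu-<lb/>late<w ana="end"/>'
plain = '<w ana="start"/>emu<lb/>late<w ana="end"/>'
print(join_broken_words(hyphenated))  # emulate
print(join_broken_words(plain))       # emulate

lines = ('<line>preceding words <w ana="start"/>Natu</line>\n'
         '<line>ral<w ana="end"/> following words</line>')
print(join_broken_words(flatten_lines(lines)))
```

Joining discards the <lb/> inside the word, so this form suits comparison only; the flattened file itself keeps every <lb/> for location information.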

ebeshero commented 6 years ago

1816 Frankenstein MS notebooks: Summary of tasks completed:

[x] c56: word boundaries marked and collation units defined. Line markup is flattened and collation units are "chunked" with XSLT and used in collation.
[x] c57: word boundaries marked and collation units defined ONLY through C21.
[ ] c57: need to finish, flatten line markup, and chunk into collation units with XSLT. NOTE a possible need to resequence some pages for the purposes of collation. (Note: c56 and c57 are one more or less continuous witness with gaps.)
[ ] c58: need to begin marking word boundaries and collation units (treating this as a separate witness).