FrankensteinVariorum / fv-collation

first-stage collation processing in the Frankenstein Variorum Project. For post processing and Variorum development, see our GitHub organization: https://github.com/FrankensteinVariorum
https://frankensteinvariorum.github.io/fv-collation/
GNU Affero General Public License v3.0

Flattening Stuff Even More: re Massive error in chunking SGA-MS files for Collation #46

Closed ebeshero closed 5 years ago

ebeshero commented 6 years ago

@raffazizzi This afternoon I discovered, to my horror, a massive problem in our most recent set of collation files: we have duplicated material in each collation "chunk." (You may have noticed this, too.) I've now figured out what caused the problem, and it has to do with leaving too much XML hierarchy in place in the S-GA files before "chunking" them into separate collation units. Some collation units begin in the middle of a page, and because each page is still held in an intact <zone> element (carrying children and descendants), a chunk boundary can fall inside a <zone>. I was foolishly chunking along the following:: axis, but of course that can't produce good output because my "chunks" aren't being broken off at the same hierarchical level. To generate chunks without duplicate material, I have to break them off at a single level, using the following-sibling:: axis. (I have no idea why the output contains duplicate material instead of failing with an error saying that a group created from elements along the following:: axis can't possibly be well-formed, but at least I know what caused the problem!)
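A minimal sketch of one plausible way this duplication can arise; it is an assumption for illustration, not a diagnosis confirmed in this thread. In XPath, the following:: axis returns a container element and then each of its descendants as separate nodes, so deep-copying every node on that axis copies nested text more than once. The Python below imitates the two axes with the standard library (elements only). The `anchor` element and the `loc` values are hypothetical stand-ins for the project's real collation markers.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested input: the second anchor sits inside an intact <zone>.
NESTED = (
    '<root>'
    '<anchor n="c1"/>'
    '<zone><line>a</line><anchor n="c2"/><line>b</line></zone>'
    '<zone><line>c</line></zone>'
    '</root>'
)

# Hypothetical flattened input: zones are empty Start/End markers,
# so every anchor is a sibling of the content it delimits.
FLAT = (
    '<root>'
    '<anchor n="c1"/>'
    '<zone loc="z1--Start"/><line>a</line>'
    '<anchor n="c2"/>'
    '<line>b</line><zone loc="z1--End"/>'
    '<zone loc="z2--Start"/><line>c</line><zone loc="z2--End"/>'
    '</root>'
)


def following(root, node):
    """Imitate XPath following:: for elements: everything after `node`
    in document order, excluding the node's own descendants."""
    order = list(root.iter())
    own = {id(e) for e in node.iter()}  # the node and its descendants
    start = order.index(node)
    return [e for e in order[start + 1:] if id(e) not in own]


def chunk_text_following(root, anchor):
    """Deep-copy the text of each following:: element up to the next anchor."""
    parts = []
    for elem in following(root, anchor):
        if elem.tag == 'anchor':
            break
        parts.append(''.join(elem.itertext()))
    return ''.join(parts)


def chunk_text_siblings(root, anchor):
    """Imitate following-sibling:: up to the next anchor.
    Assumes `anchor` is a direct child of `root`."""
    siblings = list(root)
    start = siblings.index(anchor)
    parts = []
    for elem in siblings[start + 1:]:
        if elem.tag == 'anchor':
            break
        parts.append(''.join(elem.itertext()))
    return ''.join(parts)


nested = ET.fromstring(NESTED)
flat = ET.fromstring(FLAT)
print(chunk_text_following(nested, nested.find('anchor')))        # aba
print(chunk_text_siblings(flat, flat.find("anchor[@n='c1']")))    # a
```

In the nested case the first chunk picks up the whole <zone> (text "ab") and then its first <line> again (text "a"), so material is repeated and the chunk runs past the second anchor. In the flattened case the same chunk contains only "a".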

The solution is that I need to flatten the S-GA files still further, after we've planted the "flags" for the @n values. I think this is unlikely to change much for you, because you're working with the output of the collation, where all the element content from S-GA gets "flattened" anyway. You'll likely see fewer strange duplications, too. I'm sure you ran into some strangeness as you were working!

I'm going to try flattening everything so all my collation anchors are at the same sibling level, and reprocess the chunks, and reprocess the collation. I'll ping you when that's done! Rather than overwrite the current collation files you've been working with, I'll just rename that directory as "flawed" but leave it in place for the moment in case we need it for anything.

ebeshero commented 6 years ago

@raffazizzi Since I'm running a fresh collation, is there anything else you can think of that I should change in the collation chunks?

ebeshero commented 6 years ago

@raffazizzi I'm settling down to work on this now, and decided I'd really better flatten everything a little more and set more signals of text location in the other files representing the published Frankenstein volumes. Here's what I'm doing:

I'm adding some adjustments to flatten the <p> elements in those files and signal their location in the text in attributes, like this:

<p loc="Vol1_letter1_n5--Start"/>text of the paragraph...  <p loc="Vol1_letter1_n5--End"/>

This way, when the <p> tags are separated from one another in the output collation, we can reconstitute them on the other side by finding where the strings match.
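A rough sketch of the flatten-and-reconstitute idea, under simplifying assumptions: each <p> holds plain text only, and the function names `flatten_paragraphs` and `rebuild_paragraphs` are hypothetical, not the project's actual scripts. The `loc` values follow the pattern shown above (a location prefix, `_n` plus the paragraph number, then `--Start` or `--End`).

```python
import xml.etree.ElementTree as ET


def flatten_paragraphs(div, prefix):
    """Turn each <p>text</p> child of `div` into
    <p loc="...--Start"/>text<p loc="...--End"/> at one sibling level."""
    flat = ET.Element(div.tag, div.attrib)
    for n, p in enumerate(div.findall('p'), start=1):
        loc = f'{prefix}_n{n}'
        start = ET.SubElement(flat, 'p', {'loc': f'{loc}--Start'})
        # The paragraph text now sits between the two empty markers.
        start.tail = ''.join(p.itertext())
        ET.SubElement(flat, 'p', {'loc': f'{loc}--End'})
    return flat


def rebuild_paragraphs(flat):
    """Pair each --Start marker with the --End marker carrying the same
    location string and wrap the text between them in a <p> again.
    Markup between the markers is dropped; only its text is kept."""
    restored = ET.Element(flat.tag, flat.attrib)
    open_loc, buffer = None, []
    for child in flat:
        loc = child.get('loc', '')
        if child.tag == 'p' and loc.endswith('--Start'):
            open_loc = loc[:-len('--Start')]
            buffer = [child.tail or '']
        elif child.tag == 'p' and open_loc is not None and loc == f'{open_loc}--End':
            p = ET.SubElement(restored, 'p', {'loc': open_loc})
            p.text = ''.join(buffer)
            open_loc = None
        elif open_loc is not None:
            buffer.append(''.join(child.itertext()) + (child.tail or ''))
    return restored


source = ET.fromstring('<div><p>First paragraph.</p><p>Second one.</p></div>')
flat = flatten_paragraphs(source, 'Vol1_letter1')
print([child.get('loc') for child in flat])
print([p.text for p in rebuild_paragraphs(flat)])
```

Matching on the shared location string, rather than on position, is what should let the markers be paired up again even after the collation has separated them.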

I need to do similar things with the elements signalling page boundaries in the S-GA notebooks, because some of the collation unit boundaries in c-57 are (inevitably) falling inside page zones.

ebeshero commented 6 years ago

Resolved! All input files are freshly flattened. <p/> elements are now flat, as are <surface/>, <zone/>, <mod/>, <add/>, and <del/>. All of these have new attributes to help with distinctly identifying and mapping to their locations in the text. A new collation process is now running, with output files here in the Text_Processing branch: https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/tree/Text_Processing/collateXPrep/Full_xmlOutput

I think I've resolved the problem with generating the collation chunks, but I'm leaving this issue open for comment since @raffazizzi will need to look at new things in the output files.

raffazizzi commented 6 years ago

Looking good on C07.

ebeshero commented 6 years ago

I think the collation looks a lot cleaner now--let me know if you see anything you'd like to change in the source files going in. All 33 of the main collation chunks are now processed. I still need to process the extra short fragments (with c58 to c57, for example).