FrankensteinVariorum / fv-collation

First-stage collation processing in the Frankenstein Variorum Project. For post-processing and Variorum development, see our GitHub organization: https://github.com/FrankensteinVariorum
https://frankensteinvariorum.github.io/fv-collation/
GNU Affero General Public License v3.0

Inconsistent xml:ids vs n attributes in S-GA transformation to collation XML #41

Closed by ebeshero 6 years ago

ebeshero commented 6 years ago

@raffazizzi observes: some <lb/> elements have unpredictable @n attributes that probably come from the <line>'s xml:id. Can we switch to always generating a regular @n, and perhaps keep the @xml:id as @xml:id?

For example:

<line> --> <lb n="ms-surface_zone_linenum" />

<line xml:id="ID"> --> <lb n="ms-surface_zone_linenum" xml:id="ID" />

@ebeshero: to investigate how we're generating these and take a look at the input: where's the inconsistency coming from? Are @xml:ids always present--and how are we deriving them? I'm thinking the inconsistency might be coming from the lines within marginal text.
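To make the proposed mapping concrete, here is a minimal sketch in Python rather than XSLT. It assumes un-namespaced elements for brevity (the real S-GA files are TEI), and `line_to_lb` is a hypothetical helper, not something in our stylesheet. Every <line> gets an <lb/> with a regular @n, and the @xml:id is copied only when the input line has one.

```python
import xml.etree.ElementTree as ET

# Clark-notation name that ElementTree uses for the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def line_to_lb(line, locator):
    """Return an <lb/> for one <line>.

    Hypothetical helper: @n is always set to the supplied locator,
    and @xml:id is carried over only if the input line had one.
    """
    lb = ET.Element("lb", {"n": locator})
    line_id = line.get(XML_ID)
    if line_id is not None:
        lb.set(XML_ID, line_id)
    return lb


# Two toy inputs mirroring the two cases in the example above.
plain = ET.fromstring("<line>some text</line>")
with_id = ET.fromstring('<line xml:id="ID">some text</line>')

lb_plain = line_to_lb(plain, "ms-surface_zone_linenum")
lb_with_id = line_to_lb(with_id, "ms-surface_zone_linenum")

print(ET.tostring(lb_plain, encoding="unicode"))
print(ET.tostring(lb_with_id, encoding="unicode"))
```

The point of keeping the two attributes separate is that @n stays predictable for every line, while any identifier already in the source survives unchanged.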

ebeshero commented 6 years ago

@raffazizzi Here's my ID transformation stylesheet where we're adding the "page location flags" to each line: https://github.com/ebeshero/Pittsburgh_Frankenstein/blob/master/collateXPrep/Id_Trans_sgaMSLocators.xsl

And here's a sample of the input code (with S-GA <line> elements) for reference: https://github.com/ebeshero/Pittsburgh_Frankenstein/blob/master/collateXPrep/sga_Notebooks/msCollPrep_c57.xml

The S-GA markup doesn't have @xml:ids on the lines, but instead they're on <surface> and <zone> and <anchor>: we derived the value of @n from those elements, and I stuck a number at the end based on the count of preceding-sibling <line> elements. We can easily adjust this, but the inconsistencies, I think, have to do with whether the <line> was inside a margin zone or in a main page zone.
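As a rough illustration of that derivation (a Python sketch, not the XSLT itself): the sample markup, the surface ID `c57-0001`, the use of the zone's @type as its name, and the choice to number lines from 1 are all assumptions made for the example. The line number is the count of preceding-sibling <line> elements within the same zone, plus one.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, un-namespaced sample: one main zone and one margin zone.
SAMPLE = """
<surface xml:id="c57-0001">
  <zone type="main">
    <line>first line</line>
    <line>second line</line>
  </zone>
  <zone type="left_margin">
    <line>marginal addition</line>
  </zone>
</surface>
""".strip()


def derive_locators(surface, sep="_"):
    """Build one locator per <line>: surface id, zone name, line position."""
    surface_id = surface.get(XML_ID)
    locators = []
    for zone in surface.iter("zone"):
        zone_name = zone.get("type")
        # findall() returns only the zone's direct <line> children, so the
        # position equals count(preceding-sibling::line) + 1.
        for position, _line in enumerate(zone.findall("line"), start=1):
            locators.append(sep.join([surface_id, zone_name, str(position)]))
    return locators


locators = derive_locators(ET.fromstring(SAMPLE))
for value in locators:
    print(value)
```

Because the count restarts in each zone, a margin line and a main-zone line can share the same number, so the zone name has to be part of the value to keep the locators distinct.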

raffazizzi commented 6 years ago

@ebeshero you're indeed correct: the issue had to do with which zone the line came from. I made some adjustments to Id_Trans_sgaMSLocators.xsl to include the correct name of the zone (https://github.com/ebeshero/Pittsburgh_Frankenstein/commit/44f1d85243ae75f7bacd5900db52387c16e21aa0). I also adopted __ as the separator instead of _ because the latter can occur in zone types (left_margin).
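A small Python sketch of why the separator matters, using a made-up surface ID (`c57-0001`) and line number: with a single underscore, a zone type such as left_margin is split into two fields when the locator is parsed back into its parts, whereas a double underscore round-trips cleanly.

```python
# Hypothetical locator parts: surface id, zone type, line number.
parts = ["c57-0001", "left_margin", "3"]

single = "_".join(parts)   # single-underscore separator
double = "__".join(parts)  # double-underscore separator

# Splitting on "_" also breaks the zone type apart, giving four fields.
single_fields = single.split("_")

# Splitting on "__" recovers exactly the three original parts.
double_fields = double.split("__")

print(single, single_fields)
print(double, double_fields)
```

This only holds as long as no individual part contains a double underscore itself, which is an assumption about the S-GA identifiers and zone types.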

Would you be able to run the script and regenerate collation chunks?

ebeshero commented 6 years ago

@raffazizzi Huzzah! Glad we figured out the issue--Okay, for me to re-run the collation will take several hours (seriously--it's a time-consuming process at this point with comparing all five editions for all 33 units). Would it help you if I sent you a selection of collation units in the next hour or two? (Or is it okay if you get the full collation again like tomorrow AM?)

raffazizzi commented 6 years ago

Yes to both! I can work on this more tomorrow afternoon. I've been experimenting with chunk 15, but possibly any other chunk including manuscript material will work for me! Thanks :grin:

ebeshero commented 6 years ago

Ahh--sorry--just starting it now (= meeting ran long). I’ll run collation unit 15 first and push to GitHub.

ebeshero commented 6 years ago

@raffazizzi Just to be clear--right now do you ONLY need the collation chunks, rather than the collation itself to be re-run? Well, I'll push the chunks to be collated in a few minutes--for the entire novel!

I'll also start reprocessing a full collation so we have output, too. But I'll start that with just collation unit 15.

raffazizzi commented 6 years ago

I need the collation itself so that I can work on converting rdgs with SGA content to rdgs with pointers to SGA.

ebeshero commented 6 years ago

@raffazizzi Got it... processing...stay tuned!

raffazizzi commented 6 years ago

These files: https://github.com/ebeshero/Pittsburgh_Frankenstein/tree/Text_Processing/collateXPrep/Full_xmlOutput

ebeshero commented 6 years ago

@raffazizzi I just a) ran your new identity transformation to set the new collation flags for S-GA, and then b) produced fresh collation units from that. Then I c) re-ran the collation chunking to make just C15, the one you were working on. That's all ready now from here: https://github.com/ebeshero/Pittsburgh_Frankenstein/tree/133011d3ba910bd01ca6326b02f3843cfde823b1/collateXPrep/Full_xmlOutput_C15

I'm starting the full collation now, and you'll have a fresh load of collated files to work with by tomorrow.

ebeshero commented 6 years ago

@raffazizzi There's a complete new collation set now for you to work with here: https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/tree/Text_Processing/collateXPrep/Full_xmlOutput

Note: The collation output actually isn't quite complete yet. There's a little cluster of fragment files from S-GA that require special collation with the rest of the set. So, for example, for collation unit 20, there's a little fragment in the Bodleian c57 that is a reworking of some pages also in c57, so I generated it as a separate fragment file. That's true of four or five other collation units, and really c58 is another "frag" witness. I've prepared a special directory to generate another set of collations to work in these fragmentary witnesses together with all the others. These will have more rdg witnesses than the others--I'll prep the collation and run it later today.