FrankensteinVariorum / fv-collation

first-stage collation processing in the Frankenstein Variorum Project. For post processing and Variorum development, see our GitHub organization: https://github.com/FrankensteinVariorum
https://frankensteinvariorum.github.io/fv-collation/
GNU Affero General Public License v3.0

S-GA line location @n on resequenced surfaces #57

Closed ebeshero closed 5 years ago

ebeshero commented 6 years ago

@raffazizzi I just noticed something I think we've not been doing right with the line locations (@n attributes) for some of the S-GA files. Do you remember how in S-GA c-56, there's a long early deletion and resequencing marked in the ms? See http://shelleygodwinarchive.org/sc/oxford/frankenstein/volume/i/#/p3/mode/xml

Back in October 2017, we carefully worked out how the text was marked to "flow" in reading order in our file here: collateXPrep/sga_Notebooks/msCollPrep_c56.xml

This affects three <surface> elements which are effectively split into two parts each by our pre-collation resequencing. So we have two sets of <surface> elements sharing the same @xml:id for ox-ms_abinger_c56-0013, ox-ms_abinger_c56-0009, and ox-ms_abinger_c56-0015
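For anyone who wants to check which surfaces are doubled, here is a rough Python sketch rather than anything from the project's own pipeline. It assumes the `<surface>` elements carry `xml:id` directly and matches on local name so it works with or without a namespace; the function name and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def duplicated_surface_ids(xml_text):
    """Return the xml:id values that appear on more than one <surface>."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        el.get(XML_ID)
        for el in root.iter()
        # Compare local names so a namespaced <surface> also matches.
        if el.tag.split("}")[-1] == "surface" and el.get(XML_ID)
    )
    return sorted(i for i, n in counts.items() if n > 1)


# Hypothetical miniature input, not the real msCollPrep_c56.xml.
sample = """<xml>
  <surface xml:id="ox-ms_abinger_c56-0013"/>
  <surface xml:id="ox-ms_abinger_c56-0009"/>
  <surface xml:id="ox-ms_abinger_c56-0013"/>
</xml>"""

print(duplicated_surface_ids(sample))
```

ElementTree does not validate ID uniqueness, so it reads the file even though the repeated `@xml:id` values make it invalid.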

For whatever reason, I opted not to change these @xml:ids to indicate something like Part-1-of-2, and I don't remember why. We may have chosen to live with duplicate @xml:ids on these doubled <surface> elements because they do literally point back to the correct file, AND because during the collation process we "drop out" the <surface> elements. They don't show up at all in the <app><rdg wit="msColl">....</rdg></app> output, and in their place, we have the @n on each <lb/> identifying the surface and line position.

Currently we derive our @n attributes for each <lb/> from each surface/@xml:id and a simple count of the line children inside. Does this cause a problem as we're looking at generating pointers and relying on the @n attributes to help locate positions in the file? It might not, because we also set <gap> elements to indicate when lines or character tokens follow after any kind of displacement. Here's an example from c56-0013:

<gap reason="resequencing" unit="lines" quantity="28"/>

We also have other kinds of <gap> elements to mark character displacement, and I am pretty sure you're looking for these to help construct your pointers. My question is, in the case of lines, shouldn't I be calculating a line number from these three surfaces (c56 0013, 0009, 0015) based on the gap information?

In case you're already doing that and I'd better not, let me know. But I don't see that it could hurt and since we've only been seriously testing one or two collation units at a time for constructing pointers until now, I think this is one of those big picture questions for stitching together the full collation of the novel! Let me know what you think. For right now I'm considering: a) leave those <surface> elements alone rather than add to their @xml:ids--that is, live with their invalidity in msCollPrep_c56.xml, and b) try correctly calculating the line numbers when they come from split surfaces (if there's a gap with @unit="lines", add the value in @quantity to the line-count that appears in @n).
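To make option (b) concrete, here is a minimal Python sketch of the arithmetic, not the project's actual transformation. It assumes un-namespaced `<surface>`, `<gap>`, and `<line>` elements, with the gap as a preceding sibling of the lines it displaces. The `surfaceId__lineNumber` format for `@n` and the function name `line_locations` are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def line_locations(surface):
    """Build one location string per <line> child of a <surface>.

    Each line's number is its position among the <line> children plus
    the @quantity of every preceding-sibling resequencing gap that is
    measured in lines.
    """
    surface_id = surface.get(XML_ID)
    offset = 0  # lines skipped by resequencing gaps seen so far
    count = 0   # simple count of <line> children seen so far
    locations = []
    for child in surface:
        if (child.tag == "gap"
                and child.get("reason") == "resequencing"
                and child.get("unit") == "lines"):
            offset += int(child.get("quantity", "0"))
        elif child.tag == "line":
            count += 1
            # Hypothetical @n format: surface id, two underscores, line number.
            locations.append(f"{surface_id}__{count + offset}")
    return locations


# Hypothetical miniature of the second half of a split surface.
sample = ET.fromstring(
    '<surface xml:id="ox-ms_abinger_c56-0013">'
    '<gap reason="resequencing" unit="lines" quantity="28"/>'
    '<line>first resequenced line</line>'
    '<line>second resequenced line</line>'
    '</surface>'
)

print(line_locations(sample))
```

With a gap of 28 lines, the first line after it is numbered 29 rather than 1, which is the position it would have on the undivided surface under these assumptions.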

This is sort of revisiting the issue of an old ticket here: #41

ebeshero commented 6 years ago

@raffazizzi Rethinking this, since one of my goals is to clean up the repo and write clear documentation of what we're doing--a problem with the doubled @xml:ids is simply that they generate confusion later. Likely we'll have some help with adding/correcting <w> (word boundary) markup in these msCollPrep files in the coming months, and people coming on board to help with our project are likely to get confused! I think it's a small thing to add some kind of indicator that these three surfaces are cut parts of a whole, and in generating the @n values for the <lb/> elements, just get the substring before the new indicator...
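As a small illustration of the "substring before the new indicator" idea, here is a Python sketch. The `_part` suffix is only a placeholder, since no indicator has been chosen yet, and `base_surface_id` is a made-up name.

```python
def base_surface_id(xml_id, marker="_part"):
    """Return the portion of xml_id before the split marker.

    The "_part" marker is a placeholder for whatever indicator is
    eventually added to the split surfaces. Ids without the marker
    are returned unchanged.
    """
    return xml_id.partition(marker)[0]


print(base_surface_id("ox-ms_abinger_c56-0013_part2"))
print(base_surface_id("ox-ms_abinger_c56-0009"))
```

One difference to watch for: XPath's `substring-before()` returns an empty string when the marker is absent, so an XSLT version would need to test for the marker first, whereas `str.partition` falls back to the whole id.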

ebeshero commented 6 years ago

...and a more realistic line number, too, based on the info we recorded in those <gap> elements.

ebeshero commented 6 years ago

With this commit: https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/commit/c5f6e37a303d32cbeb7089637e3764746c34fded , I revised the line numbers so that, in the event of a preceding-sibling <gap reason="resequencing" unit="lines">, we add the @quantity to the line count. This should mean that the line number will match its XPath position in the source S-GA file!

ebeshero commented 5 years ago

I think this is now resolved... closing.