FrankensteinVariorum / fv-collation

First-stage collation processing in the Frankenstein Variorum Project. For post-processing and Variorum development, see our GitHub organization:
GNU Affero General Public License v3.0

S-GA line location @n on resequenced surfaces #57

Closed ebeshero closed 5 years ago

ebeshero commented 6 years ago

@raffazizzi I just noticed something I think we've not been doing right with the line locations (@n attributes) for some of the S-GA files. Do you remember how in S-GA c-56, there's a long early deletion and resequencing marked in the ms? See

Back in October 2017, we carefully worked out how the text was marked to "flow" in reading order in our file here: collateXPrep/sga_Notebooks/msCollPrep_c56.xml

This affects three <surface> elements, each of which is effectively split into two parts by our pre-collation resequencing. So for each of ox-ms_abinger_c56-0013, ox-ms_abinger_c56-0009, and ox-ms_abinger_c56-0015, we have two <surface> elements sharing the same @xml:id.

I opted not to change these @xml:ids to indicate something like Part-1-of-2, but I don't remember why. We may have chosen to live with duplicate @xml:ids on these doubled <surface> elements because they do literally point back to the correct file, AND because during the collation process we "drop out" the <surface> elements. They don't show up at all in the <app><rdg wit="msColl">....</rdg></app> output, and in their place, we have the @n on each <lb/> identifying the surface and line position.
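To see which surfaces are affected, a quick check for repeated @xml:id values might look like the sketch below. It is only an illustration: `SAMPLE` is a made-up, heavily simplified stand-in for msCollPrep_c56.xml, the function name is hypothetical, and it assumes the <surface> elements are not in a namespace (the real files may well differ).

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, simplified stand-in for the prep file (not the real markup).
SAMPLE = """<xml>
  <surface xml:id="ox-ms_abinger_c56-0013"/>
  <surface xml:id="ox-ms_abinger_c56-0009"/>
  <surface xml:id="ox-ms_abinger_c56-0013"/>
</xml>"""


def duplicated_surface_ids(xml_text):
    """Return the sorted xml:id values carried by more than one <surface>."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        surface.get(XML_ID)
        for surface in root.iter("surface")
        if surface.get(XML_ID) is not None
    )
    return sorted(xml_id for xml_id, n in counts.items() if n > 1)


print(duplicated_surface_ids(SAMPLE))  # ['ox-ms_abinger_c56-0013']
```

ElementTree does not enforce ID uniqueness, so it reads the duplicates without complaint; a validating parser would be expected to flag them.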

Currently we derive our @n attributes for each <lb/> from each surface/@xml:id and a simple count of the line children inside. Does this cause a problem as we're generating pointers and relying on the @n attributes to help locate positions in the file? It might not, because we also set <gap> elements to indicate when lines or character tokens follow any kind of displacement. Here's an example from c56-0013:

<gap reason="resequencing" unit="lines" quantity="28"/>

We also have other kinds of <gap> elements to mark character displacement, and I am pretty sure you're looking for these to help construct your pointers. My question is, in the case of lines, shouldn't I be calculating a line number from these three surfaces (c56 0013, 0009, 0015) based on the gap information?

In case you're already doing that and I'd better not, let me know. But I don't see that it could hurt, and since we've only been seriously testing one or two collation units at a time for constructing pointers until now, I think this is one of those big-picture questions for stitching together the full collation of the novel! Let me know what you think. For right now I'm considering: a) leaving those <surface> elements alone rather than adding to their @xml:ids (that is, living with their invalidity in msCollPrep_c56.xml), and b) trying to calculate the line numbers correctly when they come from split surfaces (if there's a gap with @unit="lines", add the value in @quantity to the line count that appears in @n).

This is sort of revisiting the issue of an old ticket here: #41

ebeshero commented 6 years ago

@raffazizzi Rethinking this, since one of my goals is to clean up the repo and write clear documentation of what we're doing--a problem with the doubled @xml:ids is simply that they generate confusion later. Likely we'll have some help with adding/correcting <w> (word boundary) markup in these msCollPrep files in the coming months, and people coming on board to help with our project are likely to get confused! I think it's a small thing to add some kind of indicator that these three surfaces are cut parts of a whole, and in generating the @n values for the <lb/> elements, just get the substring before the new indicator...
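A rough sketch of the "substring before the new indicator" idea, in Python rather than XPath. The `_part` marker and the function name are hypothetical; no naming scheme has been settled here.

```python
# Hypothetical marker appended to the @xml:id of a split surface,
# e.g. "ox-ms_abinger_c56-0013_part2". The real indicator may differ.
PART_MARKER = "_part"


def base_surface_id(xml_id):
    """Strip the part indicator, leaving the original surface id."""
    base, _marker, _part = xml_id.partition(PART_MARKER)
    return base


print(base_surface_id("ox-ms_abinger_c56-0013_part2"))  # ox-ms_abinger_c56-0013
print(base_surface_id("ox-ms_abinger_c56-0045"))        # ox-ms_abinger_c56-0045
```

One difference worth keeping in mind: XPath's substring-before() returns an empty string when the marker is absent, so an XSLT version would need a contains() test for unsplit surfaces, whereas `str.partition` returns the id unchanged.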

ebeshero commented 6 years ago

...and a more realistic line number, too, based on the info we recorded in those <gap> elements.

ebeshero commented 6 years ago

With this commit, I revised the line numbers so that, in the event of a preceding-sibling <gap reason="resequencing" unit="lines">, we add the @quantity to the line count. This should mean that the line number will match its XPath position in the source S-GA file!
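For anyone following along, the adjustment can be sketched roughly as follows. This is not the code from the commit: it is a Python illustration that assumes <gap> and <line> are direct children of <surface>, that elements are un-namespaced, and that the quantities of all preceding resequencing gaps are summed. `SAMPLE` and `line_numbers` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified surface; the real prep files are richer.
SAMPLE = """<surface xml:id="ox-ms_abinger_c56-0013">
  <gap reason="resequencing" unit="lines" quantity="28"/>
  <line>first line after the resequenced passage</line>
  <line>second line</line>
</surface>"""


def line_numbers(surface):
    """Number each <line>, shifted by any preceding resequencing gaps."""
    offset = 0   # lines skipped by preceding-sibling gaps (assumed cumulative)
    count = 0    # simple count of <line> children seen so far
    numbers = []
    for child in surface:
        is_line_gap = (
            child.tag == "gap"
            and child.get("reason") == "resequencing"
            and child.get("unit") == "lines"
        )
        if is_line_gap:
            offset += int(child.get("quantity", "0"))
        elif child.tag == "line":
            count += 1
            numbers.append(offset + count)
    return numbers


print(line_numbers(ET.fromstring(SAMPLE)))  # [29, 30]
```

With the 28-line gap from c56-0013, the first line after the gap is numbered 29 instead of 1, which is the position it would have in the unsplit surface.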

ebeshero commented 5 years ago

I think this is now resolved...closing.