Closed ebeshero closed 5 years ago
@raffazizzi Rethinking this, since one of my goals is to clean up the repo and write clear documentation of what we're doing--a problem with the doubled @xml:ids is simply that they generate confusion later. Likely we'll have some help with adding/correcting <w> (word boundary) markup in these msCollPrep files in the coming months, and people coming on board to help with our project are likely to get confused! I think it's a small thing to add some kind of indicator that these three surfaces are cut parts of a whole, and in generating the @n values for the <lb/> elements, just get the substring before the new indicator...
...and a more realistic line number, too, based on the info we recorded in those <gap> elements.
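For anyone documenting this later, here is a minimal Python sketch of the "substring before the new indicator" idea. The marker string `__part` is purely hypothetical (the thread does not say which indicator was chosen), and the function name is made up for illustration.

```python
# Hypothetical marker separating a surface id from its part label.
# The actual indicator used in the repo is not stated in this thread.
PART_MARKER = "__part"


def base_surface_id(xml_id: str, marker: str = PART_MARKER) -> str:
    """Return the part of xml_id that precedes the marker.

    str.partition returns the whole string as its first item when the
    marker is absent, so ids of unsplit surfaces pass through unchanged.
    """
    return xml_id.partition(marker)[0]


if __name__ == "__main__":
    print(base_surface_id("ox-ms_abinger_c56-0013__part1"))
    print(base_surface_id("ox-ms_abinger_c56-0009"))
```

One design note: XPath's `substring-before()` returns an empty string when the marker is missing, so an XSLT version would probably need a `contains()` check for the surfaces that are not split. The Python version above sidesteps that by falling back to the full id.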
With this commit: https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/commit/c5f6e37a303d32cbeb7089637e3764746c34fded, I revised the line numbers so that, in the event of a preceding-sibling <gap reason="resequencing" unit="lines">, we add the @quantity to the line count. This should mean that the line number will match its XPath position in the source S-GA file!
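To illustrate the rule described above (not the code in the commit itself), here is a rough Python sketch. It assumes a simplified, un-namespaced sample in which `<gap>` and `<line>` are sibling children of a `<surface>`; the `quantity="12"` value and the line text are invented, and summing every preceding resequencing gap is my assumption for the case where there is more than one.

```python
import xml.etree.ElementTree as ET

# Invented sample: structure and quantity are assumptions for illustration.
SAMPLE = """
<surface xml:id="ox-ms_abinger_c56-0013">
  <gap reason="resequencing" unit="lines" quantity="12"/>
  <line>first line after the jump</line>
  <line>second line after the jump</line>
</surface>
"""

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def line_numbers(surface):
    """Number each <line> child, shifted by preceding resequencing gaps."""
    offset = 0   # total @quantity of resequencing gaps seen so far
    count = 0    # plain count of <line> children seen so far
    numbers = []
    for child in surface:
        if (child.tag == "gap"
                and child.get("reason") == "resequencing"
                and child.get("unit") == "lines"):
            offset += int(child.get("quantity", "0"))
        elif child.tag == "line":
            count += 1
            numbers.append(count + offset)
    return numbers


if __name__ == "__main__":
    surface = ET.fromstring(SAMPLE.strip())
    print(surface.get(XML_ID), line_numbers(surface))
```

With the invented gap of 12 lines, the two lines in the sample are numbered 13 and 14 rather than 1 and 2, which is the kind of shift that should line the number up with the position in the source file.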
I think this is now resolved... closing.
@raffazizzi I just noticed something I think we've not been doing right with the line locations (@n attributes) for some of the S-GA files. Do you remember how in S-GA c-56, there's a long early deletion and resequencing marked in the ms? See http://shelleygodwinarchive.org/sc/oxford/frankenstein/volume/i/#/p3/mode/xml
Back in October 2017, we carefully worked out how the text was marked to "flow" in reading order in our file here: collateXPrep/sga_Notebooks/msCollPrep_c56.xml
This affects three <surface> elements which are effectively split into two parts each by our pre-collation resequencing. So we have two sets of <surface> elements sharing the same @xml:id for ox-ms_abinger_c56-0013, ox-ms_abinger_c56-0009, and ox-ms_abinger_c56-0015.
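As a quick way to confirm which surfaces are affected, something like the following Python sketch could list the duplicated ids. The sample document is invented (un-namespaced, with a made-up `<root>` wrapper and ordering); a real msCollPrep file would need its actual namespace and structure handled.

```python
from collections import Counter
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample: wrapper element and surface order are assumptions.
SAMPLE = """<root>
  <surface xml:id="ox-ms_abinger_c56-0013"/>
  <surface xml:id="ox-ms_abinger_c56-0009"/>
  <surface xml:id="ox-ms_abinger_c56-0013"/>
</root>"""


def duplicated_surface_ids(root):
    """Return the sorted xml:id values carried by more than one <surface>."""
    counts = Counter(s.get(XML_ID) for s in root.iter("surface"))
    return sorted(i for i, n in counts.items() if i is not None and n > 1)


if __name__ == "__main__":
    print(duplicated_surface_ids(ET.fromstring(SAMPLE)))
```

ElementTree does not validate ID uniqueness, so it parses the doubled ids without complaint, which is what makes a check like this possible.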
For whatever reason, I opted not to change these @xml:ids to indicate something like Part-1-of-2, but I don't remember why I never changed them. We may have chosen to live with duplicate @xml:ids on these doubled <surface> elements because they do literally point back to the correct file, AND because during the collation process we "drop out" the <surface> elements. They don't show up at all in the <app><rdg wit="msColl">....</rdg></app> output, and in their place, we have the @n on each <lb/> identifying the surface and line position.
Currently we derive our
@n attributes for each <lb/> from each surface/@xml:id and a simple count of the line children inside. Does this cause a problem as we're looking at generating pointers and relying on the @n attributes to help locate the position in the file? It might not, because we also set <gap> elements to indicate when lines or character tokens follow after any kind of displacement. Here's an example from c56-0013:
We also have other kinds of
<gap> elements to mark character displacement, and I am pretty sure you're looking for these to help construct your pointers. My question is, in the case of lines, shouldn't I be calculating a line number from these three surfaces (c56 0013, 0009, 0015) based on the gap information?
In case you're already doing that and I'd better not, let me know. But I don't see that it could hurt, and since we've only been seriously testing one or two collation units at a time for constructing pointers until now, I think this is one of those big picture questions for stitching together the full collation of the novel! Let me know what you think. For right now I'm considering: a) leave those
<surface> elements alone rather than add to their @xml:ids--that is, live with their invalidity in msCollPrep_c56.xml, and b) try correctly calculating the line numbers when they come from split surfaces (if there's a gap with @unit="line", add the value in @quantity to the line-count that appears in @n).
This is sort of revisiting the issue of an old ticket here: #41