FrankensteinVariorum / fv-collation

first-stage collation processing in the Frankenstein Variorum Project. For post processing and Variorum development, see our GitHub organization: https://github.com/FrankensteinVariorum
https://frankensteinvariorum.github.io/fv-collation/
GNU Affero General Public License v3.0

Dealing with dels-in-mods for pointers and Levenshtein calculations #55

Closed ebeshero closed 5 years ago

ebeshero commented 6 years ago

@raffazizzi I woke up with a better idea of what to do about the <mod>....<del>..</del><add>....</add></mod> issue. As I wrote yesterday, I'd been removing just those <del> elements from the collation files, because they were apparently all very short and added a lot of complexity to the processing. (They're complex because we want to keep long <del>........</del> elements in case they're comparable spans of text to material preserved in the novel, so we can't just ignore all deletions in the collation. But I wanted to ignore these because they were tiny and incidental, mostly false starts and small emendations, so inside the mod elements I kept only the positive material: the undeleted and added text.) Since you need the record of those characters to construct the pointers, I should either put those dels back or find a way to screen them out of the collation. Here's my idea, which I think I can implement for C-10 today if there's still time:

1) In my prep of the S-GA files for collation, I preserve and distinguish the del[parent::mod] elements under something like mDel (a weird element name that's easy to spot). I then add that name to the Python script so it is masked out of the collation tokens.

2) I might as well do the Levenshtein calculations we talked about in that same pass, so that the distance calculations are based on the normalized collation tokens and not the literal output. Then we can try to return that info, the maximum Levenshtein distance, as an attribute on the <app> element.

The second part means getting in and modifying the collateX code for its XML output to add that attribute (as we did to add type="invariant"). We've done this before... I wonder if I can manage it on the trip out, just for C-10, so we can see it?
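The pairwise calculation in step 2 can be sketched in Python as follows. This is only a minimal illustration, not the project's actual script: the function names and the witness sigla in the usage note are hypothetical, and the edit distance is a plain dynamic-programming implementation, not anything taken from collateX.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # delete ch_a
                current[j - 1] + 1,                # insert ch_b
                previous[j - 1] + (ch_a != ch_b),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def max_pairwise_distance(readings):
    """Largest distance between any two witnesses' normalized readings.

    `readings` maps a witness siglum (hypothetical names) to its
    normalized token string. Returns 0 for fewer than two witnesses.
    """
    pairs = combinations(readings.values(), 2)
    return max((levenshtein(x, y) for x, y in pairs), default=0)
```

Calling `max_pairwise_distance({"wit1": "kitten", "wit2": "sitting", "wit3": "kitten"})` compares every pair of witnesses and keeps the largest value, which is the number that would go into the attribute on <app>.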

ebeshero commented 6 years ago

Or should I just wait...and we do this after ADHO? Let me know what you think, @raffazizzi .

ebeshero commented 6 years ago

@djbpitt I want to retrieve Levenshtein calculations from the normalized tokens during the collation process. I imagine such calculations are already running in the CollateX algorithm, and just as we were able to add information that witnesses are “invariant” where their normalized tokens all agree, shouldn’t we also be able to find the maximum Levenshtein distance across all pairwise comparisons of the witnesses, and add that as an attribute in the output XML collation?

ebeshero commented 6 years ago

@djbpitt Alternatively, I imagine outputting the normalized tokens and calculating the L. distances independently, as a separate step from the collation...I'm just wondering if it's feasible and better to do this all in the same step, since the destination is the output XML anyway.
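The separate-step alternative could look something like the sketch below, which reads an already-produced collation and writes the maximum distance onto each <app>. Several things here are assumptions for illustration only: that the output has un-namespaced <app> elements with <rdg> children holding the reading text, that the text compared is already normalized, and the attribute name `maxLev`, which is hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ch_a != ch_b)))
        previous = current
    return previous[-1]


def annotate_apps(xml_text, attr="maxLev"):
    """Add the max pairwise distance among <rdg> texts to every <app>.

    Assumes (hypothetically) un-namespaced <app>/<rdg> elements.
    """
    root = ET.fromstring(xml_text)
    for app in root.iter("app"):
        readings = ["".join(rdg.itertext()).strip()
                    for rdg in app.findall("rdg")]
        distances = [levenshtein(x, y)
                     for x, y in combinations(readings, 2)]
        app.set(attr, str(max(distances, default=0)))
    return ET.tostring(root, encoding="unicode")
```

The trade-off is the one raised above: this keeps the collation step untouched, at the cost of a second pass over the output XML.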

ebeshero commented 6 years ago

Just recording the results of my chat with @djbpitt here: No, we can't pull the Levenshtein calcs from the collateX process, but this is definitely the right stage for retrieving them, and I can develop this in the same Python script.

I'm revisiting my code for prepping the S-GA files, and I see exactly what I did to remove what turns out to be a subset of the mod/del elements. Here's where I did it, along with a couple of other changes:

https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/blob/Text_Processing/collateXPrep/sga_Notebooks/Id_Trans_sgaCollatePrep.xsl

See line 31. @raffazizzi : I can run the collation again, just on unit 10 (or the whole thing) and really just keep those modified deletions coded as <mdel>. I was thinking I could do two collation passes, just on our "unit test" of C-10 at first:

1) First output: I might just treat them the same as all the other deletions and process them with the collation, to see if it makes a difference. (I suspect I did this once last spring and we'll discover the reason I removed them, but it's worth taking another look.)

2) Second output: I'll mask the <mdel> elements, meaning they'll be present but ignored during the collation, but still output so we'll have them. This second output should produce the same collation (and if it doesn't, there will be something wrong and we'll have questions for the collateX developers...)
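Masking in the second pass could be sketched like this. CollateX's JSON input lets each token carry a "t" value (the original text, passed through to the output) and an "n" value (the normalized form actually compared), so the idea is to keep <mdel> in "t" and drop it from "n". The regular expressions, the lowercasing, and the function name are assumptions for illustration, not the project's real normalization.

```python
import re

# Hypothetical patterns: an <mdel> element with its content, and any other tag.
MDEL = re.compile(r"<mdel\b[^>]*>.*?</mdel>", re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def make_token(raw):
    """Build one collation token from a raw chunk of marked-up text.

    "t" preserves the chunk exactly, <mdel> included, so the characters
    remain available for building pointers. "n" is what gets compared:
    <mdel> content is removed, remaining tags are stripped, and the text
    is lowercased (the lowercasing is an assumed normalization).
    """
    masked = MDEL.sub("", raw)
    normalized = TAG.sub("", masked).strip().lower()
    return {"t": raw, "n": normalized}


with_mdel = make_token("the <mdel>fi</mdel>fiend ")
without_mdel = make_token("the fiend ")
```

Here `with_mdel["n"]` and `without_mdel["n"]` are identical, which is why the masked pass is expected to align the same way as a pass with those deletions removed, while `with_mdel["t"]` still carries the <mdel> markup into the output.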

(By the way, I'm writing from the sky: I bought an internet pass for this flight so I have GitHub access all afternoon!)

ebeshero commented 5 years ago

Closing this, as we now output collation data in a new format, and have a method for calculating Levenshtein values.