Closed ebeshero closed 5 years ago
Or should I just wait...and we do this after ADHO? Let me know what you think, @raffazizzi .
@djbpitt I want to retrieve Levenshtein calculations from the normalized tokens during the collation process. I imagine such calculations are already running in the CollateX algorithm—and just as we were able to add information that witnesses are “invariant” where their normalized tokens all agree, shouldn’t we also be able to find the maximum L. distance among each pair-wise comparison of the witnesses, and add that in an attribute in the output XML collation?
@djbpitt Alternatively, I imagine outputting the normalized tokens and calculating the L. distances independently, as a separate step from the collation...I'm just wondering if it's feasible and better to do this all in the same step, since the destination is the output XML anyway.
Just recording the results of my chat with @djbpitt here: No, we can't pull the Levenshtein calculations from the CollateX process, but this is definitely the right stage for retrieving them, and I can develop this in the same Python script.
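For reference, a minimal sketch of the calculation discussed above: the maximum Levenshtein distance over every pair-wise comparison of the witnesses' normalized tokens. This is not the code from the collation script; the function names and the witness sigla in the usage note are hypothetical, and the distance routine is the standard dynamic-programming version written out so the example has no dependencies.

```python
from itertools import combinations
from typing import Dict


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def max_pairwise_distance(readings: Dict[str, str]) -> int:
    """Largest distance among all pairs of normalized readings.

    `readings` maps a witness siglum (hypothetical keys) to the
    normalized token string for one collation unit.
    Returns 0 when there are fewer than two readings.
    """
    pairs = combinations(readings.values(), 2)
    return max((levenshtein(x, y) for x, y in pairs), default=0)
```

Calling `max_pairwise_distance({"wA": "kitten", "wB": "sitting", "wC": "kitten"})` compares all three pairs and keeps the largest value, so a unit where every normalized token agrees comes back as 0.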
I'm revisiting my code for prepping the S-GA files and see exactly what I did to remove what turns out to be a subset of the mod/del elements. Here's where I did it, with a couple of other changes:
See line 31. @raffazizzi : I can run the collation again, just on unit 10 (or the whole thing) and really just keep those modified deletions coded as <mdel>. I was thinking I could do two collation passes, just on our "unit test" of C-10 at first:
1) First output: I might just treat them the same as all the other deletions and process them with the collation, to see if it makes a difference. (I suspect I did this once last spring and we'll discover the reason I removed them, but it's worth taking another look.)
2) Second output: I'll mask the <mdel> elements, meaning they'll be present but ignored during the collation, but still output so we'll have them. This second output should produce the same collation (and if it doesn't, there will be something wrong and we'll have questions for the collateX developers...)
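A rough sketch of what "present but ignored" could look like for the second pass, assuming the witnesses go to CollateX as pre-tokenized JSON, where each token carries its original text in `"t"` and the normalized form used for alignment in `"n"`. The regex, the function names, and the sample strings are hypothetical; the actual script may normalize differently.

```python
import re

# Matches a whole <mdel>...</mdel> span, with or without attributes.
MDEL = re.compile(r"<mdel\b[^>]*>.*?</mdel>", re.DOTALL)


def normalize(raw: str) -> str:
    """Drop <mdel> spans and collapse whitespace for comparison only."""
    masked = MDEL.sub("", raw)
    return " ".join(masked.split())


def make_token(raw: str) -> dict:
    """Build one token: "t" keeps the markup, "n" has it masked out."""
    return {"t": raw, "n": normalize(raw)}


def make_witness(siglum: str, raw_tokens: list) -> dict:
    """Assemble one witness entry for a pre-tokenized JSON input."""
    return {"id": siglum, "tokens": [make_token(t) for t in raw_tokens]}
```

With this shape, a token such as `the <mdel>monst</mdel>creature` keeps its full markup in `"t"` for the output while aligning on `the creature`. One thing to watch: a token that consists only of an `<mdel>` normalizes to an empty string, and how the collation handles empty `"n"` values would need checking.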
(By the way, I'm writing from the sky: I bought an internet pass for this flight so I have GitHub access all afternoon!)
Closing this, as we now output collation data in a new format, and have a method for calculating Levenshtein values.
@raffazizzi I woke up with a better idea of what to do about the <mod>....<del>..</del><add>....</add></mod> issue. I'd been (as I wrote yesterday) removing just those <del> elements from the collation files because they were apparently all very short and added a lot of complexity to the processing. (They're complex because we want to keep long <del>........</del> elements in case they're comparable spans of text to stuff preserved in the novel, so we don't want to just ignore all deletions in the collation. But I wanted to ignore these because they were tiny and incidental, as in lots of false starts and tiny emendations, so I kept inside the mod elements only the positive, undeleted and added material.) Since we need the record of the characters here for you to construct the pointers, I should put those dels back, or find a way to screen them in the collation, and here's my idea. I think I can implement it for C-10 today if there's still time!
1) In my prep of S-GA files for collation, I preserve and distinguish the del[parent::mod] elements as something like mDel (easy to see as a weird element name). I add that element to the Python script so it is masked out of the collation tokens.
2) I might as well be doing the Levenshtein calculations we talked about in that pass as well, to base the distance calculations on the normalized collation tokens and not the literal output. And then we can try to return that info, the max Levenshtein calc, as an attribute on the <app> element.
The second part of this means getting in and modifying the collateX algorithm for its XML output to add that attribute (like we did to add type="invariant"). We did this before...I wonder if I can manage to do this on the trip out, just for C-10 so we can see it?
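To make the target concrete, here is a sketch of the attribute ending up on each <app>. It is a stand-in, not the plan described above: instead of modifying collateX's XML output, it post-processes a finished XML string, and it assumes un-namespaced <app> and <rdg> elements. The attribute name `maxLevenshtein` and the `normalize` hook are hypothetical; by default the hook only strips whitespace, so the same normalization used for the collation tokens would need to be passed in to match the intent.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def annotate_apps(xml_text: str, normalize=str.strip,
                  attr: str = "maxLevenshtein") -> str:
    """Set `attr` on every <app> to the max pair-wise reading distance."""
    root = ET.fromstring(xml_text)
    for app in root.iter("app"):
        # Flatten each reading to text, then normalize before comparing.
        readings = [normalize("".join(rdg.itertext()))
                    for rdg in app.findall("rdg")]
        distance = max((levenshtein(x, y)
                        for x, y in combinations(readings, 2)), default=0)
        app.set(attr, str(distance))
    return ET.tostring(root, encoding="unicode")
```

An <app> whose readings all normalize to the same string gets a value of 0, which should line up with the units already marked type="invariant".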