FrankensteinVariorum / fv-collation

First-stage collation processing in the Frankenstein Variorum Project. For post-processing and Variorum development, see our GitHub organization: https://github.com/FrankensteinVariorum
https://frankensteinvariorum.github.io/fv-collation/
GNU Affero General Public License v3.0

calculate variance data #37

Closed · ebeshero closed this issue 5 years ago

ebeshero commented 6 years ago

Think about how best to do this, with the reading interface as the destination and highlighting of variants with "hot" to "cool" colors from most to least significant.

Levenshtein distance (calculated by character) seems like a good way to group variants in ranges... HOWEVER: what if we have a variant in which a one-letter difference makes two completely different semantic words (e.g. code vs. cove)?

To deal with this: test for equality of word tokens when these are not in the normalization list.

Work with the assessment of word tokens AND the Levenshtein distances to estimate the range of significance of a variant.
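A minimal sketch of that combined test, assuming a hypothetical normalization list and an arbitrary "semantic floor" score; none of these names come from the project's actual code:

```python
import Levenshtein  # python-Levenshtein, used here purely for illustration

# Hypothetical normalization list: token pairs we treat as trivial variants
# (spelling regularizations and the like), not real semantic changes.
NORMALIZATION_PAIRS = {("shew", "show"), ("any thing", "anything")}

# Illustrative floor: two distinct word tokens score at least this much,
# even when only one character differs (code vs. cove).
SEMANTIC_FLOOR = 3

def estimate_significance(token_a, token_b):
    if token_a == token_b:
        return 0
    if (token_a, token_b) in NORMALIZATION_PAIRS or (token_b, token_a) in NORMALIZATION_PAIRS:
        return 0
    return max(SEMANTIC_FLOOR, Levenshtein.distance(token_a, token_b))

print(estimate_significance("code", "cove"))  # 3, not 1
```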

ebeshero commented 5 years ago

I'm returning to this issue right now, as we're preparing finalized "spine" files holding small portions of variant text. The question is, what is the best way to calculate Levenshtein distances for our project? I think we should experiment with a weighted-Levenshtein approach that allows us to adjust edit-distance scores for distinct kinds of edits, but we need to think about how we want to do this--what makes sense for the kinds of edits we are seeing in Frankenstein.

Basically, the weighted-Levenshtein approach lets you assign different scores, if you want, to each kind of edit it detects:

- insertions (characters added)
- deletions (characters removed)
- substitutions (one character replaced with another)
- transpositions (adjacent characters swapped)

So, we've seen examples of all of these things, especially in looking at comparisons of the manuscript notebooks to the 1818 edition. The later editions make changes that are more like bulk additions and deletions. But the movement from manuscript to print seems to involve a lot of small changes to spelling, additions or deletions of punctuation, etc. And the question is, do we want to make some decisions about scoring some of these changes as more significant than others in evaluating intensity of variation?

Reality check: We have two use cases for Levenshtein calculation: (1) to control the intensity of hue in highlighting hotspots, and (2) to help us build infographics surveying which portions of the edition contain the most significant "molten" alterations. In these terms, it may not matter very much whether/how we weight a Levenshtein distance calculation. But I'd appreciate the wisdom of the group and those watching our repo on this--some of you have been working with edit distances longer than I have!

Here's documentation on a Python implementation of weighted-levenshtein that I'm likely to follow: https://weighted-levenshtein.readthedocs.io/en/master/
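To make the discussion concrete, here is a small sketch of that library's documented interface: dam_lev takes per-character cost tables indexed by ASCII code. The cost tweaks below are placeholders, not our project's actual weights:

```python
import numpy as np
from weighted_levenshtein import dam_lev  # pip install weighted-levenshtein

# Per the library's docs, each cost table is indexed by ASCII code.
insert_costs = np.ones(128, dtype=np.float64)              # cost of inserting each character
delete_costs = np.ones(128, dtype=np.float64)              # cost of deleting each character
substitute_costs = np.ones((128, 128), dtype=np.float64)   # cost of substituting a -> b
transpose_costs = np.ones((128, 128), dtype=np.float64)    # cost of swapping adjacent a, b

# Placeholder tweak: make adding or dropping punctuation cheaper than other edits.
for ch in ",.;:!?":
    insert_costs[ord(ch)] = 0.5
    delete_costs[ord(ch)] = 0.5

print(dam_lev("recieve", "receive",
              insert_costs=insert_costs,
              delete_costs=delete_costs,
              substitute_costs=substitute_costs,
              transpose_costs=transpose_costs))  # 1.0: one 'ie' -> 'ei' transposition
```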

@raffazizzi @scottbot @mdlincoln @djbpitt

ebeshero commented 5 years ago

My idea right now is that transpositions should be scored lower than additions and deletions. We see so much simple reversal of letters in the MS that it doesn't seem as semantically significant. I'm not so sure about substitutions--maybe we should keep adds, deletes, and substitutions at the same level, and score transpositions lower?

raffazizzi commented 5 years ago

I think blatant corrections (by deletion, addition, transposition, or a combination of these) in the MS (e.g. manouscript) should not be scored, or scored lowest, since they may still appear as a "variant" reading (even though, truth be told, they're not).

On the other hand, changes in words that got discarded could be scored higher, even as much as a regular variant. For example "The swift quick brown fox" is interesting even if all texts agree on "quick" because it shows the genesis of the work (I bet there is terminology regarding these phenomena in the critique génétique literature, but how far do we want to go in terms of rigor?).

It's tempting to use Levenshtein values to get an idea of variance at character level, but is that useful? I wonder if it might be best to have a handful of categories of variance and use the Levenshtein analysis to assign variants to these categories. Each category could correspond to a range of Levenshtein values and if we need to make manual corrections or judgement calls, we just pick the median value in that range.
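A toy sketch of that binning idea; the category names, ranges, and representative medians below are invented for illustration only:

```python
# Hypothetical categories of variance, each covering a range of Levenshtein
# values, with a representative (median) value for manual judgement calls.
BINS = {
    "trivial":  (0, 2, 1),
    "minor":    (3, 9, 6),
    "moderate": (10, 39, 25),
    "major":    (40, float("inf"), 100),
}

def categorize(distance):
    for name, (low, high, _median) in BINS.items():
        if low <= distance <= high:
            return name

def representative_value(category):
    # Median stand-in used when a variant is re-binned by hand.
    return BINS[category][2]

print(categorize(7), representative_value("minor"))  # minor 6
```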

ebeshero commented 5 years ago

Note presence of <del> tags as a factor...

ebeshero commented 5 years ago

After meeting with @raffazizzi online today, we're agreed that we can take the median Levenshtein values calculated for each <app> and map those back to the spine files. However, a future project may be to pull more precise locational data for each edition into the Variorum, so that when a viewer is looking at a specific edition (1831, for example), the Levenshtein score controlling the color of the highlight in that edition can be based on the user's choice of a specific pairwise comparison: 1831 might compare differently to the MS and 1818 than it does to the Thomas or the 1823 edition, and if each of those pairwise comparisons resulted in a different value, we could use that specific localized comparison value in the interface. For now, we'll work with a single median value for each app, representing a mid-range for all of the available rdgGrp comparisons.

Yes (as Raff indicates above), we'd want the values to help us identify interesting moments of overwriting, and we may want to revisit our algorithm for weighting the Levenshtein values. (At this point, I've weighted transpositions (simply switching the order of letters) from MS to 1818 at 0.25 times the value of all other edits.)
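For the record, a sketch of what that 0.25 transposition weighting looks like with the weighted-levenshtein library's dam_lev (applied uniformly to all character pairs; treat this as an approximation of our setup, not the exact code):

```python
import numpy as np
from weighted_levenshtein import dam_lev

# Score any adjacent-letter transposition at 0.25 the cost of other edits,
# leaving insertions, deletions, and substitutions at their default cost of 1.
transpose_costs = np.full((128, 128), 0.25, dtype=np.float64)

print(dam_lev("fidn", "find", transpose_costs=transpose_costs))  # 0.25 (transposition)
print(dam_lev("fond", "find", transpose_costs=transpose_costs))  # 1.0 (substitution)
```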

In real terms, Levenshtein values are simply not accurate in many places. Levenshtein distance is computed by comparing letters in sequence (and it runs "under the hood" in collateX to help us create our alignments of witnesses and rdgGrps, showing where witnesses line up). It is inaccurate in cases of marked deletions (tagged with <del> in the MS and Thomas editions), because it sees the characters of the <del> tags as small additions to the MS rather than as major moments of intervention. On the other hand, I've been adjusting for that in refining our collation output: I've been bundling a full deleted passage together inside a single <app> and combining it with the added replacement that MWS wrote in the margins. So those characters DO count as a massive addition, and Levenshtein does pick them up. In these cases, Levenshtein is actually pretty good!

I think it provides us a "ballpark" view, or a metric that's sufficient to point the way to the most intensive alterations in the text. When combined with our interventions in the collation output (moving/altering misaligned apps, positioning fully struck-out passages together in a single <app>), it is probably a pretty good measurement of character-by-character change. That said, we should plan to keep tinkering with it.

ebeshero commented 5 years ago

Quick update on this issue: We decided at our meeting with @scottbot, @Rikkm, and @raffazizzi yesterday that maximum or average Levenshtein calculations are more appropriate for our use. My preference is the maximum value since it will likely help to make the strongest variations stand out the most.

I have now output those maximum Levenshtein values and applied them to our standoff_Spine directory for use in the Variorum interface, here: https://github.com/PghFrankenstein/fv-postCollation/tree/master/standoff_Spine (This is also posted in the fv-data repo, which just holds our finalized files for the Variorum).

The more complete variant data on each rdgGrp's pairwise comparison with other rdgGrps in its app location is available in tidied up form here: https://github.com/PghFrankenstein/fv-postCollation/blob/master/edit-distance/FV_LevDists-weighted.xml

ebeshero commented 5 years ago

In the standoff_Spine directory, the Levenshtein maximum value for each <app> element is posted in its @n attribute.
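For anyone wiring these values into the interface, a minimal sketch of reading them back out; the file name is hypothetical, the TEI namespace is assumed, and the hot-to-cool scaling ceiling is arbitrary:

```python
from lxml import etree

TEI = "{http://www.tei-c.org/ns/1.0}"  # assuming the spine files are TEI-namespaced

# Hypothetical file name; the real files live in the standoff_Spine directory.
tree = etree.parse("spine_example.xml")

for app in tree.iter(TEI + "app"):
    max_lev = float(app.get("n", "0"))
    # Illustrative hot-to-cool mapping: normalize against a chosen ceiling so
    # the interface can scale highlight intensity between 0 (cool) and 1 (hot).
    intensity = min(max_lev / 100.0, 1.0)
    print(f"{max_lev:7.2f} -> intensity {intensity:.2f}")
```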