FrankensteinVariorum / fv-collation

First-stage collation processing in the Frankenstein Variorum Project. For post-processing and Variorum development, see our GitHub organization: https://github.com/FrankensteinVariorum
https://frankensteinvariorum.github.io/fv-collation/
GNU Affero General Public License v3.0

Refining the collation #48

Closed: ebeshero closed 5 years ago

ebeshero commented 6 years ago

Following discussion with @raffazizzi today, I've added more normalization to a new version of the Python script. The normalization function now does two things: 1) it ignores whatever sits inside an element tag besides the element name during the collation process (so <p/> elements with distinctly different "location markers" in their attributes can still align and be considered invariant); 2) it lowercases all the input that collateX works with.
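
A minimal sketch of that kind of normalization, for readers following along. This is not the project's actual script: the names `normalize`, `to_token`, and `TAG_WITH_ATTRS` are hypothetical, and the sketch assumes tokens are passed to the collator as pairs of an original reading ("t") and a normalized form ("n"), which is the shape collateX accepts for pre-tokenized JSON input.

```python
import re

# Hypothetical pattern: matches a start or empty-element tag that carries
# attributes, capturing the element name and an optional self-closing slash.
TAG_WITH_ATTRS = re.compile(r'<([A-Za-z_][\w.-]*)\s+[^<>]*?(/?)>')


def normalize(token):
    """Return the form of a token used for comparison only.

    1) Attributes are dropped from any tag, so two <p/> elements with
       different location markers compare as equal.
    2) Everything is lowercased.
    """
    without_attrs = TAG_WITH_ATTRS.sub(r'<\1\2>', token)
    return without_attrs.lower()


def to_token(text):
    """Pair the original reading ("t") with its normalized form ("n")."""
    return {"t": text, "n": normalize(text)}
```

Because only the "n" value is compared, the original reading with its attributes and capitalization is still available for the output; the attribute values used in any test of this sketch are made up.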

The new version of the Python script also ignores most markup. For now I decided to keep <p/> elements as inline-empty markers, because a change in paragraphing is semantically significant. For the same reason I kept div markers, add, and del. I'm thinking we might want to output only del and suppress add.
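
One way to picture "ignore most markup but keep a short list" is an allow-list filter, sketched below. This is an illustration under assumptions, not the script itself: `KEEP`, `ANY_TAG`, and `drop_unlisted_markup` are hypothetical names, the div markers are assumed to be elements literally named `div`, and a regular expression stands in for whatever XML handling the real script uses.

```python
import re

# Hypothetical allow-list: the element names whose tags stay in the input.
KEEP = {"p", "div", "add", "del"}

# Matches any start, end, or empty-element tag and captures its name.
ANY_TAG = re.compile(r'</?([A-Za-z_][\w.-]*)[^<>]*>')


def drop_unlisted_markup(text, keep=KEEP):
    """Remove tags whose element name is not in `keep`.

    Only the tags are removed; the text between them is left in place.
    """
    def keep_or_drop(match):
        return match.group(0) if match.group(1) in keep else ""

    return ANY_TAG.sub(keep_or_drop, text)
```

Suppressing add while still outputting del would then be a matter of removing `"add"` from the set passed as `keep`.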

The new output is written to the LessMarkupV2_xmlOutput directory.

I'll run a few more variations on this theme. Question: Is it necessary/useful to have "location flag" attributes appear in the collation output on the <p/> and <lb/> elements?

ebeshero commented 6 years ago

Note: I need to weave this normalization list of weirdly spelled words into the Python script, too: https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/issues/28
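
A rough sketch of how such a list could be woven into the normalization step. The function name `apply_spelling_map` is hypothetical, and the two entries in `PLACEHOLDER_SPELLINGS` are invented placeholders for illustration only; they are not taken from the list in the linked issue.

```python
# Invented placeholder entries, NOT the project's actual list.
PLACEHOLDER_SPELLINGS = {
    "shew": "show",
    "chuse": "choose",
}


def apply_spelling_map(word, spelling_map):
    """Lowercase a word and replace it with its regularized spelling, if listed.

    Words that are not in the map are returned lowercased and otherwise
    unchanged, so the map only needs to hold the unusual spellings.
    """
    lowered = word.lower()
    return spelling_map.get(lowered, lowered)
```

Keeping the map as plain data, separate from the function, would let the list grow without touching the script's logic.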

ebeshero commented 6 years ago

Action on @ebb: Output fresh DETAILED collation with location tags (don't suppress the <lb/> elements) now that I've properly normalized ampersands.

ebeshero commented 6 years ago

@raffazizzi @Rikkm I'm running a fresh collation this morning. As I mentioned in our meeting on Tuesday, this collation will have better alignment because it's properly normalizing ampersands and markup. Also, I've made sure the <lb/> elements are present so the text locations are clearly signaled.

C-10 is freshly output already in the Full_xmlOutput directory you've been working in, Raff: https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/tree/Text_Processing/collateXPrep/Full_xmlOutput

It's also sitting by itself in my unit-testing folder, C10_xmlOutput, which might be an easier place to work with it on its own: https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/tree/Text_Processing/collateXPrep/C10_xmlOutput

(As usual I've done some reorganizing to put old collation stuff away.)