Closed (ebeshero closed this issue 5 years ago)
Note: I need to weave this normalization list of weirdly spelled words into the Python script, too: https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/issues/28
Action on @ebb: Output fresh DETAILED collation with location tags (don't suppress the <lb/> elements) now that I've properly normalized ampersands.
@raffazizzi @Rikkm I'm running a fresh collation this morning. As I mentioned in our meeting on Tuesday, this collation will have better alignment because it properly normalizes ampersands and markup. Also, I've made sure the <lb/> elements are present so the text locations are clearly signaled.
C-10 is freshly output already in the Full_xmlOutput directory you've been working in, Raff: https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/tree/Text_Processing/collateXPrep/Full_xmlOutput
It's also sitting by itself in my unit-testing folder, C10_xmlOutput, which might be an easier place to work with it: https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/tree/Text_Processing/collateXPrep/C10_xmlOutput
(As usual I've done some reorganizing to put old collation stuff away.)
Following discussion with @raffazizzi today, I've added more normalization to a new version of the Python script. The normalization function now does two things: 1) it ignores the content of any element tags during the collation process (so <p/> elements with distinctly different "location markers" in their attributes can still align and be considered invariant), and 2) it lowercases all the input that collateX works with.
The new version of the Python script also ignores most markup. I decided for now to keep <p/> elements as inline-empty because a change in paragraphing is semantically significant. For the same reason I kept div markers, add, and del. I'm thinking we might just want to output del, but suppress add.
The new output is coming out in the LessMarkupV2_xmlOutput directory.
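To make the two normalization steps concrete, here is a minimal sketch, not the project's actual script. The regular expressions, the function names `normalize` and `tokenize`, and the `loc` attribute with its values are all hypothetical, chosen only for illustration. It assumes each tag is treated as a single token and that collateX receives pre-tokenized input in which every token carries its original form (`t`) and a normalized form (`n`) used for comparison.

```python
import re

# Assumed pattern for a start or empty-element tag:
# element name, optional attributes, optional trailing "/".
TAG_WITH_ATTRS = re.compile(r'<([A-Za-z][\w:.-]*)(\s[^<>]*?)?(/?)>')
# Assumed tokenizer: a whole tag is one token; other text splits on whitespace.
TOKEN = re.compile(r'<[^<>]+>|[^<>\s]+')


def normalize(token):
    """Return the form of a token that is compared during collation."""
    # Step 1: drop attributes, so tags that differ only in their
    # location markers compare as equal.
    without_attrs = TAG_WITH_ATTRS.sub(
        lambda m: f'<{m.group(1)}{m.group(3)}>', token)
    # Step 2: lowercase everything.
    return without_attrs.lower()


def tokenize(text):
    """Pair each original token ('t') with its normalized form ('n')."""
    return [{'t': tok, 'n': normalize(tok)} for tok in TOKEN.findall(text)]


# Two hypothetical witnesses that differ only in location marker and case.
w1 = tokenize('<p loc="A-p1"/>The Monster spoke')
w2 = tokenize('<p loc="B-p7"/>the monster spoke')
same = [a['n'] == b['n'] for a, b in zip(w1, w2)]
```

Because only the `n` values are compared, the original `t` values (including the location attributes) stay available for the output, which is one way to keep location tags visible without letting them break alignment.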
I'll run a few more variations on this theme. Question: Is it necessary/useful to have "location flag" attributes appear in the collation output on the <p/> and <lb/> elements?