Closed ebeshero closed 5 years ago
Here's a basic idea for the first, super-easy "spinal column" TEI output, holding just collation unit 10 for the moment: https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/blob/Text_Processing/collateXPrep/C10_xmlOutput/teiUp-tester.xml
Hello!
Sounds good, though after we have a (curated) stand-off collation pointing to all the documents, we should be able to generate this TEI by resolving the pointers. (This could be a good test to prove that our approach works)
In the past I have used "semantically weak" elements (though my supervisor wasn't too happy with this choice of words) such as <w> or <seg>. To me, <seg> is the closest it gets to HTML's <span>. I don't think we need to leave any information about other sources in there: the collation provides those. Real life example from Early Modern Songscapes:
L638_1.xml
<lg type="stanza">
<l><hi rend="italic" xml:id="test">THesus,</hi> O <hi rend="italic">Thesus,</hi> hark! but yet in vain;</l>
<l rend="indent2">A-las <w xml:id="v1">de–-ser-ted</w> I complain;</l>
<!-- etc. -->
</lg>
BL_53723.xml
<lg type="stanza">
<l>Theseous! ô theseus! heark! but yet in vaine,</l>
<l rend="indent2">alas <w xml:id="v1">forsaken</w> I Complaine;</l>
<!-- etc. -->
</lg>
Standoff collation
<app>
<rdg wit="#BL_53723">
<ptr target="BL_53723.xml#v1"/>
</rdg>
<rdg wit="#L638_1">
<ptr target="L638_1.xml#v1"/>
</rdg>
</app>
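As a rough illustration of "resolving the pointers", here is a minimal Python sketch that follows each <ptr> in the stand-off <app> back into its witness file and collects the text of the target element. It is only a sketch under stated assumptions: the witness snippets are trimmed, ASCII-simplified copies of the two examples above, the elements are treated as un-namespaced (real TEI files would need the TEI namespace), and `resolve_pointers` is a hypothetical helper name, not part of any existing pipeline.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Trimmed, simplified copies of the two witness files shown above (assumption).
DOCUMENTS = {
    "L638_1.xml": ET.fromstring(
        '<lg type="stanza"><l rend="indent2">A-las '
        '<w xml:id="v1">de-ser-ted</w> I complain;</l></lg>'
    ),
    "BL_53723.xml": ET.fromstring(
        '<lg type="stanza"><l rend="indent2">alas '
        '<w xml:id="v1">forsaken</w> I Complaine;</l></lg>'
    ),
}

APP = ET.fromstring(
    '<app>'
    '<rdg wit="#BL_53723"><ptr target="BL_53723.xml#v1"/></rdg>'
    '<rdg wit="#L638_1"><ptr target="L638_1.xml#v1"/></rdg>'
    '</app>'
)


def resolve_pointers(app, documents):
    """Return {witness id: text of the pointed-to element} for one <app>.

    `documents` maps a file name to its parsed root element.
    """
    readings = {}
    for rdg in app.iter("rdg"):
        witness = rdg.get("wit", "").lstrip("#")
        for ptr in rdg.iter("ptr"):
            # Split "file.xml#id" into the file name and the fragment id.
            filename, _, target_id = ptr.get("target").partition("#")
            for element in documents[filename].iter():
                if element.get(XML_ID) == target_id:
                    readings[witness] = "".join(element.itertext())
                    break
    return readings


readings = resolve_pointers(APP, DOCUMENTS)
print(readings)
```

The resolved readings could then be written back out as inline <rdg> content, which is one way to regenerate the "spinal column" TEI from the stand-off collation.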
@raffazizzi @Rikkm I've written a lot of XSLT code today, and I think I've got some good output for us to work with from this commit: https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/commit/0006ae6bdab9031cb7c46312201ebbe2d6a23eeb
There are a couple of new file directories inside collateXPrep to look at now: 1) standoff-Spine: Look at this first: it's my output of standoff collation pointers into all the files. My output scoops up sga, too, just because it's going through all my witnesses--but I figure that you can transform this output by maybe mapping your pointers into it with an ID transform? (I wonder if that's easier than integrating your functions into the XSLT that generated this?)
2) bridge-P2: new edition files with <seg> elements. Currently these elements are self-closed, but they're set around variant passages.
https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/tree/Text_Processing/collateXPrep/bridge-P2
I need to take this through a new "phase 3" of the "Bridge Construction" process. My plan for phase three is to output a bridge-3 directory with edition files in which I've reconstructed the flattened tags into elements, and then re-set <seg> elements. That's going to be a lot of play with <xsl:analyze-string> and I need some sleep before I try it! I think I can get this out tomorrow, though...
My last plan is to run a new Python script over the collation output strings, probably the ones in the standoff-Spine directory, to return the largest Levenshtein distance numbers, and plant those in attributes on each <app> element. I'll then need to write some more XSLT to deliver that information to the edition files and their <seg> elements--and we can use that for giving hues of intensity of highlighted passages in the output edition files.
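A minimal sketch of that Levenshtein step, under loud assumptions: it supposes each <app> holds its reading strings inline in <rdg> children (the actual standoff-Spine files hold pointers, so the strings would have to be resolved first), it uses a plain dynamic-programming edit distance instead of any particular library, and the attribute name `n` is only a placeholder for whatever attribute ends up carrying the number. The sample readings are borrowed from the Songscapes lines quoted earlier in this thread.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                        # deletion
                cur[j - 1] + 1,                     # insertion
                prev[j - 1] + (char_a != char_b),   # substitution or match
            ))
        prev = cur
    return prev[-1]


def annotate_max_distance(root, attr="n"):
    """Store the largest pairwise distance among an <app>'s readings on it."""
    for app in root.iter("app"):
        texts = ["".join(rdg.itertext()) for rdg in app.iter("rdg")]
        largest = max(
            (levenshtein(a, b) for a, b in combinations(texts, 2)),
            default=0,  # an <app> with fewer than two readings gets 0
        )
        app.set(attr, str(largest))
    return root


# Hypothetical collation fragment with inline reading strings.
SAMPLE = (
    '<root>'
    '<app><rdg wit="#A">vain</rdg><rdg wit="#B">vaine</rdg></app>'
    '<app><rdg wit="#A">complain</rdg><rdg wit="#B">Complaine</rdg></app>'
    '</root>'
)

annotated = annotate_max_distance(ET.fromstring(SAMPLE))
distances = [app.get("n") for app in annotated.iter("app")]
print(distances)
```

Raw distances grow with passage length, so dividing by the length of the longest reading might be worth considering before mapping the numbers to highlight intensity; that is a design choice rather than something decided in this thread.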
@raffazizzi I've been tinkering with XSLT on my way home from DHSI, but figured I'd better run some questions here right now about what we want for the TEI output for our edition. Here's what I'm thinking:
One TEI output is simply an up-converted form (with TEI "trimmings") around the collation XML output we generated. I've been thinking of this as a kind of "spinal column" for our output collation. (See, for a basic idea: https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/blob/Text_Processing/collateXPrep/C10_xmlOutput/teiUp-tester.xml ) This would be the source of pointers into separate edition files.
The separate edition files: Here I'm imagining we generate these in TEI (rather than HTML) first, and that we make for each edition a TEI document holding an element that contains a "hotspot" area that indicates a place where this edition varies from the others. I know how to make this for an HTML output from TEI (it's the sort of thing I'd wrap an HTML <span> around). But I'd like to find a similar element in the TEI for holding this information. I can think of two options for this: 1) Just keep the <app> element (though this seems strange) to mark a variant passage. The problem with this is we'll have places where the up-converted single edition (when we un-flatten its paragraph tags, for example) will have conflicting hierarchies. Say this edition has a paragraph break where the others do not: its internal tagging will force us to break up the <app> element, and we'll need a way to mark this as the beginning segment of a "hotspot", to be followed by an ending segment. 2) For the individual edition files, just use a <seg> element for short segments inside paragraphs, with some appropriate attribute to indicate variance info as a hook into the other editions.
I'm planning to study the TEI options typically in use for this kind of thing in the Guidelines after I get done with an afternoon meeting today, but if you have some ideas, please post here! I'll add more as I'm working on this. Outputting the individual editions won't be too difficult once I get through parsing the flattened tags, and I hope to have something up by tonight for us to tinker with.