FrankensteinVariorum / fv-collation

First-stage collation processing in the Frankenstein Variorum Project. For post-processing and Variorum development, see our GitHub organization: https://github.com/FrankensteinVariorum
https://frankensteinvariorum.github.io/fv-collation/
GNU Affero General Public License v3.0

empty mod elements in SGA Notebooks #25

Closed ebeshero closed 6 years ago

ebeshero commented 7 years ago

Hi @raffazizzi: I'm working with the SGA files now and trying to get them into a good working state so @Rikkm and I can proceed with marking things like words broken at the ends of lines prior to collation. I've compiled the full c56 and c57 notebooks together as a single file (via canonicalize in oXygen as you advised), and I located and associated the ODD-generated SGA schema (shelley_godwin_odd.rng) with the file. The weirdness I'm now seeing is 153 instances of the same validation error, one everywhere there's an empty <mod> element. The error message reads: "The mod element is intended to group a series of related changes to the manuscript. Thus, mod must have more than one child element. If only a single addition or deletion is being encoded, mod is not required."

Here's a sample of the code that triggers the error:

<mod spanTo="#c56-0022.04"></mod>

Not all of the <mod> elements are coded this way, and I think most aren't causing a problem (they're usually wrapping around a combination of <del> and <add>, as we'd expect).

I'd like to deal with the validation problem if we can. I imagine the empty <mod> elements must be relevant, though I'm not sure how to read them (they seem to appear as immediately preceding siblings to a single <del> from what I've seen scrolling through).
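One way to check that observation across a whole file, rather than by scrolling, is to list every empty <mod> together with the element that follows it. This is a minimal sketch, not part of the project's tooling: it assumes the notebooks use the standard TEI namespace, the function name `find_empty_mods` is hypothetical, and the `<line>` wrapper and its contents in the sample are invented for illustration (only the empty <mod> itself comes from the file).

```python
import xml.etree.ElementTree as ET

# Assumption: the S-GA files declare the standard TEI namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_empty_mods(root):
    """Return (spanTo, name of next sibling or None) for each empty <mod>."""
    results = []
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.tag != f"{{{TEI_NS}}}mod":
                continue
            has_text = bool((child.text or "").strip())
            if len(child) == 0 and not has_text:
                nxt = children[i + 1] if i + 1 < len(children) else None
                nxt_name = local_name(nxt.tag) if nxt is not None else None
                results.append((child.get("spanTo"), nxt_name))
    return results


# Illustrative sample only: one empty <mod> followed by a <del>,
# and one ordinary <mod> wrapping <del> and <add>.
sample = (
    '<line xmlns="http://www.tei-c.org/ns/1.0">'
    '<mod spanTo="#c56-0022.04"></mod><del>old</del>'
    "<mod><del>a</del><add>b</add></mod>"
    "</line>"
)
hits = find_empty_mods(ET.fromstring(sample))
print(hits)
```

Tallying the second item of each pair over the compiled c56/c57 file would show whether the empty <mod> elements always precede a single <del>, or whether other patterns occur.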

I'd like to fix this if we can b/c I'd like the files to be in a fully valid "oXygen green" state as we comb through them prepping for collation--so if/when we introduce new errors it's immediately evident. Should we modify the ODD that generated this schema to deal with these mod elements, or is there something else we can do?

What we're planning is to add <w> or <seg> elements to those partial words that break over lines and require some flag for us to combine them. (I need to investigate a little more which element works best here. I don't want to add something that requires a custom ODD rule, and I doubt we'll need to.) But I do want us to start with a fully valid workspace if we can, since we need to do some ornate work--it'll help to see what we're doing as soon as we introduce a new error. Let me know what you think is best to do about these <mod> elements.

ebeshero commented 7 years ago

@raffazizzi @Rikkm A simple patch would be for me to comment out all the empty mods, so they're still there but not firing errors while we work. Is that the best thing to do?
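For reference, the comment-out patch could be scripted roughly as follows. This is a hedged sketch under stated assumptions, not the approach actually taken in the repo: the function name `comment_out_empty_mods` is hypothetical, it works on the raw text with a regular expression rather than an XML parser so the rest of the file is left byte-for-byte unchanged, and it assumes the elements are written without a namespace prefix and that no attribute value on an empty <mod> contains `--` (which would make the resulting comment malformed).

```python
import re

# Matches <mod .../> or <mod ...></mod> (optionally with whitespace inside),
# skipping any that are already wrapped in a comment by a previous run.
EMPTY_MOD = re.compile(r"(?<!<!--)<mod\b[^<>]*?(?:/>|>\s*</mod>)")


def comment_out_empty_mods(xml_text):
    """Wrap each empty <mod> in an XML comment; return (new_text, count)."""
    return EMPTY_MOD.subn(lambda m: f"<!--{m.group(0)}-->", xml_text)


# Illustrative sample only: the <line> wrapper and <del>/<add> content
# are invented; the empty <mod> is the one quoted in this issue.
sample = (
    '<line><mod spanTo="#c56-0022.04"></mod><del>x</del>'
    "<mod><del>a</del><add>b</add></mod></line>"
)
patched, count = comment_out_empty_mods(sample)
print(count)
print(patched)
```

Because the original markup survives inside the comment, the change is reversible by stripping the `<!--` and `-->` around each commented <mod>.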

ebeshero commented 7 years ago

@raffazizzi @Rikkm One more observation: I've only got the ODD Schematron rules applying on the files, which seems to get me a nearly clean workspace. (The c58 file is perfectly green this way.)

<?xml-model href="sga_schemata/shelley_godwin_odd.rng" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>

But this is a little brittle b/c we don't have TEI constraints on right now. As soon as I apply the RelaxNG schema line or a TEI-All schema line, I see lots of validation trouble, mostly on TEI attributes. I imagine there's just some stuff to be poked at in the ODD.

Well, what matters for us right now is just to have a good workspace for collation prep. And Rikk and I will need to add one element as a flag for the collation process. I should really set a schema constraint on it, and the easiest way for me, I think, is just to write a separate little Schematron patch.

The most serious issue for us right now is clearing a path around those errors on the empty <mod> elements--what makes the most sense for us here?

raffazizzi commented 7 years ago

Sadly, because of decisions made before my time at MITH, S-GA files use a handful of non-TEI attributes (I think only on <surface>) that we can't easily get rid of because of our software pipeline. They're in their own namespace, so while the files are conformant, TEI_all won't validate them.

I wonder if those empty mods are really necessary. They would only be necessary for aggregating <delSpan> or <addSpan> with another element. I'll have a look this week. @ebeshero good luck with the talk!

ebeshero commented 7 years ago

@raffazizzi thanks for taking a look...my sense was we might just want to revisit the project ODD to address the rng schema line trouble, but I didn't locate the ODD itself in the flurry of conference activity. We're on this afternoon--lots of slides to talk about! :-) (I am not talking about this stuff at all--just the general approach and issues of resolving the collation on two diff TEI encoding methods.)

ebeshero commented 7 years ago

@raffazizzi @Rikkm I'm reading the sga ODD files now--really helpful documentation! I'm also creating in this (Pgh Frankenstein) repo new files that validate (hopefully) to SGA project specs as we work, and I'd like to add a rule or two to guide our markup of word boundaries. I'll start a new issue on that. This is basically the first stage of distinguishing the Pgh_Frankenstein working files for collation from the SGA files themselves. Ultimately we'll be creating a set of files with markup that guides collation around word tokens (so we can reliably locate word boundaries around add, delete and line markup). @Rikkm and I agreed in our late July meeting, after some intensive document analysis, that we'll really need to do this work carefully by hand (it seems the most reliable way given the complexity of the diplomatic markup). I need to make sure we have some clear schema rules in place to guide the work on our end.