Stand-off Transformation does not account for insults that span lines

MasonGobat commented 2 years ago

While writing the transformation from our markup to stand-off, I got it working completely except for insults that span lines, because the rest of the insult is then on a different axis. The only fix that Caroline and I could think of is to use the preceding and following axes, but they traverse the entire document many times, which is very slow. We don't know whether there is a better way, or any way at all. We (@cngish98 @MasonGobat) were wondering whether you (@djbpitt) could help.

djbpitt commented 2 years ago

@MasonGobat Which file contains your most recent work?

djbpitt commented 2 years ago

@MasonGobat What would you like the output to look like? Here are a few more specific questions:

The sample in standoffTest.xsl contains speeches (so far, so good), which contain elements of type <w>. An element called <w> would normally hold a single word token, one per <w> element, but each of your <w> elements contains what looks like a sentence. That leaves me unsure whether you want your standoff to point to words (that is, individual word tokens) or to lines.
When we discussed standoff originally, what I thought you wanted was to create elements that would point to the contents of each insult, and to do that the items you point to would need to have unique identifiers (I'd recommend @xml:id attributes). This means that if you want to point to lines, you'll want to tag lines and add @xml:id attributes to them; if you want to point to individual word tokens (so that, for example, an insult can begin in the middle of one line and end in the middle of another), you'll want to tag all of the words and put identifiers on them. If you want to copy the insult text from the body into a separate section instead of using pointers we could do that, but I wouldn't choose that option if this were my project. When you eventually generate HTML output you might create a list of insults then, but I'd save that copying of words for the XML-to-HTML stage, and I'd use pointers in the standoff I'd create for this XML-to-XML transformation.
One more detail: If we use pointers, instead of copying, we can either point to all of the words (or lines) of the insult or just the first and last ones. The first and last are fine if your insults are continuous, but they won't be if you have interrupted speech, such as a situation where (made-up example, which you could probably have guessed :-)) Hamlet starts to insult Polonius, interrupts himself to say something to Ophelia, and then resumes his insult. Or when Ophelia interrupts Hamlet in the middle of his insulting Polonius and he then picks up where he left off after the interruption. If you can't be confident that the insult words will be continuous, it's best to point to all of them, and not just to the first and last, since working around the interruptions would be messy. It might be better to point to all of the words anyway (I'll need to think about that), but it would be helpful to know whether we have a choice.
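To illustrate the difference between pointing to every word and pointing only to the first and last, here is a minimal Python sketch. The token identifiers and the helper expand_range are hypothetical, invented for this example; nothing here comes from the project files.

```python
# Hypothetical token ids in document order. "w4" and "w5" stand for an
# interruption (another speaker, or an aside) in the middle of the insult.
tokens = ["w1", "w2", "w3", "w4", "w5", "w6", "w7"]
insult_ids = ["w2", "w3", "w6", "w7"]  # the words actually tagged as insult


def expand_range(first, last, order):
    """Return every id from first to last, inclusive, in document order."""
    start, end = order.index(first), order.index(last)
    return order[start:end + 1]


# First-and-last pointers: resolving the range sweeps in the interruption.
from_endpoints = expand_range(insult_ids[0], insult_ids[-1], tokens)

# Pointing at every word: the targets are exactly the insult words.
from_full_list = list(insult_ids)

print(from_endpoints)  # ['w2', 'w3', 'w4', 'w5', 'w6', 'w7']
print(from_full_list)  # ['w2', 'w3', 'w6', 'w7']
```

With only the endpoints, anything that consumes the standoff has to work out for itself which tokens in the range are not part of the insult; with the full list, no such filtering is needed.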

We can add the identifiers with XSLT as part of the transformation that creates the standoff markup, so you don't have to do it manually. For that matter, it's best not to do it manually, since it's the sort of task that computers do better than humans.
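As a rough illustration of generating identifiers automatically, here is a small Python sketch that numbers every <w> element in document order and stores the number as @xml:id. The sample markup, the "w" prefix, and the function name add_ids are assumptions made for the example. In the project itself this step would more naturally live in the XSLT that creates the standoff.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"
XML_ID = f"{{{XML_NS}}}id"  # ElementTree's spelling of the xml:id attribute

# Hypothetical sample: one word token (or punctuation mark) per <w>.
SAMPLE = """<sp who="Hamlet">
  <l><w>Thou</w> <w>art</w> <w>a</w></l>
  <l><w>fishmonger</w><w>.</w></l>
</sp>"""


def add_ids(root, tag="w", prefix="w"):
    """Number every <tag> element in document order and store it as xml:id."""
    for n, el in enumerate(root.iter(tag), start=1):
        el.set(XML_ID, f"{prefix}{n}")
    return root


speech = add_ids(ET.fromstring(SAMPLE))
ids = [w.get(XML_ID) for w in speech.iter("w")]
print(ids)  # ['w1', 'w2', 'w3', 'w4', 'w5']
print(ET.tostring(speech, encoding="unicode"))
```

Because the numbering follows document order, the identifiers also record the sequence of the tokens, which is convenient when checking that a set of pointers is in order.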

MasonGobat commented 2 years ago

@djbpitt The file that contains my most recent work is markedPlaysConversionFinal. That one is tailored to the actual plays rather than the test file I was originally working from. I would like the stand-off to include all of the words and punctuation, so that when we transform it into HTML it is as simple as grabbing all of the stand-off insults and applying templates to get a basic reading view. I don't know how easy the transformation to a reading view would be otherwise. We are trying to get something that is both easy to understand and easy to transform further. I can make no assurances that insults are continuous, either. I think that answers most of your questions; if you have more, or if there is something I didn't quite answer, just ask again and I will do my best.

djbpitt commented 2 years ago

@MasonGobat Take a look at markedPlaysConversion-djb.xsl. It copies the play as is, but it creates a second <div> inside the <front>, after the cast list, that contains copies of the insults. I'm not entirely happy with this approach, but I was less happy with an alternative I came up with. I've made some inquiries and I'll let you know if anything better emerges, but in the meanwhile, can you please check this and see whether it does what you want? I ran it against Hamlet and the results looked credible, but you know your data much better than I do. I included comments in the code, but you'll want to look up <xsl:for-each-group> if you aren't already familiar with it, and also the difference between <xsl:copy> (makes a shallow copy) and <xsl:copy-of> (makes a deep copy). Please let me know if anything is unclear.
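For readers who want the idea behind the grouping without opening the stylesheet, here is a Python analogue. It is not the code in markedPlaysConversion-djb.xsl. It assumes, hypothetically, that insultStart and insultEnd are empty milestone elements sitting among <w> tokens inside <l> lines. The point it shows is that a single pass in document order collects an insult even when it crosses a line boundary, without any long-axis lookups.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the insult starts in one <l> and ends in the next.
SAMPLE = """<sp>
  <l><w>Well</w> <insultStart/><w>you</w> <w>are</w></l>
  <l><w>a</w> <w>fishmonger</w><insultEnd/> <w>sir</w></l>
</sp>"""


def collect_insults(root):
    """Walk the tree once in document order, gathering the <w> elements
    that fall between each insultStart/insultEnd pair."""
    insults, current = [], None
    for el in root.iter():
        if el.tag == "insultStart":
            current = []                 # open a new group
        elif el.tag == "insultEnd":
            if current is not None:
                insults.append(current)  # close the group
            current = None
        elif el.tag == "w" and current is not None:
            current.append(el)           # inside an insult: keep the token
    return insults


root = ET.fromstring(SAMPLE)
for words in collect_insults(root):
    print(" ".join(w.text for w in words))  # you are a fishmonger
```

Line boundaries play no role in the loop, which is why insults that span lines need no special handling here.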

djbpitt commented 2 years ago

@MasonGobat Note that because we copy insult words instead of pointing to them:

We have duplicate @id values. That's okay as long as we don't have to validate against a schema that declares those attributes to be of type xsd:ID, which must be unique in the document. If we want them to be unique and we want to copy, we can change the @id values in the copies.
Copying isn't the same as stand-off, which uses pointers. That isn't a problem, and if the copies do what you want and you like the way they work, you should go with them. But the concept of stand-off assumes that you're tagging things in the regular content, but doing it with markup that points to the content, instead of wrapping it in start- and end-tags.
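If unique identifiers turn out to matter, changing the values in the copies is a small step. The Python sketch below is illustrative only: the sample document, the "-copy" suffix, and the plain id attribute are assumptions. It deep-copies each word into a separate <div>, renames the identifier on the copy, and then checks that no value occurs twice.

```python
import copy
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample: an empty <div> in <front> that will receive the copies.
SAMPLE = """<TEI>
  <front><div type="insults"/></front>
  <body><l><w id="w1">thou</w> <w id="w2">knave</w></l></body>
</TEI>"""

root = ET.fromstring(SAMPLE)
insult_div = root.find("./front/div")

for w in root.findall("./body//w"):
    clone = copy.deepcopy(w)  # deep copy: attributes, text, and any children
    clone.tail = None         # drop the whitespace that trailed the original
    clone.set("id", w.get("id") + "-copy")  # hypothetical renaming scheme
    insult_div.append(clone)

# Any id value that occurs more than once would break xsd:ID validation.
counts = Counter(el.get("id") for el in root.iter() if el.get("id"))
duplicates = [value for value, n in counts.items() if n > 1]
print(duplicates)  # []
```

Keeping the original value as a prefix means each copy can still be traced back to the word it came from.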

I had occasion to discuss the insult tagging this evening with the former head of the W3C XML action group, and he said that we were probably being overly scrupulous about avoiding the long horizontal axes (following:: and preceding::). They are, indeed, less efficient, but XPath processors typically optimize for a predicate value of [1] because that's such a common idiom, so the expression will find the delimiter quickly and not look beyond it. We can experiment with this if the grouping approach doesn't work out for some reason, and perhaps even if it does, just to learn more about the performance.
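The short-circuit idea can be pictured with a small Python sketch. This is an illustration of stopping at the first match, not a description of how any particular XPath processor is implemented, and the sample markup (empty insultStart and insultEnd milestones among empty <w> elements) is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: 1000 short insults, each closed by a milestone.
SAMPLE = "<sp>" + "<w/><insultStart/><w/><w/><insultEnd/>" * 1000 + "</sp>"
root = ET.fromstring(SAMPLE)
nodes = list(root.iter())  # every element, in document order
start = nodes.index(root.find("insultStart"))  # position of the first start


def first_following(nodes, start, tag):
    """Return (element, number of nodes inspected) for the first element
    named `tag` after position `start`, stopping as soon as it is found."""
    inspected = 0
    for position in range(start + 1, len(nodes)):
        inspected += 1
        if nodes[position].tag == tag:
            return nodes[position], inspected
    return None, inspected


end, inspected = first_following(nodes, start, "insultEnd")
print(end.tag, inspected, len(nodes))  # insultEnd 3 5001
```

Only three nodes are examined out of more than five thousand, because the search ends at the nearest delimiter. An unoptimized evaluation that first built the whole following sequence and then took its first item would touch nearly all of them.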

MasonGobat commented 2 years ago

@djbpitt I believe your uploaded version works. I knew a few places to check to make sure it was capturing insults past a single line, and it passed. If I now understand what I perhaps didn't when we started this project, I think our markup was the wrong shape for stand-off: from your explanation, we should perhaps have been adding attributes instead of tags. It is entirely possible that I am still missing some finer point, but thankfully what you created produces the output that the group and I would like to work with. I would have liked to remove the insultStart and insultEnd tags from the pseudo-standoff, but I can easily do that myself if the group wants it. Given the success of your code I am going to close this issue with this comment. If you feel the need to continue this thread and the issue can't be reopened, you could start a new one with a similar premise.

EmilyMartin42 / Shakespearean_Insults
