Closed MasonGobat closed 2 years ago
@MasonGobat Which file contains your most recent work?
@MasonGobat What would you like the output to look like? Here are a few more specific questions:
<w>. An element called <w> would normally be a single word token, one per <w> element, but each of your <w> elements contains what looks like a sentence. This confuses me about whether you want your standoff to point to words (that is, individual word tokens) or lines?
@xml:id attributes). This means that if you want to point to lines, you'll want to tag lines and add @xml:id attributes to them; if you want to point to individual word tokens (so that, for example, an insult can begin in the middle of one line and end in the middle of another), you'll want to tag all of the words and put identifiers on them. If you want to copy the insult text from the body into a separate section instead of using pointers we could do that, but I wouldn't choose that option if this were my project. When you eventually generate HTML output you might create a list of insults then, but I'd save that copying of words for the XML-to-HTML stage, and I'd use pointers in the standoff I'd create for this XML-to-XML transformation.
We can add the identifiers with XSLT as part of the transformation that creates the standoff markup, so you don't have to do it manually. For that matter, it's best not to do it manually, since it's the sort of task that computers do better than humans.
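As a minimal sketch of adding identifiers with XSLT rather than by hand: the stylesheet below copies a document unchanged except that it gives every line an @xml:id. It assumes XSLT 3.0 (for <xsl:mode>), a hypothetical line element named <l> in no namespace, and that the lines do not already carry identifiers; none of those details come from the project's files.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy every node unchanged unless a more specific template matches. -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Assumption: <l> marks a line and has no @xml:id yet. -->
  <xsl:template match="l">
    <xsl:copy>
      <!-- Builds values such as l1, l2, ... by counting <l> elements
           in document order across the whole document. -->
      <xsl:attribute name="xml:id">
        <xsl:text>l</xsl:text>
        <xsl:number level="any"/>
      </xsl:attribute>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The same pattern would apply to word tokens by matching the word element instead and choosing a different prefix.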
@djbpitt The file that contains my most recent work is markedPlaysConversionFinal. That one is tailored to the actual plays instead of the test that I was originally working from. Also, I would like the standoff to include all of the words and punctuation, so that when we transform it into HTML it is as simple as grabbing all the standoff insults and applying templates to get a basic reading view. I don't know how easy the transformation would be otherwise for a reading view. Obviously we are trying to get something that is both easy to understand and easy to transform further. I can make no assurances about continuous insults either. I think that answers most of your questions; if you have more, or there is something I didn't quite answer, just ask again and I will try my best to answer.
@MasonGobat Take a look at markedPlaysConversion-djb.xsl. It copies the play as is, but it creates a second <div> inside the <front>, after the cast list, that contains copies of the insults. I'm not entirely happy with this approach, but I was less happy with an alternative I came up with. I've made some inquiries and I'll let you know if anything better emerges, but in the meanwhile, can you please check this and see whether it does what you want? I ran it against Hamlet and the results looked credible, but you know your data much better than I do. I included comments in the code, but you'll want to look up <xsl:for-each-group> if you aren't already familiar with it, and also the difference between <xsl:copy> (makes a shallow copy) and <xsl:copy-of> (makes a deep copy). Please let me know if anything is unclear.
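For readers unfamiliar with <xsl:for-each-group>, here is a rough sketch of how grouping can collect insults that cross line boundaries. It is not the code in markedPlaysConversion-djb.xsl. It assumes XSLT 3.0, that insultStart and insultEnd are empty marker elements in no namespace, that insults do not nest, and it invents the names <body>, <insults>, and <insult> for illustration. For simplicity it writes the copies at the start of <body> rather than in <front>.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy every node unchanged unless a more specific template matches. -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Assumption: <body> wraps the text of the play. -->
  <xsl:template match="body">
    <!-- xsl:copy is shallow: it copies only the <body> element itself,
         so attributes and children must be processed explicitly. -->
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <insults>
        <!-- Flatten the body to text nodes and markers in document order,
             then start a new group at every marker. Line boundaries do not
             matter because the selection ignores the element hierarchy. -->
        <xsl:for-each-group
            select=".//(text() | insultStart | insultEnd)"
            group-starting-with="insultStart | insultEnd">
          <!-- The context item is the first item in the group; only
               groups that open with insultStart hold insult text. -->
          <xsl:if test="self::insultStart">
            <insult>
              <xsl:value-of select="normalize-space(
                  string-join(current-group()[self::text()], ''))"/>
            </insult>
          </xsl:if>
        </xsl:for-each-group>
      </insults>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Where an entire subtree should be reproduced unchanged, <xsl:copy-of select="."/> would do that in one step, whereas <xsl:copy> needs the explicit <xsl:apply-templates> calls shown here.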
@MasonGobat Note that because we copy insult words instead of pointing to them, the copies repeat the @id values of the originals. That's okay as long as we don't have to validate against a schema that declares those attributes to be of type xsd:ID, which must be unique in the document. If we want them to be unique and we want to copy, we can change the @id values in the copies.
I had occasion to discuss the insult tagging this evening with the former head of the W3C XML action group, and he said that we were probably being overly scrupulous about avoiding the long horizontal axes (following:: and preceding::). They are, indeed, less efficient, but XPath processors typically optimize for a predicate value of [1] because that's such a common idiom, so the expression will find the delimiter quickly and not look beyond it. We can experiment with this if the grouping approach doesn't work out for some reason, and perhaps even if it does, just to learn more about the performance.
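To make the [1] idiom concrete, here is a sketch of the long-axis alternative, offered only as an experiment to compare against the grouping approach. It assumes XSLT 3.0, empty insultStart and insultEnd marker elements in no namespace, and insults that do not nest; the <insultText> wrapper is a made-up name.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy every node unchanged unless a more specific template matches. -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- preceding:: is a reverse axis, so [1] selects the nearest marker
       before the text node. If that marker is an insultStart, the text
       lies inside an insult; if it is an insultEnd (or there is no
       marker at all), it does not. -->
  <xsl:template match="text()[normalize-space()]
      [preceding::*[self::insultStart or self::insultEnd][1]
          [self::insultStart]]">
    <insultText>
      <xsl:value-of select="."/>
    </insultText>
  </xsl:template>

</xsl:stylesheet>
```

Whether a given processor actually stops at the first marker is an implementation matter, so timing this against a full play would be the way to find out.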
@djbpitt I do believe that your uploaded version works. I knew a few places to check to ensure it was grabbing insults that run past a single line, and it passed. If I now understand what perhaps I didn't when we first started this project, then I think we did our markup completely wrong for doing standoff. Perhaps not, but from your explanation I think we should have been adding extra attributes instead of tags. It is entirely possible that I am still missing some niche point, but thankfully what you created produces the standoff that I and the rest of the group would like to work with. I would perhaps have liked to remove the insultStart and insultEnd tags in the pseudo-standoff, but I could easily accomplish this myself if the group would like us to do so. Because your code works, I am going to close this issue with this comment. If you feel the need to continue this thread, I don't know whether there is a way to reopen the issue, but you could start a new one with a similar premise.
While writing the transformation from our markup to standoff, I got it completely working except that it does not work for insults that span lines, because they are on a different axis. The only fix that I (and Caroline) could think of is to use the preceding and following axes, but they go through the entire document many times and are very slow. We don't know whether there is a better way, or a way at all. We (@cngish98 @MasonGobat) were wondering if you (@djbpitt) could help.