How to deliver normalized and original tokens in XML output?

ebeshero commented 2 years ago

@Arithmeticus I remember you telling me it would be easy to show the original strings at each witness together with the normalized string. I'm having trouble figuring out where and how to do this.

So far I have only been tinkering with TAN-fn-strings-collate-standard.xsl, which seems to be the XSLT that generates the c's and u's and txt that shares the normalized strings. I keep trying to output a text node for <tan:wit>, but it seems not to have one... What do I need to invoke to output the original string for each witness at the point of a <u> or <c> ?

You mentioned here (a couple of months ago) that you were tinkering with something that was, indeed, delivering the original strings. So I can see it's possible; I'm just having trouble figuring out where to intervene to deliver that.

ebeshero commented 2 years ago

@Arithmeticus I also remember your telling me that simply running tan:collate() on strings would show me the original strings together with their normalized counterparts at each witness. But here's my issue: I'm collating to include a lot of markup from my source documents, so I'm actually using the input and processing and normalizing/replacement parameters you've tidily set up in Diff+.xsl. I suppose I could do a lot of pre-processing outside of the TAN library, but these parameters with their <replace> patterns are super convenient!

What I'm trying to do is run with parameters according to the Diff+.xsl stylesheet after eliminating all the HTML processing stuff deeper down. (I did get that working easily enough.) I hope I'm not going about this all wrong, but those parameters did deliver me a way to pre-process my XML and treat its tags as text that I could normalize. So I hope there's a way to dig the original strings out, following the Diff+ method of approach to TAN's function library.

Arithmeticus commented 2 years ago

hi @ebeshero -- you must be in the throes of post-deadline-Balisage-paper writing. Like me.

tan:wit has no text node because it just provides metadata on the preceding-sibling::tan:txt. That's where the tan:collate() model differs from the Frankenstein approach. I like that approach, but I would need to study it more closely before considering how to incorporate it as an option in tan:collate(). (I have ideas.)

So you can't do it with tan:collate() directly. TAN Diff+ is doing it because it (1) memoizes the original strings, (2) runs the normalized form of those strings through tan:collate(), then (3) uses tan:replace-collation() to re-infuse the output from 2 with the original form of 1. It's all an admittedly somewhat convoluted process that deserves revisiting. Of course the original task, without considering any code, is already a complicated affair.
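The three steps can be sketched outside XSLT. This is only an illustration in Python, not TAN's code: `normalize` and `collate_with_originals` are hypothetical names, the normalization rule is invented, and Python's `difflib.SequenceMatcher` stands in for tan:collate() even though it uses a different algorithm and handles only two witnesses.

```python
from difflib import SequenceMatcher


def normalize(token: str) -> str:
    """Hypothetical normalization rule: lowercase and treat u/v alike."""
    return token.lower().replace("u", "v")


def collate_with_originals(wit_a: str, wit_b: str) -> list:
    # (1) Memoize: keep the original tokens so they can be restored later.
    orig_a, orig_b = wit_a.split(), wit_b.split()
    # (2) Compare only the normalized forms of those tokens.
    norm_a = [normalize(t) for t in orig_a]
    norm_b = [normalize(t) for t in orig_b]
    matcher = SequenceMatcher(a=norm_a, b=norm_b, autojunk=False)
    rows = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        # (3) Re-infuse: the token indices of the normalized alignment are
        # reused to pull the original tokens back in beside the normalized ones.
        rows.append({
            "common": tag == "equal",
            "norm": {"A": norm_a[a1:a2], "B": norm_b[b1:b2]},
            "orig": {"A": orig_a[a1:a2], "B": orig_b[b1:b2]},
        })
    return rows


rows = collate_with_originals("Vnto the END", "unto the last end")
for row in rows:
    print(row)
```

The re-infusion works here only because normalization never changes the number of tokens, so the same indices address both forms. A normalization that merges or splits tokens would need an explicit index map.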

One challenge with TAN Diff+ is that (if I recall correctly) there are two stages of normalization. The first round gets ignored by any later comparison. It's like getting rid of the stuff you never want to see again. Only the second round is picked up by that later de-normalization process. Perhaps you're making changes in the fuggeddabouddit stage.
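To illustrate why the stage matters, here is a small Python sketch of a two-stage pipeline. It is an assumption-laden toy, not TAN Diff+'s parameters: the function names and both regex rules are hypothetical. The point is only that whatever the first stage removes is gone before the restorable form is recorded.

```python
import re


def discard_stage(text: str) -> str:
    """Stage 1 (hypothetical rule): drop material never wanted again."""
    return re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)


def comparison_stage(text: str) -> str:
    """Stage 2 (hypothetical rule): normalization used only for comparing."""
    return re.sub(r"\s+", " ", text).strip().lower()


def prepare(raw: str) -> dict:
    # The raw input is not carried forward; only the stage-1 result is kept.
    kept = discard_stage(raw)
    return {"restorable": kept, "compared": comparison_stage(kept)}


prepared = prepare("The  <!-- editor's note -->Creature   SPOKE")
print(prepared["restorable"])  # what a later de-normalization could bring back
print(prepared["compared"])    # what the comparison actually sees
```

In this sketch the capitalization and spacing survive in the restorable form, while the comment does not. A replacement placed in the first stage would likewise be invisible to any later step that restores originals.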

I'll look a bit closer after I get my paper written this weekend.

ebeshero commented 2 years ago

@Arithmeticus I think (if I understand correctly) that I might well be trying to intervene in the "fuggeddabouddit" stage...! So the question is, how did you get that pretty output a couple of months ago? I'd be happy to work with that if I could get it after I do my elaborate normalization regime...

Here's the deal with Frankenstein: we need the original text b/c we're actually constructing the output edition from the critical apparatus. Even if I didn't want to do that (and I really do; I have a post-processing XSLT pipeline all built and it works to read the critical apparatus and reconstruct each witness while storing the collation data)...but even if I didn't and I just cared about the collation output of normalized tokens, I would still want to read the original against the normalized stuff. It's because of all the complicated regex replacement patterns I need to set in place because I'm collating the markup and the text. It's a practical need to make sure my normalization is doing what I need it to do. I need to be able to see them both together.

Irony! I had to call for help from the collateX developers (specifically @djbpitt who was handling the TEI and XML output from collateX), to get them to help me see the normalized tokens instead of only the original text in the critical apparatus. They are like your polar opposites or something (lol). It's funny that in the TAN universe, the normalized versions prevail!

I think I had better make a big point in my paper to each set of developers that, hey, seriously, we need to see both the original strings together with the normalized tokens in collation output in order to be able to review our work and build things from it. :-) Anyway, thanks for your wonderfully intricate work on all this! I don't mean to sound ungrateful--I'm just lost in the TAN labyrinth and eager to get it working.

Arithmeticus commented 2 years ago

I get lost in that TAN labyrinth too, sometimes, so I sympathize! And I totally agree, an output that can capture both the original and the normalized commonality would be a boon.

An important side question. When you get <wit ref="A" pos="1234"/>, the @pos reflects the position as tan:collate() spat the product out. I suppose a second attribute is required specifying the pre-normalized position?

If you share a scratch of the files you're working with, I can make some suggestions, and look at ways of making TAN Diff+ a bit easier to use. In the last 10 days I've been making some major changes to tan:collate(), including the introduction of a mode that cuts processing time significantly, without loss in quality.

ebeshero commented 2 years ago

@Arithmeticus Okay! Those new developments for tan:collate() sound very exciting!

You are asking about @pos, and my answer to you is: that attribute value is a mystery to me. I can see that @pos gives me a position marker, but that's generated by TAN and as far as I can tell, whether it's a pre-normalized or post-normalized position, it seems like it's not something I can use if I were to go into my own source files and go looking for the passage in question. Or is it? (Sure, the pre-normalized position might be helpful...hmm. But what would it mean in my project, where I'm collating the markup with the text? It would refer to a position in some flattened version of my XML that's reading the angle-brackets as text, right?) If I can use that @pos information programmatically and I knew exactly what it meant, maybe I could work with it to reconstruct my source witness? But right now it's just kind of extra information that confuses me. :-/

Okay: my files! I've just been reorganizing this repo and digging in to TAN to create a serious workspace for Frankenstein. At the moment, we're digging in very deliberately to a small set of collation "chunks" to try to compare what happens in collateX vs. tan:collate. So let me point you to the files I've been working with most recently:

Current inputs:
- (very short): collation units 11b and 11c: https://github.com/FrankensteinVariorum/TAN-2021/tree/master/applications/Diff%2B/fv-source-fewTinyChunks11 I'm already having trouble trying to tell TAN to collate two sets of files and get me two outputs based on the parameter input system: I'm intending to collate all the witnesses that share C11b in their filenames and generate one XML file output, and then collate all the witnesses that share C11c and get another XML output. The parameter inputs aren't helping me and I'm stuck there, but at least the system output C11b for me.
- Longer input that I'd like to process, generating one XML output for each "chunk" identifier in the filename: (C11a through C11o): https://github.com/FrankensteinVariorum/TAN-2021/tree/master/applications/Diff%2B/fv-source-tinyChunks11
- The entire Chunk11 (not subdivided into a-o): https://github.com/FrankensteinVariorum/TAN-2021/tree/master/applications/Diff%2B/fv-source-bigChunk11
TAN XSLT:
- I'm currently working in fv-tanCollate-xmlOut.xsl: https://github.com/FrankensteinVariorum/TAN-2021/blob/master/applications/Diff%2B/fv-tanCollate-xmlOut.xsl
- I started making modifications deeper in the labyrinth, probably best summarized if you scroll to the end of this commit: https://github.com/FrankensteinVariorum/TAN-2021/commit/b06b213fb38b00cc4e1f63d9ba100b888369eeff#diff-97e856be3979547816f3e57e3fc71c313ed8e8737927abf19948f2f62d3d6ac0
- I ran into a weird little problem here: functions/TAN-function-library.xsl
- Here's where I started digging in to try to alter the tan collate output: functions/strings/TAN-fn-strings-collate-standard.xsl
Output is all in the applications/Diff+ directory: The current output is just for "fewTinyChunks11" and is here: https://github.com/FrankensteinVariorum/TAN-2021/tree/master/applications/Diff%2B/fv-collation-fewTinyChunks11 (Yeah, one of the files here has a regex filename and you can guess why: I was trying to figure out how to tell it to name the output file and pointed it to a regex parameter...sigh).

ebeshero commented 2 years ago

^^^ updated the above to point you to the output.

Arithmeticus commented 2 years ago

Very briefly, attribute @pos specifies where the given tan:txt has appeared in the witness, as received by tan:collate(). Because the function is totally oblivious to any other versions of the strings, that's the only thing it can put down for @pos. Being orthogonal, the function cannot (and should not) be concerned with previous forms of the input strings.

The user of TAN Diff+, however, who is thinking primarily about an original string and not about its normalized form for the basis of comparison, may not find the current @pos informative because it points to a position within an artificial string.
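One way to relate the two kinds of position is to record, while normalizing, where each output character came from. The following Python sketch shows the idea under stated assumptions: positions are 1-based character offsets, a single regex replacement is applied, and `normalize_with_positions` is a hypothetical helper, not a TAN function.

```python
import re


def normalize_with_positions(original: str, pattern: str, replacement: str):
    """Apply one regex replacement and record, for every character of the
    result, the 1-based position in `original` that it came from."""
    out_chars, origins = [], []
    cursor = 0
    for m in re.finditer(pattern, original):
        # Characters before the match are copied unchanged.
        for i in range(cursor, m.start()):
            out_chars.append(original[i])
            origins.append(i + 1)
        # Every replacement character points at the start of the match.
        for ch in m.expand(replacement):
            out_chars.append(ch)
            origins.append(m.start() + 1)
        cursor = m.end()
    # Copy whatever follows the last match.
    for i in range(cursor, len(original)):
        out_chars.append(original[i])
        origins.append(i + 1)
    return "".join(out_chars), origins


normalized, origins = normalize_with_positions("abcdvvxyz", r"vv", "v")
normalized_pos = normalized.index("x") + 1   # position in the artificial string
original_pos = origins[normalized_pos - 1]   # position in the source string
print(normalized, normalized_pos, original_pos)
```

Here "x" sits at position 6 in the normalized string but at position 7 in the original. Chaining several replacements would mean composing one such map per replacement.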

Arithmeticus commented 2 years ago

Where you're changing the tan:collate() output...I think you're on the right track. I am still thinking about this issue. You are concerned about recording the prenormalized form for every string. And of course, you're thinking of one such prenormalized form. But will there be users who have two or more prenormalized forms, from different stages, that they will want to collate within the same tree structure? If so, then the text shouldn't be put directly into tan:wit but into a child element that specifies the type of reading. Something like:

   <c>
      <txt>v</txt>
      <wit ref="1" pos="3">
         <version type="diplomatic" pos="5">vv</version>
         <version type="corr" pos="5">v</version>
      </wit>
      <wit ref="2" pos="2">
         <version type="diplomatic" pos="4">uu</version>
         <version type="corr" pos="5">u</version>
      </wit>
      <wit ref="3" pos="2">
         <version type="diplomatic" pos="3">vu</version>
         <version type="corr" pos="2">v</version>
      </wit>
   </c>

In this scenario, the original substrings were "....vv", "...uu", and "..vu" (a dot represents a letter not shown in this example). At a correction stage they were changed to "....v", "....u", and ".v". And then for the purposes of collation, they were normalized further to "..v", ".v", and ".v". The output above allows the user to choose which prenormalization form to use, say, for web display. And a @pos is available for every version of every witness.
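As a rough illustration of how a consumer might pick one version for display, here is a Python sketch that reads the proposed structure with the standard library's ElementTree. The element and attribute names are copied from the proposal above, written without a namespace as shown there; the `readings` helper is hypothetical, and nothing here reflects an implemented TAN output format.

```python
import xml.etree.ElementTree as ET

PROPOSED = """
<c>
   <txt>v</txt>
   <wit ref="1" pos="3">
      <version type="diplomatic" pos="5">vv</version>
      <version type="corr" pos="5">v</version>
   </wit>
   <wit ref="2" pos="2">
      <version type="diplomatic" pos="4">uu</version>
      <version type="corr" pos="5">u</version>
   </wit>
   <wit ref="3" pos="2">
      <version type="diplomatic" pos="3">vu</version>
      <version type="corr" pos="2">v</version>
   </wit>
</c>
"""


def readings(c_xml: str, version_type: str) -> dict:
    """Map each witness ref to (text, pos) for the requested version type."""
    c = ET.fromstring(c_xml.strip())
    result = {}
    for wit in c.findall("wit"):
        version = wit.find(f"version[@type='{version_type}']")
        if version is not None:
            result[wit.get("ref")] = (version.text, int(version.get("pos")))
        else:
            # Fall back to the shared normalized text and the collation @pos.
            result[wit.get("ref")] = (c.findtext("txt"), int(wit.get("pos")))
    return result


diplomatic = readings(PROPOSED, "diplomatic")
corrected = readings(PROPOSED, "corr")
print(diplomatic)
print(corrected)
```

Asking for a type that no witness carries falls back to the shared `<txt>` value, which is one possible policy among several.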

On the other hand, one could argue that this is confusing, that it should look more like this:

   <c>
      <wit ref="1" pos="3">
         <version type="diplomatic" pos="5">vv</version>
         <version type="corr" pos="5">v</version>
         <version type="comparison" pos="3">v</version>
      </wit>
      <wit ref="2" pos="2">
         <version type="diplomatic" pos="4">uu</version>
         <version type="corr" pos="5">u</version>
         <version type="comparison" pos="2">v</version>
      </wit>
      <wit ref="3" pos="2">
         <version type="diplomatic" pos="3">vu</version>
         <version type="corr" pos="2">v</version>
         <version type="comparison" pos="2">v</version>
      </wit>
   </c>

I'm not sure which way to go. This needs some deliberation.
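For what it's worth, the two layouts appear to carry the same information, so one can be derived mechanically from the other. The Python sketch below converts the first layout into the second; `to_second_layout` is a hypothetical helper, the element names are taken from the examples above without a namespace, and it assumes the shared `<txt>` and each witness's @pos describe the comparison form.

```python
import copy
import xml.etree.ElementTree as ET

FIRST_LAYOUT = """
<c>
   <txt>v</txt>
   <wit ref="1" pos="3">
      <version type="diplomatic" pos="5">vv</version>
      <version type="corr" pos="5">v</version>
   </wit>
   <wit ref="2" pos="2">
      <version type="diplomatic" pos="4">uu</version>
      <version type="corr" pos="5">u</version>
   </wit>
   <wit ref="3" pos="2">
      <version type="diplomatic" pos="3">vu</version>
      <version type="corr" pos="2">v</version>
   </wit>
</c>
"""


def to_second_layout(c: ET.Element) -> ET.Element:
    """Move the shared <txt> into each <wit> as a 'comparison' version."""
    c = copy.deepcopy(c)  # leave the caller's tree untouched
    txt = c.find("txt")
    shared = txt.text
    c.remove(txt)
    for wit in c.findall("wit"):
        comparison = ET.SubElement(
            wit, "version", {"type": "comparison", "pos": wit.get("pos")}
        )
        comparison.text = shared
    return c


second = to_second_layout(ET.fromstring(FIRST_LAYOUT.strip()))
print(ET.tostring(second, encoding="unicode"))
```

If that equivalence holds, the choice between the layouts is mostly about which is easier to read and query, not about what can be expressed.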
