FrankensteinVariorum / fv-data

TEI data for the Frankenstein Variorum project

Convert hypothesis annotations to W3C Web Annotation JSON-LD #11

Closed mdlincoln closed 4 years ago

mdlincoln commented 5 years ago

This script will

  1. Harvest hypothes.is annotations for the frankenstein group
  2. For each witness' set of annotations, search through the TEI XML version of that witness and revise the XPath selectors
  3. Emit JSON-LD compliant with W3C Web Annotation data model

Supersedes https://github.com/PghFrankenstein/fv-postCollation/issues/9
Supersedes https://github.com/PghFrankenstein/fv-postCollation/issues/3
Supersedes https://github.com/PghFrankenstein/fv-postCollation/issues/2
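
A minimal sketch of what the harvesting step (1) might look like, assuming the public hypothes.is search API; the group ID and API token below are placeholders, not the project's real values:

import json
import requests

API = "https://api.hypothes.is/api/search"
GROUP_ID = "abc123"        # placeholder: the Frankenstein annotation group's ID
TOKEN = "dev-api-token"    # placeholder: a hypothes.is developer API token

def harvest(group_id, token, page_size=200):
    """Page through the hypothes.is search API and collect all annotations for a group."""
    headers = {"Authorization": f"Bearer {token}"}
    rows, offset = [], 0
    while True:
        resp = requests.get(API, headers=headers,
                            params={"group": group_id, "limit": page_size, "offset": offset})
        resp.raise_for_status()
        batch = resp.json()["rows"]
        if not batch:
            break
        rows.extend(batch)
        offset += len(batch)
    return rows

if __name__ == "__main__":
    annotations = harvest(GROUP_ID, TOKEN)
    with open("hypothesis_raw.json", "w") as f:
        json.dump(annotations, f, indent=2)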

mdlincoln commented 5 years ago

Got a drafty script at https://github.com/PghFrankenstein/fv-data/blob/master/hypothesis/openannotation/oa_convert.py

Some open questions I have for @raffazizzi :

Right now I'm just practicing conversion by targeting the underlying XML for the HTML versions of 1818 & 1831 that the annotations group has been making. I'm not fully clear on how that targeting is supposed to resolve on the final website, because I think it depends on how we'll be rendering the annotations. Exactly which TEI file should I be targeting? If it's one reconstructed from the spine during a site build, then I need an example of what that reconstruction looks like, with all the elements tagged with xml:ids so they can be uniquely pointed to.

The second issue is that the annotations don't start and end at xpath selections - they also require character offsets so that we know e.g. "this annotation starts 12 characters after the start of /p[5] and ends 37 characters before the end of /p[6]." Right now I don't see combinations of character offsets with xpath selections in the WA selector vocabulary, although hypothes.is includes such selectors. The alternative is a raw TextPositionSelector:

{
  "type": "TextPositionSelector", 
  "start": 247332,
  "end": 247347
}

which we could only calculate once we have the fully-rendered TEI (or maybe the fully rendered HTML DOM after ceteicean does its thing)?
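
Computing that raw selector would essentially be a substring search over the fully-rendered text - a minimal sketch, assuming we already have that rendered text as a single Python string along with the annotation's exact quote:

def text_position_selector(rendered_text, exact_quote):
    """Locate an exact quote in the fully-rendered text and return a W3C
    TextPositionSelector dict (character offsets into that rendered text)."""
    start = rendered_text.find(exact_quote)
    if start == -1:
        return None  # not found; whitespace differences are the usual culprit
    return {"type": "TextPositionSelector", "start": start, "end": start + len(exact_quote)}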

Any thoughts on this?

ebeshero commented 5 years ago

Hi @mdlincoln: At first I thought we would store pointers to the annotations in the spine files, which will be generating the final editions. If we were to do that, we would need a way to correlate at least the starting word of an annotation with its location on the spine. That is probably not the best way, though. @raffazizzi and I were recently discussing a separate file storing pointers to the annotations that would be mapped to the output edition files. We have output TEI here in the fv-data repo for those first 10 collation units, which we could experiment with to see whether we can get pointers to resolve with XPath and character offsets.

ebeshero commented 5 years ago

@mdlincoln Here’s where to find the TEI output: https://github.com/PghFrankenstein/fv-data/tree/master/variorum-chunks

mdlincoln commented 5 years ago

@ebeshero right, storing annotations within TEI was our initial conversation last fall. The most recent agreement, as reflected in the work scope doc, was that we'd target the Web Annotation (formerly "Open Annotation") standard, which can point to the TEI files. Nothing in the TEI files would need to be aware of the annotations.

@raffazizzi and I were recently discussing a separate file storing pointers to the annotations that would be mapped to the output edition files.

Well, I don't know what this means, especially given that when we last talked, everyone seemed to agree we were using the Web Annotation standard used by Recogito, among others. I'm very glad to take point on reshaping the annotations since I've done so much work with them already, but I can't do that without being in the loop on how you've settled on handling them.

mdlincoln commented 5 years ago

from the scope doc:

Annotations will be stored in a JSON file, one per each alignment chunk (may span chapters).

This will match the chunks as shown in https://github.com/PghFrankenstein/fv-data/tree/master/variorum-chunks, yes? I can at least start building out the converter based on those chunks.

ebeshero commented 5 years ago

@mdlincoln I think I just miswrote that passage you quoted—sorry. You are right that the idea is to keep the annotations in a separate file. The pointers do go to the output edition chunks. @raffazizzi and I just talked about how the pointing seems to work in Recogito, and it was not a set plan but an observation about separating the note data from the pointers. If I remember this right, the surprise to me was that we might want the pointer data to live in a distinct place between the annotations and the output edition files.

It is something like what we do with the spine files in their last stage, where the elements only contain pointers to set locations in the chunk files. For the import of Shelley-Godwin Archive files, Raff is doing pretty much the same thing we need to do with the annotations—trying to calculate the location using XPath and character offset ranges. We simply discussed how it might work. I don't think the storage of pointers in a separate file, or in the same JSON file with the notes, is that big a deal—the important test right now is whether the mapping via XPath and character offsets can work in locating annotated passages.

Let’s consider: What are the differences between the output chunks and their early forms being annotated?

ebeshero commented 5 years ago

@mdlincoln I took a look at your Python script after writing the above—and it reminds me that I have a Python script that defines strings to ignore (to read around) during the collation process—I think we want to do something similar here to filter out the new <span> elements in the collationChunk files. Am running late for a meeting, but I can point you to my Python script and the relevant lines later today.

mdlincoln commented 5 years ago

@ebeshero ah ok, good - I figured we were on the same page.

It's been no issue to filter the <span> elements - I just need to experiment with how line breaks work differently in the TEI p elements vs the HTML ones. Will keep trying.

mdlincoln commented 5 years ago

I've been trying to work this by looping through the p elements of the target TEI and checking to see if the highlighted text is present. However, there's inconsistency in how spacing between seg elements is handled, which is making string matching against the HTML annotations quite difficult. An example from f1831_C03.xml:

<seg xml:id="C03a_app13-f1831">and the mild<pb n="8" xml:id="F1831_v_024"/>ness of his
 discipline. </seg>

               <seg xml:id="C03b_app1-f1831">This circumstance, added to his well known integrity and
 dauntless courage, made me very desirous to engage him. A youth passed in solitude, my best
 years spent under your gentle and feminine fosterage, has so refined the groundwork of my
 character, that I cannot overcome an intense distaste to the usual brutality exercised on board ship:
 I have never believed it to be necessary; and when I heard of a mariner equally noted for his
 kindliness of heart, and the respect and obedience paid to him by his crew, I felt myself peculiarly
 fortunate in being able to secure his services. I heard of him first in rather a romantic manner, from
 a lady who owes to him the happiness of her life. This, briefly, is his story.</seg>

               <seg xml:id="C03c_app1-f1831">Some years ago he loved a young Russian lady, of 
moderate fortune; and having amassed a considerable sum in prize-money, the father of the girl
 consented to the match. He saw his mistress once before the

If I grab all the text from these three segs and concatenate to get the complete text of their parent paragraph, I end up with a space at the sentence break between the first two, but not at the break between the second pair. One way around this would just be to remove all whitespace for the sake of doing comparisons and at least finding the proper p to match to. But these inconsistencies may make it difficult to calculate the character offsets. Any ideas?
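
(For reference, this is roughly how I'm pulling and concatenating the seg text with lxml - a sketch, assuming the variorum chunks are in the TEI namespace:)

from lxml import etree

TEI = "{http://www.tei-c.org/ns/1.0}"
tree = etree.parse("variorum-chunks/f1831_C03.xml")

for p in tree.iter(TEI + "p"):
    # itertext() walks every text node inside each seg, including tails after <pb/> etc.
    pieces = ["".join(seg.itertext()) for seg in p.findall(TEI + "seg")]
    joined = "".join(pieces)
    # whether this reconstructs the paragraph depends on whether the trailing space
    # before </seg> is "real" -- exactly the inconsistency described above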

ebeshero commented 5 years ago

My sense of this is very much drawn from XPath, where I can run the normalize-space() function to collapse runs of extra whitespace (anything beyond a single space, I think). Do we have an analogous function in Python? Looks like people are doing a combination of split() and join() to do this... See https://stackoverflow.com/questions/46501292/normalize-whitespace-with-python
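
(The split/join idiom from that thread, for reference - a one-line Python equivalent of normalize-space():)

def normalize_space(s):
    """Collapse runs of whitespace to single spaces and trim the ends,
    like XPath's normalize-space()."""
    return " ".join(s.split())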

mdlincoln commented 5 years ago

I can remove the spaces fine, but calculating offsets will be unreliable if there's inconsistency in whether we can trust that breaks between seg elements shouldn't contain characters. The break between <seg xml:id="C03b_app1-f1831"> and <seg xml:id="C03c_app1-f1831"> in the example above suggests an implicit space in the original text between the end of one sentence and the start of the next. But <seg xml:id="C03a_app13-f1831"> (the first one) explicitly includes the whitespace after the end of the sentence - which I presume is "real" whitespace from the witness?

In other words, are we confident that the spacing inside seg elements is correct?

mdlincoln commented 5 years ago

hope that makes sense.... it's late in the day for me 😪

ebeshero commented 5 years ago

Yes--it makes sense! I think we can always expect there to be one real white space between the end of one sentence and the beginning of the next. Let me take a quick look at our <seg> elements to be sure I remember how I was outputting them. If you want, I can do another output in which I remove the format-and-indent--basically no pretty-printing. I bet that would help!

mdlincoln commented 5 years ago

@ebeshero yes, removing pretty-printing will certainly help out with the general "getting rid of extraneous linebreaks" problem.

Do please look around at some of the seg boundaries though, because even if pretty-printing is turned off, I think there's still some odd behavior in whitespace at the very end of some segs that coincide with the ends of sentences. Most seem to have that one real whitespace included before the closing </seg> tag, but a few do not?

(fyi I'm away for a long weekend tomorrow, but I'll be looking back at this again on Monday morning. Thanks for your help sorting out stuff)

ebeshero commented 5 years ago

White spaces will be the bane of our existence, I think. It was ever thus. I'll see if I can make this more predictable!

ebeshero commented 5 years ago

I think the question is whether a white space is output before or after the seg, and there may actually be some reason for each variation--I'll dig around the code and see.

ebeshero commented 5 years ago

A thing to try if you have a minute while waiting for me: My gut is telling me that yes, we'll be okay if we trust that white spaces between <seg> elements are not part of the edition. I think we can trust this because we needed tight white space control for the collation. So, for the moment we could test my hypothesis: pretend those white spaces don't matter and see if you get reliable matches. Does it work?

ebeshero commented 5 years ago

@mdlincoln I'm carefully working through my pipeline of XSLTs in the fv-processing repo to strip white spaces during processing and I think I'm making progress, but it's outputting super long lines in our fv-data TEI files. Hopefully not an issue for you--except now I imagine that if I get rid of all extra white space between elements, there will be less whitespace here than in the HTML to which the hypothes.is annotations are hooked.

Maybe this isn't really an issue if we can read around those white spaces in the HTML files anyway. Anyway, I think the fv-data files you're working with will at least have fewer whitespaces between <seg> elements than before when you come back to them. Let's see how this goes. I'm still in the middle of processing the pipeline and checking the outputs--I'll ping when I've got a new batch of files for you.

ebeshero commented 5 years ago

Yikes--no! I posted too soon: the white-space removal I tried early in my pipeline results in removing too many white spaces so words are getting spliced--not good. Back to the drawing board...

ebeshero commented 5 years ago

@mdlincoln Okay! I think I found a solution by writing a script to trim off text nodes that are all and only white spaces--a solution that leaves the edition data alone and just trims off the stuff between the <seg> and the <p> elements. This might be helpful to @raffazizzi too, and maybe anything that will involve counting character offsets.
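
(For comparison, the same trimming could be sketched in Python with lxml - dropping text nodes that are nothing but whitespace around the <p> and <seg> elements while leaving real edition text alone; the file paths are placeholders and this assumes any "real" inter-sentence space lives inside the <seg>, per the discussion above:)

from lxml import etree

TEI = "{http://www.tei-c.org/ns/1.0}"
tree = etree.parse("variorum-chunks/f1831_C03.xml")  # placeholder path
root = tree.getroot()

for el in root.iter(TEI + "p", TEI + "seg"):
    if el.text is not None and not el.text.strip():
        el.text = None   # whitespace-only text node (pretty-printing residue)
    if el.tail is not None and not el.tail.strip():
        el.tail = None   # whitespace-only tail between elements

tree.write("f1831_C03_trimmed.xml", encoding="utf-8")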

Just in case these are not helpful or we see something wrong, I've posted these in a new pair of directories here in fv-data:

Give these a look and let me know if they will help!

raffazizzi commented 5 years ago

Hi @mdlincoln and @ebeshero , sorry for jumping in late! It looks like you've already ironed a few wrinkles out, but here's my 2c.

The second issue is that the annotations don't start and end at xpath selections - they also require character offsets so that we know e.g. "this annotation starts 12 characters after the start of /p[5] and ends 37 characters before the end of /p[6]." Right now I don't see combinations of character offsets with xpath selections in the WA selector vocabulary, although hypothes.is includes such selectors. The alternative is a raw TextPositionSelector:

{
  "type": "TextPositionSelector", 
  "start": 247332,
  "end": 247347
}

This is a fair solution and it works with the Web Annotation standard, but an alternative solution would be doing the same thing we do in the spine, that is using XPointer and pretending that it's a URL to a resource rather than a selector within a resource.

Example: ox-ms_abinger_c56-0045.xml#string-range(//tei:zone[@type='main']//tei:line[1],8,10)

There will be code in the app to resolve these kinds of pointers, so we would be able to re-use it for annotations anchored in this way. The downside is that XPointers are not well supported, so it would make our annotations harder to re-use, while the TextPositionSelector would (I hope) be more useful.

However, as you've discussed, white space will get in the way because of XML conventions (multiple spaces === one space when parsing). So in generating XPointers like the one above we have been counting multiple spaces as 1 character. Whether this is good practice or not is yet to be established. We are at the bleeding edge of this IMO.

raffazizzi commented 5 years ago

@mdlincoln I've also just seen your email about this (sorry, I'm still piecing everything together) where you mention Recogito's use of XPathSelector. I wonder if it would be possible to use that plus a relative string range selector? Because that's basically what the XPointer does: it uses XPath to pick the starting place, then it counts characters from there.

mdlincoln commented 5 years ago

@ebeshero @raffazizzi Based on our chat last week, I've done another pass at lining up annotations using the following process:

  1. Do a rough match (pulling just the text content from the chunk TEI and removing all whitespace, and matching against the hypothesis text annotation with all of its whitespace removed)
  2. When a potential chunk has been identified, cycle through the chunk text (including those texts NOT inside seg elements) to find a starting position, then continue cycling to locate the ending position.
  3. I'm using lxml to locate elements, so when text is inside a seg I can get the seg id and the character offset from the start of the element. If the text is outside a seg, it'll give me the nearest element and the offset from the end of that element - not sure how we'd like to represent that in the jsonld
  4. Outputs to look at:
    1. Successfully-matched annotations as jsonld: https://github.com/PghFrankenstein/fv-data/blob/master/hypothesis/data/oa.jsonld
    2. Annotations (in original hypothes.is format) where I found them in a chunk, but the element-matching steps couldn't find a start and end element (just a few of these, looks like they're really lengthy) https://github.com/PghFrankenstein/fv-data/blob/master/hypothesis/data/missmatch.json
    3. Annotations where the first chunk-matching step failed, mostly b/c they belong to parts of the book not yet collated in the variorum-chunks directory https://github.com/PghFrankenstein/fv-data/blob/master/hypothesis/data/nomatch.json

Current shortcoming: by looping chunk by chunk, this will only account for annotations that do NOT cross chunks. I can reconfigure this to look at the union of all chunks when finding the start and end elements. However, this is only worth doing if the interface will show multiple chunks at once. If the plan is to have separate chunks per page, then the annotations need to be split by chunk - the web annotation framework specifically points at ONE target url. Any thoughts on this?
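
(A condensed sketch of that matching logic - the function names and paths here are illustrative, not the actual script:)

from lxml import etree

def squash(s):
    """Remove all whitespace so rough matching ignores spacing differences."""
    return "".join(s.split())

def find_chunk(annotation_exact, chunk_paths):
    """Step 1: rough match -- return the first chunk whose whitespace-squashed
    text contains the squashed annotation text, or None (-> nomatch.json)."""
    needle = squash(annotation_exact)
    for path in chunk_paths:
        haystack = squash("".join(etree.parse(path).getroot().itertext()))
        if needle in haystack:
            return path
    return None

# Steps 2-3 then walk the matched chunk element by element with lxml to find the
# start/end elements (seg xml:ids where available) and the character offsets.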

ebeshero commented 5 years ago

@mdlincoln I think we need to be prepared for separate chunks per loaded page and decide where they belong (move if necessary). I am thinking that since most of our chunks begin and end on chapter boundaries anyway it is unlikely we will see many that span across the seams—they seem like special cases requiring special handling and perhaps a bit of rewriting if we do want some annotation on either side of a collation seam.

ebeshero commented 5 years ago

@mdlincoln That said, I might be wrong, especially if @jaquirk et al are marking chapter headings. Can we get a sense of whether and how often we see a chunk-spanning annotation? I can think of one possible case where 1831 introduces a new chapter early in the novel (within our prepped C01 - C10 in variorum-chunks).

mdlincoln commented 5 years ago

@ebeshero Right, I'm assuming that there will be at least a few chunk-spanning annotations. It's tricky to ID those now, given that we can expect many of the annotations to have no successful matches yet since they fall outside of the text that's been collated to date. Could you spot check the nomatch.json file to see if there are any that you know should have a match somewhere in the current chunks? An example or two would help me code up something that could help detect those.

raffazizzi commented 5 years ago

If annotations are split across chunks, I would suggest repeating the annotation in each chunk to make sure they show up.

@mdlincoln re: your point 3., this is how Recogito handles it:

"target" : {
    "source" : "https://recogito.pelagios.org/part/771d7bff-b2a1-4506-a29d-98152b2ce3d7",
    "type" : "Text",
    "selector" : [ {
      "type" : "TextQuoteSelector",
      "exact" : "nge my flame.\n          Then Faithless whit"
    }, {
      "type" : "RangeSelector",
      "startSelector" : {
        "type" : "XPathSelector",
        "value" : "/tei/text/body/div/div/lg/l[6]"
      },
      "endSelector" : {
        "type" : "XPathSelector",
        "value" : "/tei/text/body/div/div/lg/l[7]"
      }
    } ]
  }

You can see that there is an XPathSelector for the start and end elements (can be the same), and a TextQuoteSelector to match the actual string. How does the hypothesis data handle this?

The alternative would be using an XPointer string-range() selector like we do in the collations, the advantage being that we already know how to resolve those, but other systems may not. It looks like in Web Annotation that would be modeled as a FragmentSelector: https://www.w3.org/TR/annotation-model/#fragment-selector

mdlincoln commented 5 years ago

@raffazizzi That's not quite the problem though.

Here's what I can produce now: take this fragment of XML from f1818_C09.xml:

<seg xml:id="C09_app398-f1818">endeavours so </seg>soon as I should point them towards 
<pb n="086" xml:id="F1818_v1_098"/>the object of my search, than to exhibit that object 
already accomplished. I was like the <seg xml:id="C09_app406-f1818">Arabian </seg>who 
had been buried with the dead, and found a passage to <seg xml:id="C09_app409-f1818">life
 </seg>aided only by one <seg xml:id="C09_app411-f1818">glimmering, </seg>and seemingly
 <seg part="I" xml:id="C09_app413-f1818__I">ineffectual, light.</seg>
{
"target": {
      "source": "https://ebeshero.github.io/Pittsburgh_Frankenstein/Frankenstein_1818.html",
      "type": "Text",
      "selector": [
        {
          "type": "TextQuoteSelector",
          "prefix": "at object already accomplished. ",
          "exact": "I was like the Arabian who had been buried with the dead",
          "suffix": ", and found a passage to life ai"
        },
        {
          "type": "RangeSelector",
          "startSelector": {
            "type": "XPathSelector",
            "value": "//[@xml:id='F1818_v1_098']"
          },
          "startOffset": 75,
          "endSelector": {
            "type": "XPathSelector",
            "value": "//[@xml:id='C09_app406-f1818']"
          },
          "endOffset": 33
        }
      ]
    }
}

The startOffset in this case happens to be 75 characters from the start of the startSelector.value; however, the endOffset is actually 33 characters from the end of endSelector.value. n.b. if we had another annotation that started in a non-seg element, then it might well be possible that its startSelector/startOffset pair would similarly be counting from the closing tag of the startSelector rather than the opening tag.

I have access to that count-from-inside vs. count-from-close distinction in the Python code - but I don't see what fragment selectors have to do with indicating where the offset count starts in relation to the selected element?

I could standardize so that all the counts go from the start of the selected element - it would depend entirely on the assumptions of the rendering software though.

mdlincoln commented 5 years ago

(fwiw this character offset problem, and the problem of linking annotations across witnesses, goes away if we can just wrap everything in seg elements - I wasn't sure how difficult that is to do in the collation pipeline though, @ebeshero ?)

raffazizzi commented 5 years ago

I don't see startOffset and endOffset being part of the Web Annotation Data Model, unless I'm looking in the wrong place? https://www.w3.org/TR/annotation-model

If it's an extra field we're supplying, then I would suggest using TEI XPointer string-range() instead, just so that we use the same pointing mechanism across the collation and the annotations. Here's what it would look like (remember that I'm assuming whitespace is 'normalized').

string-range(//*[@xml:id='F1818_v1_098'],75, 88)

With string-range you only need the XPath to the starting point, then you start counting characters from there, ignoring tags that you find along the way (so collecting tree leaves, or text nodes, as you go). So if I counted correctly, the selection starts at char 75 from the start selector and ends at char 88 from the start selector. The TEI documentation is here.
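
(A rough Python sketch of that counting, hedged on the same whitespace-normalization assumption - it starts at the anchor element and keeps collecting text nodes in document order; it does not implement the full TEI string-range spec:)

import re
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def string_range(tree, anchor_xpath, start, end):
    """Resolve a string-range()-style pointer: begin counting characters at the
    anchor element and continue through following text nodes, ignoring tags and
    counting any run of whitespace as a single character."""
    anchor = tree.xpath(anchor_xpath, namespaces=TEI_NS)[0]
    text = "".join(anchor.itertext()) + "".join(anchor.xpath("following::text()"))
    normalized = re.sub(r"\s+", " ", text)
    return normalized[start:end]

# e.g. string_range(etree.parse("variorum-chunks/f1818_C09.xml"),
#                   "//*[@xml:id='F1818_v1_098']", 75, 88)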

In the Web Annotation Data Model there's an example with XPointer under FragmentSelector, but it conforms to the XML specification, which is very limited in what it can do. We would have to say that it conforms to the TEI specification (linked above).

{
"target": {
      "source": "https://ebeshero.github.io/Pittsburgh_Frankenstein/Frankenstein_1818.html",
      "type": "Text",
      "selector": {
          "type": "FragmentSelector",
          "conformsTo": "https://tei-c.org/release/doc/tei-p5-doc/en/html/SA.html",
          "value": "string-range(//*[@xml:id='F1818_v1_098'],75, 88)"
       }
    }
}

NB this would be possible to express using the XML XPointer specification from the examples in the Web Annotation Data Model, but if I understand it correctly, it would need three selectors because the count has to happen within one element, not adjacent to it. If you're interested I can cook up an example.

ebeshero commented 5 years ago

@mdlincoln I'm taking a look at your no-matches, and I see the exact matches are mostly quite short, and I do find them quickly inside the edition files. I "cheated" to do that, and the cheat might be instructive: I used this page that I asked the Annotations team to consult so that they could quickly survey this text at a different moment in the other editions: https://pghfrankenstein.github.io/Pittsburgh_Frankenstein/tableView.html

If you search a text string on this page and find it in something with a green/grey background, it's in the finalized variorum-chunks and the collation app numbers are reliable. These also show you some of what you're wishing you had available: an edition with every alignment moment marked, because every collation app (including the unified ones) is represented. This is an HTML table view of the P1 files that contain full text (and flattened tags) at every moment of alignment. I know that you'd like me to produce the variorum edition with <seg> elements marking the unison passages as well as the variant ones, and I hear you, but I just don't have time (and now is really not the right moment) to redo the pipeline to accommodate that. If we do this at all, it'll have to be next year, since it would definitely disrupt the data that the Agile team is working with. Also, this is not as easy to produce as you might think, and there are good reasons not to add elements when they do not communicate variance. What we're doing follows TEI's parallel segmentation model, which specifically does not flag passages when they run in unison. I'm not in favor of adding markup that isn't meaningful to the variorum and might introduce white space issues and a chain of new problems with the variorum output that we don't need to deal with. But I am wondering whether we might approach the method of target connections to destination files a little differently and find a reasonable compromise. Try me on this:

We don't have completed variorum-chunks for the whole edition yet. There will be more of them in a few months, but not right now. When we have them, we can be sure that the collation <app> units in the spine and the <seg> elements in corresponding editions are stable. We do have, however, a scaffolding to work with that could possibly help make relocation of variants a little easier. We have P1 files, the first stage of constructing the spine, that hold the entire edition, and the data from P1 is output on the HTML page I linked.

What if you were to attempt locating passages in this very differently structured file? And what if annotations came in with start and end point data associated with P1 XML files (or this HTML output of them)? We do still have the problem of the post-C10 collation units being subject to revision, but even so, if it's easier to automate the location of attachment points with respect to these documents, perhaps that's a better way to proceed. It would mean, I think (if I understand this right), building a JSON file that points to start and/or end positions in P1. You'd have to read around fewer tags, but you'd also have flattened markup to dodge around, too. I'm not sure the situation is much better, but I do wonder if it's easier to search this way.

I also wonder (from reading @raffazizzi's post above) if the pointer positioning process would benefit from treating all markup as text characters--not attempting to deal with hierarchy at all, but dodging anything in angle brackets. Annotations are supposed to be able to interrupt hierarchies, after all. I think we just want to indicate specific moments inside text strings where an annotation begins to apply, and where it ceases to be applicable. If we can do that in the P1 files as they are prepped to go into the pipeline to make the variorum editions, can we then scaffold the attachment data along with the variorum editions as they're being constructed?
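
(A crude illustration of the "dodge anything in angle brackets" idea - treat the serialized file as a string and strip the markup, leaving only the text stream in which annotation start and end points would be counted; a sketch, not a robust XML parser:)

import re

def strip_tags(xml_string):
    """Drop anything in angle brackets and collapse whitespace, leaving only
    the character stream annotations would be anchored against."""
    no_tags = re.sub(r"<[^>]+>", "", xml_string)
    return " ".join(no_tags.split())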

ebeshero commented 5 years ago

@mdlincoln Reviewing this, I'm realizing we have a couple of problems: My P1 directory has only C01 - C10 in it. But the files that went into my web view for the Annotations team are in these two directories--basically pre P1:

collated-data (through C10, basically an ur-version of P1 files): https://github.com/PghFrankenstein/fv-postCollation/tree/master/postColl-workspace/collated-data

unready-collated-data (up through C26): https://github.com/PghFrankenstein/fv-postCollation/tree/master/postColl-workspace/unready-collated-data

I just realized we have some thorny and interesting collation batch processing issues to address with the last set of collation units, C27 - C33, so, alas, the entire novel isn't totally available in this format just yet. But I can make that a priority when I get back to collation editing. (It's just not going to happen in the next month with conference travel and talks on Frankenstein and Mitford coming up, not to mention a release of the TEI Guidelines for which I'm responsible!)

But the methodological question holds: Can we work with these tabular files, and are they easier to deal with than variorum edition files for indicating start and end points for annotations?

mdlincoln commented 5 years ago

@ebeshero to start off, I agree that adding more collation pipeline work to your plate is impossible now - since it's a heavy lift, let's hold off.

There are two big issues here:

Targeting an annotation on the variorum viewer page

Given the variety of pointer possibilities, perhaps we need to restrict ourselves to a pointer system already implemented by a specific annotation display software. It'd certainly make my job easier. If web annotation won't include xpath + character offset, then could we just eventually use e.g. AnnotatorJS which accepts offsets? (it's the core of hypothes.is after all) It doesn't make much sense for me to continue working on the annotations without having that decision in place. @raffazizzi I hear you on the xpointer schema, but who's implemented an annotation viewer for that yet?

Matching an annotation from witness A to equivalent location in witness B

The beauty of the collated files is that they minimize markup. However they are a difficult thing to work with programmatically for the task of picking an arbitrary span of text from one witness, and finding the associated spans of text (or non-text, in the case of a witness not having any text in that spot). The "tabular" mode you describe provides a more consistent interface for that task and others like it - it's certainly more natural to program around. Again, though, I wouldn't worry about building that out until we have an answer to how we eventually envision displaying annotations

What do we want for the MVP?

For this round, obviously we aren't going to be building our own annotation display - so for demo purposes I trust it'd make sense to enter and display some annotations literally just using hypothes.is? :P

mdlincoln commented 5 years ago

If we can do that in the P1 files as they are prepped to go into the pipeline to make the variorum editions, can we then scaffold the attachment data along with the variorum editions as they're being constructed?

@ebeshero sorry I missed this bit when I was reading through. This means baking annotations into the variorum text, does it not? I strongly believe our current decision - to keep them a separate data layer - makes life far simpler. We can keep the variorum viewer application focused on variations, and separate out the concerns of the annotation viewer component - which has to be separate anyway, since we've no resources to build it out now! Creating them in an open, standard format makes them more easily interoperable and extensible - if we ever wanted to add e.g. comments on annotations, or allow other users to contribute annotations, we wouldn't need to touch the underlying variorum TEI again.

mdlincoln commented 4 years ago

@ebeshero @raffazizzi Good news: mapping from the HTML p elements to the XML p elements seems to work entirely straightforwardly for the 1818 edition. I'm going to work up output for 1818 and 1831 for the p elements and push it here for your review later today.

If we like how it looks, then I'll dig into mapping the h* elements, and at least creating a list of all non-h*/p-anchored annotations in case we need to craft any by hand.

ebeshero commented 4 years ago

Hooray! That’s a relief! So far so good...

mdlincoln commented 4 years ago

See files at hypothesis/openannotation/*_xml_id_mapping.json

At the bottom of each JSON object is some diagnostic data, including the p index from the original HTML, the XML content at that index, and the offset I'm using for that particular witness.

This works fine for 1818, but for 1831 the introduction section totally changes the assumptions about how the ordering of paragraphs goes - any annotations after the introduction I can line up with an offset of -19 from the HTML p index, but I've had the script skip over annotations that show up before then. Take a look at the output and advise.
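
(The mapping logic is basically an index shift - a sketch with illustrative names; -19 is the 1831 offset mentioned above:)

def html_index_to_xml_p(html_p_index, xml_p_elements, offset=-19):
    """Map a paragraph's index in the annotated HTML to the corresponding XML <p>,
    skipping anything (e.g. the 1831 introduction) that falls out of range."""
    xml_index = html_p_index + offset
    if xml_index < 0 or xml_index >= len(xml_p_elements):
        return None  # skipped: no XML paragraph to line up with yet
    return xml_p_elements[xml_index]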

mdlincoln commented 4 years ago

Elisa, would this be made simpler if you could produce full XML files for the witnesses that include xml IDs for all p elements the way that you'd eventually label them in the collated chunks? Right now I'm concatenating the files from variorum-chunks to get a version of the XML where I can count through paragraphs, but this obviously causes issues for those annotated paragraphs that aren't in chunks yet.

ebeshero commented 4 years ago

@mdlincoln Yes—I was thinking that when we were meeting. It is easy enough to generate. Where’s a good place for me to push them?

ebeshero commented 4 years ago

Notes to myself: Reviewing my processing pipeline: very early, before the texts go through collation, I'm generating what will become those @xml:id attributes. I see I'm doing that at the point when I flatten the markup, where <p> elements turn into flattened <p sID="id1"/> and <p eID="id1"/> markers. Full files of the complete editions with these id's are already prepped here: https://github.com/FrankensteinVariorum/fv-collation/blob/master/collateXPrep/print-fullFlat/ , and here's 1818 for example: https://github.com/FrankensteinVariorum/fv-collation/blob/master/collateXPrep/print-fullFlat/1818_fullFlat.xml

The problem is that the @xml:ids are being split into "Trojan horse" markers with @sID and @eID and the flattened XML is probably not the easiest to match up with the HTML on which the annotations were made.

Basically what I need to do is pretty simple: output the @xml:ids in the full versions of these files in this directory: https://github.com/FrankensteinVariorum/fv-collation/tree/master/collateXPrep/print-full.

ebeshero commented 4 years ago

@mdlincoln I've pushed a directory of the XML files for the complete print editions (1818, Thomas, 1823, and 1831) that assigns the elements' @xml:ids exactly the way they're assigned in the output variorum chunks. It's here: https://github.com/FrankensteinVariorum/fv-data/tree/master/hypothesis/migration/xml-ids

Note:

All the elements in these XML files should be just as they will be in the output variorum, and all that is missing are <seg> elements marking collation hotspots. Let me know if you want me to change something here. For example, would it be better for me not to convert the <div> elements into <milestone/> markers in this copy? (No problem--that's easy to change--just let me know what's better for your processing.)

mdlincoln commented 4 years ago

@ebeshero I'll start to take a look this afternoon/tomorrow morning

mdlincoln commented 4 years ago

@ebeshero I notice the new XML doesn't contain the http://www.tei-c.org/ns/1.0 namespacing that shows up in the collated chunks. Is that intentional? Either way is fine by me, but might be good to add it for consistency's sake, since I need to specify namespaces when processing the XML.

ebeshero commented 4 years ago

@mdlincoln Good catch. I think it doesn't show up in the output because the input isn't actually defined in a namespace. (It's understood as XML on its way to becoming TEI by the end of our processing.) I'll see about adding it and send up the files again.

ebeshero commented 4 years ago

@mdlincoln It's trivially easy to put the namespace in, but having done it I'm not sure it's the right thing to do with this XML because it really isn't TEI. There's not a TEI root element, or a teiHeader element, or the basic text structure of a TEI document. I can create them, I suppose...

mdlincoln commented 4 years ago

@ebeshero reading your earlier comment, I understand why they don't belong there now! Let's leave the XML as it is now.

ebeshero commented 4 years ago

@mdlincoln Ah! I didn't see your comment, and I went home and produced some TEI this evening and pushed it up with this commit: https://github.com/FrankensteinVariorum/fv-data/commit/6c0b26e4a30aebcd976d2b36eece76fcc9229585 . The TEI isn't valid to the TEI-all schema mainly because it lacks any larger wrapper <div> elements. (The HTML on which the hypothes.is annotations were made doesn't have any big <div> wrapper elements either.) But in proper TEI fashion, I've given these files a proper TEI Header and documented these issues in an <encodingDesc>. These are actually half-decent TEI files for providing some documentation anyway, and if we want to distribute them as part of the Variorum's deliverables, well, these are more informative and helpfully sharable than they were before. Take a look at these files and see if they're useful! I think your offsets may need to change a bit now that there's a TEI header that's different from the HTML headers, but otherwise, I hope these are useful for the migration.

I had an idea as I was creating these that perhaps we should "raise" the div elements from the milestones, but I see we don't have that structure in the HTML, so I think for now we should leave them. But if we decide to serve these files as serious TEI deliverables for the project, I think I should raise the <div>s.

mdlincoln commented 4 years ago

The 1831 XML doesn't include the introductions, but there are plenty of annotations on them (for example, from 1831: "It is not singular that, as the daughter of two persons of distinguished literary celebrity, I should very early in life...").

Should I try to skip these annotations? Or can we generate XML containing them?

mdlincoln commented 4 years ago

I've pushed updated mapping json files for both p and h3/head elements (excluding elements in the introduction where the offset results in an index-out-of-bounds error)

ebeshero commented 4 years ago

@mdlincoln Sorry about somehow missing the 1831 introductions! If the annotations team annotated them, I should definitely be generating them here--let me see if I can find out what's wrong. (I wonder if I got the preface material in the other editions, too...)