FrankensteinVariorum / fv-data

TEI data for the Frankenstein Variorum project
The Unlicense

Convert hypothesis annotations to W3C Web Annotation JSON-LD #11

Closed mdlincoln closed 4 years ago

mdlincoln commented 5 years ago

This script will

  1. Harvest hypothes.is annotations for the frankenstein group
  2. For each witness' set of annotations, search through the TEI XML version of that witness and revise the XPath selectors
  3. Emit JSON-LD compliant with W3C Web Annotation data model
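
As a rough illustration of step 3, here is a minimal Python sketch of wrapping one harvested annotation in the Web Annotation envelope. It is not the actual script: the `row` record with its `text`, `uri`, and `quote` keys is a hypothetical, simplified input, and `start_id`/`end_id` stand in for whatever step 2 produces.

```python
import json

# JSON-LD context published by the W3C for the Web Annotation data model.
WEB_ANNOTATION_CONTEXT = "http://www.w3.org/ns/anno.jsonld"


def xpath_for_id(xml_id):
    """XPath expression selecting the element that carries this xml:id."""
    return f"//*[@xml:id='{xml_id}']"


def to_web_annotation(row, start_id, end_id):
    """Build a minimal Web Annotation from one harvested record.

    `row` is a hypothetical, simplified record; `start_id` and `end_id`
    stand in for the result of step 2 (matching against the TEI XML).
    """
    return {
        "@context": WEB_ANNOTATION_CONTEXT,
        "type": "Annotation",
        "body": [{"type": "TextualBody", "value": row["text"], "purpose": "commenting"}],
        "target": {
            "source": row["uri"],
            "selector": [
                row["quote"],  # TextQuoteSelector, passed through unchanged
                {
                    "type": "RangeSelector",
                    "startSelector": {"type": "XPathSelector", "value": xpath_for_id(start_id)},
                    "endSelector": {"type": "XPathSelector", "value": xpath_for_id(end_id)},
                },
            ],
        },
    }


example = to_web_annotation(
    {
        "text": "A note",
        "uri": "https://example.org/witness.html",  # placeholder URL
        "quote": {"type": "TextQuoteSelector", "exact": "some text"},
    },
    "preface1_p1",
    "preface1_p1",
)
print(json.dumps(example, indent=1))
```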

Supersedes https://github.com/PghFrankenstein/fv-postCollation/issues/9
Supersedes https://github.com/PghFrankenstein/fv-postCollation/issues/3
Supersedes https://github.com/PghFrankenstein/fv-postCollation/issues/2

ebeshero commented 4 years ago

@mdlincoln Found it--the Introduction to 1831 is something we withheld from the collation process because it's not present in the other editions. So I'd tucked it away in an include file in my XML and didn't expand it here. I did expand it for the HTML that the team annotated, and we do need to account for it in the Variorum Viewer, so it's good we caught this now!

I suppose the 1831 introduction is one giant location of variance--it's simply there in 1831 and not there at any point earlier.

ebeshero commented 4 years ago

@mdlincoln Okay! The 1831 file now properly includes its introduction with this commit https://github.com/FrankensteinVariorum/fv-data/commit/1fbb76def2b5f30bb052339ffa383f83dfb38122 . Let me know if you run into any other snags!

mdlincoln commented 4 years ago

@ebeshero ok, will check it now

mdlincoln commented 4 years ago

@ebeshero Ah, I'm sorry to return with even more problems! A filter I'd put on how I was mapping the annotations was masking some problems that I should have caught before asking you to regenerate.

I'm now seeing that the annotations team has annotations that start all the way on the title page before the introduction, for example: https://hyp.is/p2gtHJRUEemef2cqr_cA0Q/ebeshero.github.io/Pittsburgh_Frankenstein/Frankenstein_1831.html (reminder, you'll need to be logged into the FV annotation account to see this)

Can you make sure all of that front material also makes it into the XML with IDs?

ebeshero commented 4 years ago

Ah! I had forgotten I’d given them everything including title pages in the HTML—sorry for rushing through this. I can output the title pages too.

ebeshero commented 4 years ago

Apparently I output ALL the includes for the HTML... Let me get to this in about an hour! I'm in a meeting.

ebeshero commented 4 years ago

@mdlincoln Sorry about the delay! With this commit, I believe I've included the entire frontmatter and backmatter (titlepages, etc) that the Annotations team worked with. https://github.com/FrankensteinVariorum/fv-data/commit/c65d00104efe86671458126036085908a0e2540e Now, let's see if this works!

mdlincoln commented 4 years ago

OK, I think we're pretty close. My spot checks through the 1831 XML mapping JSON suggest that p and head elements are lining up alright, but please take a close look at the examples, such as:

 "target": {
   "source": "https://frankensteinvariorum.github.io/fv-collation/Frankenstein_1831.html",
   "type": "Text",
   "selector": [
    {
     "type": "TextQuoteSelector",
     "prefix": " \n      \n      \n      \n      by ",
     "exact": "the author of THE LAST MAN, PERKIN WARBECK &c. &c.",
     "suffix": "\n      \n      \n      \n      revi"
    },
    {
     "type": "RangeSelector",
     "startSelector": {
      "type": "XPathSelector",
      "value": "//*[@xml:id='frontmatter1_head6']"
     },
     "endSelector": {
      "type": "XPathSelector",
      "value": "//*[@xml:id='frontmatter1_head6']"
     }
    }
   ]
  },
  "diagnostic": {
   "note": "not for open annotation consumption",
   "html": {
    "start": 6,
    "end": "/h3[6]"
   },
   "xml_text_content": "<head xmlns=\"http://www.tei-c.org/ns/1.0\" xmlns:xi=\"http://www.w3.org/2001/XInclude\" xml:id=\"frontmatter1_head6\">\n         <hi rend=\"smallcaps\" xml:id=\"frontmatter1_head6_hi1\">BY THE AUTHOR OF</hi> THE LAST MAN, PERKIN WARBECK &amp;c. &amp;c.</head>\n    \n      "
  }

The TextQuoteSelector is what hypothesis gave us; the RangeSelector is the new xml:id I've tried to map. The diagnostic object shows the original HTML locators and the content of the selected XML element, just so it's easier to check how the matching went.

1818 p elements are fine, but for some reason there's an unexpected mismatch in the h3 numbering - I wonder if it's possible you changed the underlying HTML after they made an annotation? Anyway, it's just a handful of mismatched ones.
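
For readers following the diagnostic object, one possible way to turn an HTML locator such as /h3[6] into an xml:id is to count matching TEI elements in document order. The Python sketch below is only an illustration under assumptions and not necessarily how the script works: the `HTML_TO_TEI` tag correspondence and the function name are hypothetical. It also shows why an h3 added to or removed from the HTML after annotation would shift every later match.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Assumed correspondence between HTML tags and TEI elements.
HTML_TO_TEI = {"h3": "head", "p": "p"}


def xml_id_for_html_step(tei_root, html_step):
    """Return the xml:id of the TEI element matching e.g. '/h3[6]', or None."""
    match = re.fullmatch(r"/(\w+)\[(\d+)\]", html_step)
    if match is None or match.group(1) not in HTML_TO_TEI:
        return None
    tei_name = HTML_TO_TEI[match.group(1)]
    position = int(match.group(2))  # XPath positions are 1-based
    candidates = list(tei_root.iter(f"{{{TEI_NS}}}{tei_name}"))
    if position < 1 or position > len(candidates):
        return None
    return candidates[position - 1].get(XML_ID)


sample = ET.fromstring(
    '<text xmlns="http://www.tei-c.org/ns/1.0">'
    '<head xml:id="frontmatter1_head1">A</head>'
    '<p xml:id="frontmatter1_p1">x</p>'
    '<head xml:id="frontmatter1_head2">B</head>'
    "</text>"
)
second_head = xml_id_for_html_step(sample, "/h3[2]")
out_of_range = xml_id_for_html_step(sample, "/h3[3]")
```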

mdlincoln commented 4 years ago

@raffazizzi much of the other parts of the annotation JSON are mocked data, so we can continue to update the template until it looks good. Do you think this is in a good enough state now for you to work on displaying them in the react app? Also do you still think you will have bandwidth to work on it before the end of November, as discussed in the call? Let us know if that has changed.

mdlincoln commented 4 years ago

@raffazizzi @ebeshero checking in - can you please confirm if we're in good shape to get an annotations component set up on the react site?

raffazizzi commented 4 years ago

@mdlincoln as I mentioned at the meeting yesterday, thank you so much for this work! It's looking good, but I've just noticed an issue with the XPath references past the first chunk that, unlike the others, has a simple structure with @xml:ids like preface1_p1.

The XPaths targeting chunk 2, for example, look like this one: //*[@xml:id='novel1_letter1_p1'], but the actual @xml:ids take into account the full parent chain, so they look like this: novel1_letter1_div1_p1 or even like this novel1_letter1_div1_ab1_hi1.

Is this something that would be easy to fix?
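
A mismatch like this can be detected mechanically. The following Python sketch is an assumption-laden illustration, not part of the project's code: it collects every xml:id in a TEI chunk and reports the ids referenced by RangeSelectors that do not resolve. The function name and the sample data are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
SELECTOR_ID = re.compile(r"@xml:id='([^']+)'")


def unresolved_ids(annotations, tei_xml):
    """List xml:ids that RangeSelectors point at but the TEI does not contain."""
    root = ET.fromstring(tei_xml)
    known = {el.get(XML_ID) for el in root.iter() if el.get(XML_ID)}
    missing = set()
    for anno in annotations:
        for selector in anno["target"]["selector"]:
            if selector.get("type") != "RangeSelector":
                continue
            for key in ("startSelector", "endSelector"):
                found = SELECTOR_ID.search(selector[key]["value"])
                if found and found.group(1) not in known:
                    missing.add(found.group(1))
    return sorted(missing)


chunk = (
    '<div xmlns="http://www.tei-c.org/ns/1.0" xml:id="novel1_letter1_div1">'
    '<p xml:id="novel1_letter1_div1_p1">text</p>'
    "</div>"
)
annos = [{
    "target": {"selector": [{
        "type": "RangeSelector",
        "startSelector": {"type": "XPathSelector", "value": "//*[@xml:id='novel1_letter1_p1']"},
        "endSelector": {"type": "XPathSelector", "value": "//*[@xml:id='novel1_letter1_div1_p1']"},
    }]}
}]
report = unresolved_ids(annos, chunk)
```

Running a check like this against each file in variorum-chunks would show at a glance which selectors still use the shorter id form.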

mdlincoln commented 4 years ago

@raffazizzi I'll take a look today and report back

mdlincoln commented 4 years ago

@raffazizzi so the XML ids that @ebeshero generated for me in https://github.com/FrankensteinVariorum/fv-data/tree/master/hypothesis/migration/xml-ids don't include the full chain like novel1_letter1_div1_p1 - I think this is a mismatch between the IDs in those versions Elisa made for me vs. the versions you're using in the viewer.

@ebeshero if you can make sure the IDs are consistent between the full files I'm looking at and the ones Raff is working from, this should be an easy fix. Thoughts?

raffazizzi commented 4 years ago

@ebeshero for reference these are the files the application loads: https://github.com/FrankensteinVariorum/fv-data/tree/master/variorum-chunks

mdlincoln commented 4 years ago

Also, would we like to try and map the tunneling annotations onto MS and Thomas? I'll need those files with full xml-ids added to https://github.com/FrankensteinVariorum/fv-data/tree/master/hypothesis/migration/xml-ids

ebeshero commented 4 years ago

@raffazizzi @mdlincoln Sorry--I've been in class and a noon meeting, and have class coming up again at 3pm! So I'll take a look at the code more closely later this evening. But for the moment, I'm wondering (out loud here) why these ids are different, as I'm generating both sets of them. I'm sure it won't take me long to figure out what's missing in the set I was generating just now. Here is a guess though: I'm worried that the discrepancy is to do with differences in the XML structure of the output editions for the Variorum, vs. the simpler files that serve as the basis of the separate HTML editions that the annotations team annotated.

On the other hand, reviewing this thread, it just looks like I've left out some basic stuff going all the way up the tree--in which case it should be super easy for me to fix. Fingers crossed it's the latter. As I understood it, the xml:ids for these distinct editions were just supposed to identify the XPath locations of elements in the files the annotations team worked on, so I wasn't thinking at the time about making these be identical to those we're using in the Variorum viewer. I guess they certainly should be the same across all the editions. Sorry about any confusion on this--I bet I can sort it out this evening.

mdlincoln commented 4 years ago

@ebeshero great - and it won't cause any changes in my code, so don't worry about rushing. I can continue my other work without this blocking it.

mdlincoln commented 4 years ago

@raffazizzi questions about the output format for annotations:

  1. Do you want one JSON file per witness, or put all annotations into the same file?
  2. The target.source value should be the URL of the document that the annotation points to - right now, it's just http://frankensteinvariorum.library.cmu.edu/viewer/viewer/ since the different witnesses aren't on different pages. Is that ok? Thoughts?
  3. Likewise we need to set an ID for each annotation - I can certainly generate ones based on the old Hypothesis IDs, but these are expressed as URIs in WebAnnotation data format, and I hesitate to just make up a URI that doesn't actually resolve if you ping it. Thoughts? (we could always literally serve the annotations out as individual JSON documents via the viewer site)

raffazizzi commented 4 years ago

  1. One JSON file per witness.
  2. Yeah, not sure we can do much about that for the time being (besides switching to github.io). Eventually I would like to parametrize edition, chunk, and options
  3. Yes, let's give it a more structured URI, even if it's just wishful thinking for now. Maybe something like https://frankensteinvariorum.github.io/annotations/${source}/${count} ?
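
A minimal Python sketch of answers 1 and 3 together, assuming the URI pattern suggested above: annotations are grouped per witness, each gets a sequential id, and each group is written to its own JSON file. The input shape (a list of `(witness, annotation)` pairs), the function names, and the output file naming are hypothetical.

```python
import json
import os
from collections import defaultdict

BASE = "https://frankensteinvariorum.github.io/annotations"


def assign_ids_by_witness(pairs):
    """Group (witness, annotation) pairs and give each annotation a URI id."""
    grouped = defaultdict(list)
    for witness, anno in pairs:
        count = len(grouped[witness]) + 1  # sequential per witness, from 1
        grouped[witness].append({**anno, "id": f"{BASE}/{witness}/{count}"})
    return dict(grouped)


def write_witness_files(grouped, out_dir="."):
    """Write one JSON file per witness, e.g. 1831.json (assumed naming)."""
    for witness, annos in grouped.items():
        path = os.path.join(out_dir, f"{witness}.json")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(annos, handle, indent=1, ensure_ascii=False)


grouped = assign_ids_by_witness([
    ("1818", {"type": "Annotation"}),
    ("1831", {"type": "Annotation"}),
    ("1818", {"type": "Annotation"}),
])
```

These ids will not resolve unless the annotations are eventually served at those addresses, which matches the "wishful thinking for now" caveat.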

mdlincoln commented 4 years ago

also note: for the "tunneled" annotations, I'll only provide the RangeSelector not the TextQuoteSelector since the quoted text only comes from the witness it was originally attached to.

mdlincoln commented 4 years ago

I've pushed revised results up (now including a json file for 1823 annotations "tunneled" from the others)

will update again once we get refreshed files from Elisa

ebeshero commented 4 years ago

@raffazizzi @mdlincoln Yikes. I was mystified by my XSLT b/c I was duplicating the same location flags I used to generate the full-flat XML files we send on to collateX. And I saw this commit from July 2018 in which I changed that very XSLT (that flattens these files and produces the xml:ids that we have in the viewer): https://github.com/FrankensteinVariorum/fv-collation/commit/8aafed555910de37918010495098aa3496a6c21a#diff-83ea56ef15a44bcb7d9bc667b19c300a

I must have been thinking in July 2018 to remove the XPath levels because they're redundant or something. And I seem not to have followed through with it (I obviously didn't run that XSLT after making the change, or we wouldn't have the Variorum collated XML with the ids we have now). The scary thing is I don't remember what I had in mind two years ago! Maybe I was just experimenting with the file. (I wish I could remember!) Anyway, since we are relying on that XPath information now in all our xml:ids, I'm putting it back as it was before, and regenerating the XML files. I'm 99.99% positive everything will match up now, especially since I've figured out why the ids turned out differently!

ebeshero commented 4 years ago

@raffazizzi @mdlincoln I think with this commit I've repaired our xml:ids on the hypothes.is migration XML edition files, so they match up with the ids on our Variorum collation files: https://github.com/FrankensteinVariorum/fv-data/commit/6dfb0e1e5be070f00a75263fe5107283f9040dfd

mdlincoln commented 4 years ago

Thanks @ebeshero, will re-run my scripts this morning and push the refreshed results.

mdlincoln commented 4 years ago

I've pushed updated annotations json, now also including the Thomas annotations.

@raffazizzi n.b. that I've migrated the hypothesis "tags" according to the examples shown in the W3C guide - they're siblings to the annotation comment in the body list, typed with "purpose": "tagging", e.g.

  "body": [
   {
    "type": "TextualBody",
    "purpose": "tagging",
    "value": "romance"
   },
   {
    "type": "TextualBody",
    "purpose": "tagging",
    "value": "imagination"
   },
   {
    "type": "TextualBody",
    "purpose": "tagging",
    "value": "jk"
   },
   {
    "type": "TextualBody",
    "value": "Like Robert Walton's love for poetry, Henry Clerval's love for books of chivalry and romance makes him sociable and open to domestic affections, unlike Victor.  Victor will later regret that he did not have Henry's or Victor's orientation to languages and poetry at the most critical moments of his life.",
    "creator": "https://hypothes.is/users/frankensteinvariorum",
    "modified": "2019-10-04T17:16:51.923291+00:00",
    "purpose": "commenting"
   }
  ]
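
The body list above could be assembled along these lines. This Python sketch is illustrative only, not the generation script itself; the function name and its parameters are hypothetical.

```python
def build_bodies(comment, tags, creator=None, modified=None):
    """Return tag bodies followed by the comment body, as in the example."""
    bodies = [
        {"type": "TextualBody", "purpose": "tagging", "value": tag}
        for tag in tags
    ]
    comment_body = {"type": "TextualBody", "value": comment}
    if creator is not None:
        comment_body["creator"] = creator
    if modified is not None:
        comment_body["modified"] = modified
    comment_body["purpose"] = "commenting"
    bodies.append(comment_body)
    return bodies


bodies = build_bodies(
    "An example comment.",
    ["romance", "imagination"],
    creator="https://hypothes.is/users/frankensteinvariorum",
)
```

A consumer can then separate tags from commentary by filtering the list on `purpose`.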

raffazizzi commented 4 years ago

@mdlincoln and @ebeshero this looks good and everything seems to work now. Thanks both. I renamed the file with Thomas annotations to match the internal ID the app has been using (that's Thomas instead of Thom).

mdlincoln commented 4 years ago

noted! I've adjusted the generation script to account for that exception.