ProjectMirador / mirador

An open-source, web-based 'multi-up' viewer that supports zoom-pan-rotate functionality, ability to display/compare simple images, and images with annotations.
https://projectmirador.org
Apache License 2.0

Transcription Use Case (Princeton) #585

Open aeschylus opened 9 years ago

aeschylus commented 9 years ago

No display requirements specific for the MOOC/course

References

benwbrum commented 9 years ago

I'm very interested in finding a good example/definition of a full-page transcript expressed as Open Annotation. It should be possible to take page-specific plain-text transcripts produced by DIYHistory and convert them into annotation lists. (I'd also like to use the same kind of transformation to build a new transcript exporter for FromThePage.)

I haven't seen any good examples of, say, a three-page document expressed as OA with each page's transcript as a single annotation. (That said, I'm entirely new to OA and JSON-LD, so there's a whole world of things I haven't seen.) Such a document could be used for display by e.g. Mirador as well as production by transcription tools.

Transcript of related conversation with @aeschylus:

azaroth42 commented 9 years ago

A full page transcription, where you don't know any coordinates within the page at all, would just target the canvas in the same way as an image of the full object targets the canvas :)

Rob
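A minimal sketch of what that could look like for a three-page document, with one annotation list per canvas and one whole-page transcript per list. It assumes IIIF Presentation 2.0 conventions (`sc:AnnotationList`, `cnt:ContentAsText`, `on`); the function name `page_transcript_list`, the sample text, and every `example.org` URI are placeholders, not anything taken from Mirador, DIYHistory, or FromThePage.

```python
import json


def page_transcript_list(list_uri, canvas_uri, text):
    """Build an annotation list holding one whole-page transcript.

    The annotation targets the canvas URI directly, with no selector
    or fragment, so it applies to the whole canvas.
    """
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": list_uri,
        "@type": "sc:AnnotationList",
        "resources": [
            {
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "resource": {
                    "@type": "cnt:ContentAsText",
                    "format": "text/plain",
                    "chars": text,
                },
                "on": canvas_uri,
            }
        ],
    }


# Hypothetical three-page document; all URIs are placeholders.
pages = ["Text of page one.", "Text of page two.", "Text of page three."]
lists = [
    page_transcript_list(
        f"http://example.org/iiif/doc/list/p{n}",
        f"http://example.org/iiif/doc/canvas/p{n}",
        text,
    )
    for n, text in enumerate(pages, start=1)
]
print(json.dumps(lists[0], indent=2))
```

Each canvas in the manifest would then point at its list through `otherContent`.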

On Wed, Aug 19, 2015 at 2:42 PM, Ben W. Brumfield notifications@github.com wrote:


Transcript of related conversation with @aeschylus https://github.com/aeschylus:

  • @benwbrum https://github.com/benwbrum Actually, what I'm looking for is an example .json file that contains a transcript for a single page as an annotation, not the way Mirador will render it.
  • @aeschylus https://github.com/aeschylus The question is whether, in OA terms, the transcription target is the whole canvas, a selector whose parameters are 0,0,width,height, or what.
  • I see.
  • @benwbrum https://github.com/benwbrum I'm trying to wrap my head around the data model
  • Gotcha.
  • @aeschylus https://github.com/aeschylus I think we all are to some extent. There are several ways to do most things.
  • @benwbrum https://github.com/benwbrum And may try to build the FromThePage -> OA exporter that was suggested on today's call.
  • @aeschylus https://github.com/aeschylus I can generate one in Mirador really quick, but I’m not sure it will be in the format that ultimately shakes out. I would post to the list. Rob answers these kinds of questions quite quickly.
  • It would also help if you added your question to this ticket: IIIF/mirador#585 https://github.com/IIIF/mirador/issues/585
  • Also: http://iiif.io/api/presentation/2.0/example/fixtures/43/manifest.json
  • and: http://iiif.io/api/presentation/2.0/example/fixtures/45/manifest.json
  • These are from the official fixture objects collection.
  • http://iiif.io/api/presentation/2.0/example/fixtures/collection.json
  • @benwbrum https://github.com/benwbrum Are you familiar with the background on the DIYHistory -> OA story?
  • That's something I'm working with as well.
  • @aeschylus https://github.com/aeschylus Hm, not sure what you mean by background. DIYHistory was/is a prominent crowdsourcing project, and OA is an official global W3C spec for annotation.
  • There are quite a few people who have been working on annotations/linked data.
  • @benwbrum https://github.com/benwbrum I see that you logged that issue, which mentions DIYHistory and a MOOC
  • @aeschylus https://github.com/aeschylus But Rob Sanderson is an editor of the OA spec, as well as IIIF spec.
  • Yeah, the transcription sprint on Mirador is a collaboration with Princeton and Yale.
  • Princeton’s use case involves replacing/augmenting the front end used in a MOOC there, which is currently using DIYHistory.
  • @benwbrum https://github.com/benwbrum So the reason I ask is that, in addition to trying to figure out how to insert my own FromThePage transcription tool into the IIIF/Shared Canvas ecosystem, I'm also looking at ways to take the outputs of other transcription tools (particularly DIYHistory) which produce plaintext transcripts, to transform the product into something that works with OA
  • Obviously there would probably need to be some post-processing from the Omeka export for DIYHistory to manage it.
  • @aeschylus https://github.com/aeschylus That’s very interesting. It would be great if DIYHistory could export OA/an exporter could be written.
  • But the target output data model would likely be the same for both systems.
  • Or the Omeka plugin could be modified.
  • @benwbrum https://github.com/benwbrum Right.
  • I've been talking with Matthew Butler about the postprocessing UIowa does to get DIYHistory (i.e. Omeka) outputs into ContentDM
  • There's a lot of offline conversion that happens.
  • @aeschylus https://github.com/aeschylus Shaun Ellis (@sdellis https://github.com/sdellis on this channel) would be the one to ask.
  • A great demo/proof of concept would be to let FromThePage transcriptions be exported into Mirador, and vice versa.
  • @benwbrum https://github.com/benwbrum I know that it should also be possible to convert a plaintext transcript into word-level (or line-level) annotations on zones of facsimile--Desmond Schmidt did that for TranscribeBHL via some computer vision/matching--but that's sort of overkill. Page-transcript/page-facsimile annotation should be a good starting place.
  • Agreed on the Mirador<->FromThePage proof-of-concept.
  • I'd need to think a lot about how to ingest the transcript annotations produced by Mirador's transcription code and convert them into wikitext. Not impossible, as I already do that for DjVu files, but it really depends on the data model.
  • @aeschylus https://github.com/aeschylus Yeah, obviously the display can be much nicer and more useful if there is actual data. We have that problem a lot. The best would be some way to compose a full-page transcript from word-level OCR data.
  • Should be pretty straightforward.
  • @benwbrum https://github.com/benwbrum Oh -- yes. That's something I'm doing at https://github.com/benwbrum/fromthepage/blob/master/app/models/ia_work.rb#L157-L175
  • @aeschylus https://github.com/aeschylus Nice. Often we just don’t have good enough data to drive a nice display, but that is very good to see.
  • @benwbrum https://github.com/benwbrum If IIIF support ever happens in the Internet Archive, it would be very straightforward to do an Internet Archive -> DjVu2OA -> Mirador demo. (Once we write a DjVu2OA, that is.)
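The idea raised in the conversation, composing a full-page transcript from word-level OCR data, might be sketched roughly as below. This is not the FromThePage code linked above: the `Word` type, the line-grouping rule, and the sample boxes are all assumptions made for illustration, and real OCR output (skewed lines, columns, marginalia) would need something more careful.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """A hypothetical word-level OCR result: pixel box plus text."""
    x: int
    y: int
    w: int
    h: int
    text: str


def compose_page_text(words):
    """Join word boxes into lines top to bottom, each line left to right."""
    lines = []  # each entry is the list of words on one line
    for word in sorted(words, key=lambda wd: wd.y + wd.h / 2):
        center = word.y + word.h / 2
        last = lines[-1] if lines else None
        # Assumption: a word joins the current line when its vertical
        # centre falls inside the vertical extent of that line's first word.
        if last and last[0].y <= center <= last[0].y + last[0].h:
            last.append(word)
        else:
            lines.append([word])
    return "\n".join(
        " ".join(wd.text for wd in sorted(line, key=lambda wd: wd.x))
        for line in lines
    )


words = [
    Word(120, 12, 60, 20, "Sir,"),
    Word(10, 10, 100, 20, "Dear"),
    Word(10, 50, 40, 20, "I"),
    Word(60, 52, 90, 20, "write"),
]
print(compose_page_text(words))
```

The resulting string could then be carried as the body of a single whole-page annotation.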


Rob Sanderson Information Standards Advocate Digital Library Systems and Services Stanford, CA 94305

tomcrane commented 8 years ago

Hi Mirador folks, @benwbrum -

I found this discussion while trying to work out why my transcriptions didn't work. I have some whole-page transcriptions that I think match what Ben describes, but they weren't showing up in Mirador. So I made a test manifest that always had an xywh fragment in all referenced annos, and that one does work.

Compare:

The former has one anno list per canvas with just one anno in each, like this:

https://tomcrane.github.io/crick-annotations/p1-transcription.json

There is no fragment selector, which trips up this code in Mirador:

https://github.com/IIIF/mirador/blob/master/js/src/annotations/osd-canvas-renderer.js#L25

As a quick workaround, maybe Mirador could assume a region the same size as the canvas if there is no xywh fragment?
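A rough sketch of that fallback, written in Python rather than as a patch to Mirador's JavaScript renderer. The function name `target_region` and the sample URIs are made up for illustration, and only the plain integer pixel form of `#xywh=` is handled.

```python
import re

# Matches a trailing media fragment such as "#xywh=10,20,300,400".
XYWH = re.compile(r"#xywh=(\d+),(\d+),(\d+),(\d+)$")


def target_region(on, canvas_width, canvas_height):
    """Return (x, y, w, h) for an annotation target URI.

    When the target has no xywh fragment, assume a region the same
    size as the canvas instead of failing.
    """
    match = XYWH.search(on)
    if match:
        return tuple(int(n) for n in match.groups())
    return (0, 0, canvas_width, canvas_height)


# Placeholder canvas URI and dimensions.
print(target_region("http://example.org/canvas/p1#xywh=10,20,300,400", 1000, 1500))
print(target_region("http://example.org/canvas/p1", 1000, 1500))
```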

There are some other transcription examples here, as described in the readme: https://github.com/tomcrane/crick-annotations

Having seen FromThePage, I think one contentAsText annotation per canvas would be the norm; I had this in mind when producing the one-per-page transcriptions here (by hand in a text editor).

Tom

benwbrum commented 8 years ago

Hi, @tomcrane

We made a bit more progress on this front during the Philadelphia hackathon, though I haven't managed to write up the results. I regret that, now that you've run into some of the same issues we did.

FromThePage now produces IIIF collection manifests for images it originates: http://fromthepage.com/iiif/collections A good, small, real-world example is this manifest from a 4-page letter transcribed by a volunteer a couple of years ago: http://fromthepage.com/iiif/99/manifest You can see the full-page, plain-text transcript for a single page at http://fromthepage.com/iiif/99/manifest/4532/list

We did run into the problem you described at the hackathon, but as @aeschylus was there, he suggested that I switch the annotation target from the canvas itself to a full-size region of the canvas. It's a bit of a hack but it does work.
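On the producing side, that hack amounts to appending a full-canvas `xywh` fragment to the target. A minimal sketch, assuming the canvas is available as a dict with the `@id`, `width`, and `height` keys of a IIIF Presentation 2.0 canvas; the function name and the sample canvas are hypothetical, and this is not the FromThePage implementation.

```python
def full_canvas_target(canvas):
    """Return the canvas URI with an xywh fragment covering the whole canvas."""
    return f'{canvas["@id"]}#xywh=0,0,{canvas["width"]},{canvas["height"]}'


# Placeholder canvas; the URI and dimensions are made up.
canvas = {"@id": "http://example.org/iiif/doc/canvas/p1", "width": 2000, "height": 3000}
print(full_canvas_target(canvas))
```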

I am extremely dissatisfied with contentAsText as the only way to communicate transcripts between FromThePage and a client. If you compare the transcript in the annotationList with the presentation of the same page you'll see that a lot of the semantic mark-up identifying people mentioned within the text has disappeared, as have the genetic features of the text like underlines. The comments made by the transcriber also aren't present in the annotationList, though I suppose I could include them as another annotation, perhaps with a different @type attribute. (There are also challenges presenting an entire page of text as a single annotation within the demo version of Mirador, but I gather that progress is being made there already in another branch.)

I'm not working on this at present, but I hope to return to this in the next few weeks. In October, I switched development efforts to the IIIF-client features of FromThePage, and will need to circle back to figure out how to re-present IIIF manifests originating from another site, but transcribed in FromThePage.

I'm very interested in your example transcripts. Are you ingesting them into UV for display/search-within? If so, I'd love to sync up.

rsinghal commented 8 years ago

@tomcrane - I think your suggestion regarding a missing region makes sense

tomcrane commented 8 years ago

Hi Ben,

My understanding of cnt:ContentAsText is that it does not require a format of text/plain. The chars property could hold markup:

http://www.openannotation.org/spec/core/core.html#BodyEmbed

If known, the media type of the body SHOULD be given using the dc:format property, for example to distinguish between embedded comments in plain text versus those encoded in HTML. As above, the dctypes:Text class MAY also be assigned along with the cnt:ContentAsText class, as there could be other uses of cnt:ContentAsText that encode resources with content other than plain text.

That is, the chars literal could be text/html.
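As an illustration only, an embedded body carrying marked-up text might be assembled like this. The key names assume the IIIF Presentation 2.0 context (`format` for dc:format, `chars` for cnt:chars); the function name and the sample markup, including the `class="person"` span, are invented for the example.

```python
def html_transcript_body(markup):
    """Wrap an HTML transcript as an embedded OA body.

    The media type is stated explicitly so a client can tell this
    apart from a plain-text body.
    """
    return {
        "@type": ["cnt:ContentAsText", "dctypes:Text"],
        "format": "text/html",
        "chars": markup,
    }


# Hypothetical markup preserving an underline and a person mention.
body = html_transcript_body(
    '<p>Dear <u>Sir</u>, I met <span class="person">Mr. Smith</span> today.</p>'
)
print(body["format"])
```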

Ultimately this comes from http://www.w3.org/TR/Content-in-RDF10/#ContentAsTextClass

However, in this post @azaroth42 says:

We went with Content in RDF as it seemed at the time to have some legs and fulfilled our requirements. In the Annotation working group, we've minted our own class, as it's clear that CNT is abandoned and will never reach recommendation status.

... and look at this annotation body from the latest version of the W3C Web Annotation Data Model:

{
  "@id": "http://example.org/anno9",
  "@type": "Annotation",
  "body": {
    "@type": "TextualBody",
    "text": "<p>Comment text</p>",
    "format": "text/html",
    "language": "en"
  },
  "target": "http://example.org/photo1"
}

Your transcription annotation could have an HTML body, to reproduce http://fromthepage.com/display/display_page?page_id=4532, and further annotations on the transcription annotation could (I think) identify the people mentioned in the text, although I'm not sure the annotation body could be embedded if you wanted to convey this as separate annotations.

However, this isn't the model that IIIF uses today, and we need something that works well for your use case.

@azaroth42 - maybe the 2.1 spec should include an annotation example that matches Ben's requirement, as this whole page, more-than-plain-text transcription annotation looks like a key use case.

azaroth42 commented 8 years ago

That's all correct. Both Content in RDF, and the replacement syntax in the WG, support arbitrary formats. Thus text/html is fine for both comment and transcription annotations.
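Since both syntaxes carry the media type alongside the text, moving a body from one to the other is mostly a renaming of keys. A sketch under assumptions: the input uses the `chars`/`format` keys of the Content in RDF style body, and the output uses the `text` key shown in the draft quoted earlier in this thread (the final W3C Recommendation names that key `value`). The function name is hypothetical.

```python
def to_textual_body(oa_body, language=None):
    """Convert an embedded cnt:ContentAsText body to a draft TextualBody."""
    body = {"@type": "TextualBody", "text": oa_body["chars"]}
    # Carry the media type over unchanged when the source states one.
    if "format" in oa_body:
        body["format"] = oa_body["format"]
    if language:
        body["language"] = language
    return body


oa_body = {
    "@type": "cnt:ContentAsText",
    "format": "text/html",
    "chars": "<p>Comment text</p>",
}
print(to_textual_body(oa_body, language="en"))
```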

Regarding examples, it would be hard to capture all of the possibilities in the spec. I'd encourage further use and publication of, and additions to, the collection of fixture objects. Or creation of a cookbook of use cases with examples.