OregonDigital / OD2

Next generation of Oregon Digital ( https://oregondigital.org ) digital collections platform, built on Samvera Hyrax ( https://github.com/samvera/hyrax/ )

Editing OCR after ingest #1513

Open jsimic opened 3 years ago

jsimic commented 3 years ago

Descriptive summary

The ability to edit OCR post-ingest will allow mistakes to be corrected, which improves searching and provides accurate transcripts. It could also be used for small transcription or crowdsourcing projects under the supervision of a curator.

Expected behavior

Authorized users (curators, depositors, admins) are able to access, edit, and save the OCR for any object. The corrected OCR text is made available for display and download.

Accessibility Concerns

Accurate OCR is key for accessibility.

wickr commented 3 years ago

Corey noted that since the OCR is stored in hOCR format ( https://en.wikipedia.org/wiki/HOCR ), directly editing the text in a large textbox would be tricky. A visual editor would be much easier to use.
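
To illustrate why a plain textbox is awkward for hOCR, here is a minimal sketch in Python. The hOCR fragment is hand-written for illustration and is not taken from OD2; the helper names `correct_word` and `plain_text` are hypothetical. Each word sits in its own element whose `title` attribute carries the bounding box, so a correction has to change only the word text and leave the markup around it intact.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written hOCR fragment (not real OD2 data).
# Every word carries its page coordinates in the title attribute.
HOCR = """<div class="ocr_page" title="bbox 0 0 1000 1400">
  <span class="ocr_line" title="bbox 100 200 900 240">
    <span class="ocrx_word" id="word_1" title="bbox 100 200 260 240; x_wconf 91">Oregon</span>
    <span class="ocrx_word" id="word_2" title="bbox 280 200 460 240; x_wconf 54">Digita1</span>
  </span>
</div>"""


def correct_word(root, word_id, new_text):
    """Replace the text of one ocrx_word, leaving its title (bbox) untouched."""
    for span in root.iter("span"):
        if span.get("class") == "ocrx_word" and span.get("id") == word_id:
            span.text = new_text
            return True
    return False


def plain_text(root):
    """Join word texts line by line, as a transcript or search index might."""
    lines = []
    for line in root.iter("span"):
        if line.get("class") == "ocr_line":
            words = [
                w.text or ""
                for w in line.iter("span")
                if w.get("class") == "ocrx_word"
            ]
            lines.append(" ".join(words))
    return "\n".join(lines)


root = ET.fromstring(HOCR)
correct_word(root, "word_2", "Digital")
print(plain_text(root))
```

A visual editor effectively does this per word on the user's behalf, which is why it is less error-prone than exposing the raw markup.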

Possible hOCR visual editors:

jsimic commented 3 years ago

POSM has reviewed and would like a list of user requirements from Metadeities to inform the selection of an editor.

KevinJonesMeta commented 1 year ago

Metadeities discussed and would like the following:

  1. a side-by-side editor view showing the text in the context of the document alongside the OCR'd text, as shown in both example editors above
  2. OCR editing available to Reviewer level users and above
  3. OCR text is viewable in editor to Depositor level users
  4. Changes to OCR need to be logged on the work, like other changes
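
As a rough sketch of how requirements 2 to 4 could fit together, the Python below models view and edit permissions plus a change log. It is illustrative only and is not OD2 or Hyrax code: the role ordering in `ROLE_RANK`, the `Work` class, and the `edit_ocr` function are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Assumed role ordering, lowest to highest. The thread only says
# "Depositor level" may view and "Reviewer level and above" may edit.
ROLE_RANK = {"depositor": 1, "reviewer": 2, "admin": 3}


def can_view_ocr(role):
    """Depositor level and above may see the OCR in the editor."""
    return ROLE_RANK.get(role, 0) >= ROLE_RANK["depositor"]


def can_edit_ocr(role):
    """Reviewer level and above may change the OCR."""
    return ROLE_RANK.get(role, 0) >= ROLE_RANK["reviewer"]


@dataclass
class Work:
    """Hypothetical stand-in for a work with OCR text and a change log."""
    ocr_text: str
    change_log: list = field(default_factory=list)


def edit_ocr(work, user, role, new_text):
    """Apply an OCR edit and record it on the work, or refuse."""
    if not can_edit_ocr(role):
        raise PermissionError(f"{role} may not edit OCR")
    work.change_log.append({
        "user": user,
        "field": "ocr_text",
        "before": work.ocr_text,
        "after": new_text,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    work.ocr_text = new_text


work = Work(ocr_text="Oregon Digita1")
edit_ocr(work, "some_reviewer", "reviewer", "Oregon Digital")
print(work.ocr_text, len(work.change_log))
```

Keeping the previous text in each log entry is one possible design choice; it would let a curator review or revert an edit.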

Contingent asks:

  1. If editing the text hierarchy of a document and/or editing blank elements in a document is useful for supporting the accessibility features of OD, then we would like the editor to support those edits (see editor example 2 above)
  2. If OD ingests OCR completed outside of OD prior to ingest, that OCR needs to be editable as well

Question for Features:

  1. Does OD ingest OCR completed prior to ingest? Accessibility features added to PDFs before ingest can be robust, and we would like to preserve those features at ingest if we don't already.