brobertson / Lace2

In-browser OCR editing program that transforms OCR results into structured, citable TEI. No XML experience required!
http://trylace.org
GNU General Public License v3.0
exist-db ocr tei-xml

Lace2: From OCR to TEI

(A complete manual is available in Google Docs format.)

Designed for the large-scale scholarly digitization of primary texts, Lace is a GUI-based OCR editing suite with a difference: it outputs structured, citable TEI Simple, bridging the gap between OCR’s page-based layout and a publication-ready document without the proofreader/editor ever confronting XML data.

Lace’s in-browser editing environment, comprising a page image and a facing OCR transcription, makes three operations possible. First, a proofreader may verify the OCR text, aided by an adjacent popup showing the image of the word being checked. Second, she may draw rectangular zones on the page image. These correspond to the functional regions of the page, such as ‘translation’, ‘commentary’ and ‘primary text’, and also indicate the proper reading order. Finally, a GUI widget allows her to place a citation within the text of these zones. Internally, citations are CTS-URNs, but the widget’s type-ahead form field allows the proofreader to search by author and title.
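
To make the internal citation format concrete, here is a minimal Python sketch of how a CTS-URN breaks down into its parts. It is an illustration only, not Lace's own code (Lace itself works in XQuery); the names `CtsUrn`, `parse_cts_urn` and `passage_levels` are hypothetical, and the sketch ignores passage ranges and subreferences.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, List


@dataclass(frozen=True)
class CtsUrn:
    """Hypothetical container for the parts of a CTS-URN."""
    namespace: str            # e.g. "greekLit"
    work: Tuple[str, ...]     # dotted work component, e.g. ("tlg0012", "tlg001")
    passage: Optional[str]    # e.g. "1.1", or None when the URN names a whole work


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split 'urn:cts:<namespace>:<work>[:<passage>]' into its components."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work = parts[2], parts[3]
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(namespace, tuple(work.split(".")), passage)


def passage_levels(urn: CtsUrn) -> List[str]:
    """Split a single passage such as '1.23' into its hierarchy ['1', '23']."""
    if urn.passage is None:
        return []
    if "-" in urn.passage:
        # Assumption of this sketch: ranges like '1.1-1.10' are out of scope.
        raise ValueError("range passages are not handled in this sketch")
    return urn.passage.split(".")


if __name__ == "__main__":
    iliad_1_1 = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001:1.1")
    print(iliad_1_1.work, passage_levels(iliad_1_1))
```

The passage hierarchy returned by `passage_levels` (book, then line, in this example) is the kind of structure that a citation scheme imposes on the text.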

Combining these data through XQuery scripts, Lace generates a TEI Simple document which, for each of these zones, collects all the text across every page. It transforms the citations into nested div elements which reflect the hierarchy of the citation system. Because citations can be applied to all zones of the pages, the correlations between, for instance, primary text and translation are indicated in the output document. Furthermore, page break (<pb/>) milestones are retained in every zone, and a line mode is available in which line break (<lb/>) milestones are also included and OCR dehyphenation is not applied. In this way, the proofreader converts page-based OCR data into a publication-ready TEI document without needing any understanding of XML.
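
The step from flat citations to nested div elements can be sketched as follows. This is a simplified Python illustration, not Lace's XQuery implementation: the function name `build_nested_divs`, the input shape (a list of passage/text pairs in reading order) and the `type="textpart"` attribute are all assumptions made for the example, and page and line milestones are left out.

```python
import xml.etree.ElementTree as ET


def build_nested_divs(cited_chunks):
    """Turn [(passage, text), ...] in reading order into nested <div> elements.

    A passage such as "1.2" yields <div n="1"><div n="2">...</div></div>.
    Consecutive chunks that share a leading level reuse the same outer div.
    """
    body = ET.Element("body")
    for passage, text in cited_chunks:
        parent = body
        for level in passage.split("."):
            # Reuse the most recent child if it is a div with the same n.
            last = parent[-1] if len(parent) else None
            if last is not None and last.tag == "div" and last.get("n") == level:
                parent = last
            else:
                # "textpart" is an assumed attribute value, chosen for illustration.
                parent = ET.SubElement(parent, "div", {"type": "textpart", "n": level})
        paragraph = ET.SubElement(parent, "p")
        paragraph.text = text
    return body


if __name__ == "__main__":
    chunks = [("1.1", "first"), ("1.2", "second"), ("2.1", "third")]
    print(ET.tostring(build_nested_divs(chunks), encoding="unicode"))
```

Running the same function over the chunks of a translation zone that carries the same citations would give a parallel tree, which is one way to picture how the correlation between primary text and translation survives in the output.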

Lace is more than a TEI-generating program, though. It produces zip files of OCR training data from verified words. With this, an operator can bootstrap the OCR of a previously intractable script or font by editing, say, five pages of poorly OCR’d text, then re-processing the entire volume with a classifier generated from these pages. Lace will retain those five corrected pages, allowing proofreaders to continue with the rest of the text. Lace also provides a Lucene-based search function which identifies its results by citation reference where possible.
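
As a rough picture of what a training-data archive built from verified words might contain, the following Python sketch pairs each word image with its corrected transcription and writes both into a zip file in memory. The layout is an assumption, not Lace's documented format: the function name `pack_training_data`, the word identifiers, and the `.png` / `.gt.txt` file naming (a convention used by several OCR training tools) are chosen only for illustration.

```python
import io
import zipfile


def pack_training_data(verified_words):
    """Build a zip archive (as bytes) from verified word images and their text.

    verified_words: iterable of (word_id, image_bytes, corrected_text).
    Each word contributes '<word_id>.png' and '<word_id>.gt.txt' (assumed names).
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for word_id, image_bytes, corrected_text in verified_words:
            archive.writestr(f"{word_id}.png", image_bytes)
            # Store the transcription as UTF-8 so polytonic Greek survives intact.
            archive.writestr(f"{word_id}.gt.txt", corrected_text.encode("utf-8"))
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder bytes stand in for a real cropped word image.
    sample = [("page0001_word0001", b"\x89PNG placeholder", "μῆνιν")]
    with open("training_sample.zip", "wb") as handle:
        handle.write(pack_training_data(sample))
```

Only words a proofreader has verified would be passed in, so the archive stays a trustworthy ground truth for retraining.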

Lace is built upon the well-established eXist-db XML database: it and its OCR data are installed as easily managed packages through eXist’s drag-and-drop interface. Lace is an open-source project: its source code and compiled modules are stored in an active GitHub repository, and a site for exploring its functions is offered at http://trylace.org. Lace is a proven platform: the majority of the 24 million words in Open Greek and Latin’s First Thousand Years of Greek project were edited with Lace.

Lace-2 Tools is a separate repository for Lace-related code, especially pre-processing.

Bruce Robertson

2020-06-29