We will show how to build a DeezyMatch model from token-level alignments between OCRed text and its manual corrections. We use the aligned tokens generated in [6] from a corpus of OCRed newspaper texts (from the National Library of Australia's Trove digitized newspaper collection) that are aligned with corrections made by volunteers [5]. We will train a DeezyMatch model that learns OCR transformations from this newspaper data, and then show how it can be used to find a match for a given OCRed query among a pool of candidates drawn from a specific knowledge base.
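As a rough illustration of the data-preparation step, the sketch below turns aligned (OCRed token, corrected token) pairs into labelled string pairs, one per line, in the tab-separated `string1<TAB>string2<TAB>TRUE/FALSE` layout that DeezyMatch reads as a training dataset. The token pairs are made-up toy examples, not taken from the Trove corpus, and the helper names `build_pairs` and `to_tsv` are hypothetical. The way negative examples are produced here (pairing each OCRed token with a randomly chosen different correction) is an assumption for illustration only, not necessarily the sampling used in [6].

```python
import random


def build_pairs(aligned, seed=0):
    """Turn (ocr, corrected) alignments into labelled string pairs.

    Each alignment yields one positive pair (label "TRUE") and, when another
    correction is available, one negative pair (label "FALSE").
    The negative-sampling strategy is an illustrative assumption.
    """
    rng = random.Random(seed)
    corrections = sorted({corrected for _, corrected in aligned})
    rows = []
    for ocr, corrected in aligned:
        rows.append((ocr, corrected, "TRUE"))
        # Candidate negatives: any known correction other than the right one.
        others = [c for c in corrections if c != corrected]
        if others:
            rows.append((ocr, rng.choice(others), "FALSE"))
    return rows


def to_tsv(rows):
    """Serialise rows as tab-separated lines: string1, string2, label."""
    return "".join(f"{a}\t{b}\t{label}\n" for a, b, label in rows)


# Toy alignments (hypothetical, for illustration only).
aligned = [
    ("Sydnev", "Sydney"),
    ("Melbourue", "Melbourne"),
    ("Bri8bane", "Brisbane"),
]

rows = build_pairs(aligned)
tsv_text = to_tsv(rows)
print(tsv_text, end="")
```

Saved to a file, text like this could serve as the dataset for DeezyMatch's training step. The exact training call and its configuration file are best taken from the DeezyMatch documentation rather than from this sketch.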
Prepare tutorial on using DeezyMatch for OCR: https://dh2022.adho.org/workshops-and-tutorials/wt-13