Living-with-machines / DeezyMatch

A Flexible Deep Learning Approach to Fuzzy String Matching
https://living-with-machines.github.io/DeezyMatch/

Add OCR tutorial for DH2022 #124

Open mcollardanuy opened 2 years ago

mcollardanuy commented 2 years ago

Prepare tutorial on using DeezyMatch for OCR: https://dh2022.adho.org/workshops-and-tutorials/wt-13

We will show how to create a DeezyMatch model from token-level alignments of OCRed text and its manual corrections. We will use the aligned tokens generated in [6] from a corpus of OCRed newspaper texts (from the National Library of Australia Trove digitized newspaper collection) aligned with human corrections made by volunteers [5]. We will show how to train a DeezyMatch model that learns OCR transformations from this newspaper data, and how to use it to find a match for a given OCRed query from a pool of potential candidates drawn from a specific knowledge base.
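
As a starting point for the tutorial, here is a minimal sketch of how aligned tokens could be turned into labelled string pairs for training. It makes several assumptions that are not confirmed in this issue: that the training data is a tab-separated file with one `string1<TAB>string2<TAB>TRUE/FALSE` row per pair, that negative pairs are produced by pairing an OCRed token with a randomly chosen different correction, and that the alignments are available as `(ocr_token, corrected_token)` tuples. The helper names `build_pairs` and `to_tsv` and the sample tokens are hypothetical, and no DeezyMatch function is called here.

```python
import csv
import io
import random


def build_pairs(aligned, seed=0):
    """Turn (ocr_token, corrected_token) alignments into labelled pairs.

    Each alignment yields one positive pair (label "TRUE") and, when another
    corrected token exists, one negative pair (label "FALSE") made by pairing
    the OCRed token with a different correction. Negative sampling by random
    choice is an assumption of this sketch, not the tutorial's method.
    """
    rng = random.Random(seed)
    corrected_vocab = sorted({corrected for _, corrected in aligned})
    rows = []
    for ocr, corrected in aligned:
        rows.append((ocr, corrected, "TRUE"))
        negatives = [c for c in corrected_vocab if c != corrected]
        if negatives:
            rows.append((ocr, rng.choice(negatives), "FALSE"))
    return rows


def to_tsv(rows):
    """Serialise rows as tab-separated text, one pair per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample alignments (not taken from the Trove corpus).
aligned_tokens = [
    ("Melboume", "Melbourne"),
    ("Sydncy", "Sydney"),
    ("tbe", "the"),
]

pairs = build_pairs(aligned_tokens)
dataset_tsv = to_tsv(pairs)
print(dataset_tsv)
```

The resulting text could be written to a file and used as the training dataset, provided the assumed column layout matches what the model expects. A fixed seed keeps the negative pairs reproducible between runs.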