As an NLP expert, I want to review the text corpus of the PPA test set, the existing preprocessing code, the decision log, and existing literature on OCR quality so that I can offer a recommendation on how the team should preprocess the text corpus moving forward.

jerielizabeth commented 4 months ago

Dependent on info from:

- existing code base
- newest text version
- log of past decisions

~Outcome - Answers to the following questions:~

~- [ ] do we need to revise any previous decisions about preprocessing~
~- [ ] do we need to modify / create code for preprocessing~

Actual outcomes

- [x] #19
- [x] running Google doc list of specific questions/decisions we need to make regarding preprocessing
- [x] Papers tracked in a comment here and in our group Zotero library

mnaydan commented 4 months ago

Here is the decision log, which I pulled together from past meeting notes.

For reference, here is the existing preprocessing code in the cleaning.py script on the repo, which was copied and pasted from Wouter's Colab notebook.

And here is the updated (new Hathi OCR) text corpus filtered for the test set.

laurejt commented 4 months ago

Literature (see group Zotero library):

Liimatta et al. (2023). Effect of data quality on the automated identification of register features in ECCO
Lyu et al. (2021). Neural OCR Post-Hoc Correction of Historical Corpora
Todorov & Colavizza (2020). Transfer Learning for Historical Corpora: An Assessment on Post-OCR Correction and Named Entity Recognition
Nguyen et al. (2020). Neural Machine Translation with BERT for Post-OCR Error Detection and Correction
van Strien et al. (2020). Assessing the impact of OCR quality on downstream NLP tasks
Hill & Hengchen (2019). Quantifying the impact of dirty OCR on historical text analysis: ECCO as a case study
Hämäläinen et al. (2019). From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction
Nguyen et al. (2019). Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing
Hämäläinen et al. (2019). Revisiting NMT for Normalization of Early English Letters
Bollmann (2019). A Large-Scale Comparison of Historical Text Normalization Systems
Smith & Cordell (2018). A Research Agenda for Historical and Multilingual Optical Character Recognition
Schulz & Kuhn (2017). Multi-modular domain-tailored OCR post-correction
Garrette & Alpert-Abrams (2016). An Unsupervised Model of Orthographic Variation for Historical Document Transcription
Milligan (2013). Illusionary Order: Online Databases, Optical Character Recognition, and Canadian History, 1997-2010
Underwood & Auvil (2012). Basic OCR correction
Hockey (1986). OCR: The Kurzweil Data Entry Machine

Code Repositories:

NATAS: Library for historical text normalization and OCR post-correction
DataMunging: OCRNormalizer and rulesets seem of most interest
Code for Assessing the Impact of OCR Quality on Downstream NLP Tasks: Primarily for extrinsic OCR evaluation measurements
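
For orientation, here is a minimal sketch of what ruleset-driven OCR normalization can look like in general. It is not the code from DataMunging, NATAS, or cleaning.py. The rule table and the function names (`RULES`, `rejoin_hyphenated`, `apply_rules`, `normalize`) are hypothetical and only illustrate the idea of correcting long-s misreadings and rejoining words split across line breaks.

```python
import re

# Hypothetical ruleset: OCR misreading -> intended word.
# Illustrative only; a real ruleset would be far larger and corpus-specific.
RULES = {"paft": "past", "fuch": "such", "moft": "most"}


def rejoin_hyphenated(text: str) -> str:
    """Join words split across a line break, e.g. 'po-\\netry' -> 'poetry'.

    Naive assumption: every end-of-line hyphen is a line-break hyphen,
    so genuine hyphenated compounds split at that point are also joined.
    """
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def apply_rules(text: str, rules: dict = RULES) -> str:
    """Replace whole words found in the ruleset, keeping initial capitals."""

    def fix(match: re.Match) -> str:
        word = match.group(0)
        replacement = rules.get(word.lower())
        if replacement is None:
            return word
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"[A-Za-z]+", fix, text)


def normalize(text: str) -> str:
    """Rejoin line-break hyphenation first, then apply the word rules."""
    return apply_rules(rejoin_hyphenated(text))
```

Rejoining hyphenated words before applying the rules lets a rule match a word that was split across two lines. Whether either step is appropriate for this corpus is one of the open preprocessing decisions.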

Princeton-CDH / ppa-nlp, issue #14