Open-Book-Genome-Project / sequencer

A toolchain of tasks for sequencing and fingerprinting book fulltext
https://bookgenomeproject.org

Quality improvement by searching for common OCR errors (transferred from OL) #97

Open RayBB opened 1 year ago

RayBB commented 1 year ago

Original text from https://github.com/internetarchive/openlibrary/issues/810:

Sorry if this is out of place, but I just stumbled across an oddity. It appears that the Google-digitized non-English editions have some habitual OCR problems, which show up in the boilerplate Google inserted.

For instance, Googling: "carcfully scannod" site:archive.org turns up 46,900 results, most of which are scanned from texts in languages that use diacritics. That can't be a coincidence. I'm wondering if it can be put to use for quality improvement. Might they just need a fresh run through OCR with more modern software?

More discussion in the thread.
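
One way the idea could be tried on already-extracted fulltext is sketched below. This is not how the sequencer works; it is only an illustration under assumptions. It assumes the garbled query above corresponds to the boilerplate phrase "carefully scanned", and the function name `find_garbled_boilerplate` and the 0.8 similarity threshold are hypothetical choices. It slides a two-word window over the text and flags windows that are close to, but not identical to, the expected phrase.

```python
import re
from difflib import SequenceMatcher

# Assumption: the boilerplate fragment that the garbled search query corresponds to.
EXPECTED_PHRASE = "carefully scanned"


def find_garbled_boilerplate(text, expected=EXPECTED_PHRASE, threshold=0.8):
    """Return word windows that nearly match `expected` but are not identical.

    A near match suggests the boilerplate was mangled by OCR, which may
    indicate that the rest of the book's OCR is poor as well.
    """
    window = len(expected.split())
    # Lowercase and drop punctuation so "scannod." compares like "scannod".
    words = re.findall(r"\w+", text.lower())
    hits = []
    for i in range(len(words) - window + 1):
        candidate = " ".join(words[i:i + window])
        if candidate == expected:
            continue  # clean boilerplate, nothing to flag
        if SequenceMatcher(None, expected, candidate).ratio() >= threshold:
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    sample = "a book that was carcfully scannod by Google"
    print(find_garbled_boilerplate(sample))
```

A book whose boilerplate yields any hits could then be queued as a candidate for a fresh OCR pass. The threshold would need tuning against real scans, since too low a value will flag unrelated word pairs.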