dhmit / computation_hist

Archival History of the MIT Computation Center

BSD 3-Clause "New" or "Revised" License

5 stars 17 forks

Inconsistent OCR #350

Open equationcrunchor opened 5 years ago

equationcrunchor commented 5 years ago

Example:

Searching document text for "meme": http://127.0.0.1:8000/archives/doc/3_19_pmm_memo_re_709_1960_04_29_1_19 is the first result.
Looking at the PDF preview online, there is no "meme" in the text, only "memo". Highlighting the sentence "Status of programming memo and revision of machine shut-down date to late July." and copy-pasting it elsewhere gives the correct text.
Checking the OCR text in the data/processed_pdfs folder: it says "Status of programming meme", probably due to an OCR error.

Seems like PDF preview and search have different opinions on the OCR?

srisi commented 5 years ago

That's unfortunately correct. Tesseract is non-deterministic, meaning that OCRing the same document twice can lead to subtly different results (e.g. reading an "o" as an "e"). In our current pipeline, we OCR each document twice: the first time to generate the PDF, the second time to generate the text file for the search. I hadn't thought about that difference between text and PDF before, but clearly we should do it all in one pass. @samimak37: Ideas? We could use PyPDF2 to extract the text once the document has been OCRed (https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file).
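
A minimal sketch of the "do it all in one" idea, assuming Tesseract is called through its command-line interface on page images and that the installed version ships both the `pdf` and `txt` output configs (the thread does not say how ocr.py invokes Tesseract, so this is an assumption). Passing both configs to a single invocation makes one recognition pass write the searchable PDF and the plain-text file, so the two cannot disagree. The function names and paths are hypothetical.

```python
import subprocess


def build_single_pass_command(image_path, output_base):
    """Build one Tesseract call that writes output_base.pdf and output_base.txt."""
    # Assumption: the installed Tesseract accepts several output configs at once.
    return ["tesseract", image_path, output_base, "pdf", "txt"]


def ocr_once(image_path, output_base):
    """Run the single pass; raises CalledProcessError if Tesseract fails."""
    subprocess.run(build_single_pass_command(image_path, output_base), check=True)
    # Both files come from the same recognition pass.
    return output_base + ".pdf", output_base + ".txt"
```

With this layout the search index would be built from the `.txt` file that sits next to the PDF, rather than from a second OCR run.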

samimak37 commented 5 years ago

I think that's the best option at the moment. Because of how Tesseract handles the OCR, there is no way to guarantee the same result with different scans (although it is usually very consistent). I can work on the fix today.

samimak37 commented 5 years ago

I’ve been poking around with ocr.py and I’ve found a few things.

1) Tesseract apparently encodes the text within the PDF in a way that PyPDF2 cannot read. This results in a string that is nothing but newlines, and is therefore not very helpful.

2) Other packages exist that can pull text from PDF files, but most of them are very Unix-centric. I have had the most success with pdftotext, although it has Poppler as a dependency. This is not a terrible problem for macOS and Linux users, but there is no "easy" way to install Poppler on Windows.

Is this a good route to pursue? Other packages include textract and tika, but similar problems are found (it should also be noted that tika runs through a server, which massively increases runtime).
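
A rough sketch of the extraction route described above, under some assumptions: it shells out to Poppler's `pdftotext` command-line tool (not the Python package of the same name), and it treats output that is only whitespace, like the all-newlines string from point 1, as a failed extraction. The helper names are hypothetical.

```python
import shutil
import subprocess


def has_usable_text(text):
    """Return False for empty output or output that is only newlines/whitespace."""
    return bool(text.strip())


def extract_pdf_text(pdf_path):
    """Extract the embedded text layer of a PDF with Poppler's pdftotext CLI."""
    if shutil.which("pdftotext") is None:
        # Poppler is missing, which is the likely situation on Windows.
        raise RuntimeError("pdftotext (Poppler) was not found on PATH")
    # "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    if not has_usable_text(result.stdout):
        raise ValueError("no readable text layer in " + pdf_path)
    return result.stdout
```

Checking for Poppler up front at least turns the Windows installation problem into a clear error message instead of a silent empty search index.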

mscuthbert commented 5 years ago

@erica02139 has been editing some important handwritten documents' OCR text by hand to get these important documents included. How do we ensure that they don't get overwritten next time we run the OCR mechanism? I've made a few edits myself.

srisi commented 5 years ago

I've thought about that briefly. Erica's documents should be easy to track: we could add a column to the metadata sheet that's checked if we have hand-corrected text for a document, so that text doesn't get overwritten when the OCR is re-run. We could maybe even store the hand-corrected OCR in the Google Sheet. Your edits would be harder to track because (I think) they were more dispersed.
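
One way the metadata-column idea could look, as a sketch only: it assumes the sheet is exported as CSV and uses hypothetical column names `filename` and `hand_corrected`, neither of which comes from the actual metadata sheet. Documents flagged as hand-corrected are left out of the list that gets re-OCRed.

```python
import csv
import io

# Values treated as "checked" in the hypothetical hand_corrected column.
TRUTHY = {"1", "true", "yes", "x"}


def protected_filenames(metadata_file):
    """Return the set of filenames whose text was corrected by hand."""
    reader = csv.DictReader(metadata_file)
    return {
        row["filename"]
        for row in reader
        if (row.get("hand_corrected") or "").strip().lower() in TRUTHY
    }


def documents_to_reocr(all_filenames, protected):
    """Keep only the documents whose text may safely be regenerated."""
    return [name for name in all_filenames if name not in protected]


# Made-up rows standing in for a CSV export of the metadata sheet.
sample = io.StringIO("filename,hand_corrected\nmemo_a,yes\nmemo_b,\nmemo_c,no\n")
protected = protected_filenames(sample)
todo = documents_to_reocr(["memo_a", "memo_b", "memo_c"], protected)
```

This only covers edits that are recorded in the sheet; dispersed edits that nobody flagged would still be overwritten.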