Open equationcrunchor opened 5 years ago
That's unfortunately correct. Tesseract is non-deterministic, meaning that OCRing the same document twice can lead to subtly different results (e.g. reading an "o" as an "e"). In our current pipeline, we OCR each document twice: the first time to generate the PDF, the second time to generate the text file for the search. I hadn't thought about that difference between text and PDF before, but clearly we should do it all in one pass. @samimak37: Ideas? We could use PyPDF2 to extract the text once the document has been OCRed (https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file).
I think that's the best option at the moment. Because of how Tesseract handles the OCR, there is no way to guarantee the same result with different scans (although it is usually very consistent). I can work on the fix today.
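For comparison, here is a rough sketch of a different way to "do it all in one": instead of extracting text from the PDF afterwards, ask the Tesseract command line for both outputs in a single run by listing the `pdf` and `txt` config names together. This is not what the current pipeline does. The function names and paths are hypothetical, and it assumes the `tesseract` executable is on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base):
    """Build a Tesseract CLI call that writes both a PDF and a text file.

    Listing two config names ("pdf" and "txt") asks Tesseract to render
    both outputs from the same recognition run.
    """
    return ["tesseract", str(image_path), str(output_base), "pdf", "txt"]


def ocr_once(image_path, output_base):
    """Run OCR a single time and return the expected (pdf, txt) paths."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    base = Path(output_base)
    # Tesseract appends the extension to the output base name itself.
    return base.with_name(base.name + ".pdf"), base.with_name(base.name + ".txt")
```

Since both files would come from one recognition run, the searchable text and the PDF text layer should not drift apart the way two separate runs can.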
I’ve been poking around with ocr.py and I’ve found a few things.
1) Tesseract apparently encodes the text within the PDF in a way that PyPDF2 cannot read. This results in a string that is nothing but newlines, and is therefore not very helpful.
2) Other packages exist that can pull text from PDF files, but most of them are very Unix-centric. I have had the most success with pdftotext, although it has Poppler as a dependency. This is not a terrible problem for macOS and Linux users, but there is no "easy" way to install Poppler on Windows.
Is this a good route to pursue? Other packages include textract and tika, but similar problems are found (it should also be noted that tika runs through a server, which massively increases runtime).
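A minimal sketch of what the Poppler route could look like, to make the trade-off concrete. It calls Poppler's `pdftotext` command-line tool through `subprocess` rather than the Python package of the same name, and returns `None` both when Poppler is missing and when the extraction comes back as nothing but whitespace (the failure described in point 1). The function names are hypothetical.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path):
    # Passing "-" as the output file makes pdftotext write to stdout.
    return ["pdftotext", "-layout", str(pdf_path), "-"]


def is_blank(text):
    """True when extraction produced only whitespace (e.g. only newlines)."""
    return not text.strip()


def extract_text(pdf_path):
    """Return the PDF's text layer, or None if it cannot be extracted."""
    if shutil.which("pdftotext") is None:
        return None  # Poppler is not installed on this machine
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return None if is_blank(result.stdout) else result.stdout
```

Returning `None` instead of raising would let the caller decide whether to fall back to a second OCR pass on machines without Poppler.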
@erica02139 has been editing the OCR text of some important handwritten documents by hand to get them included. How do we ensure that they don't get overwritten the next time we run the OCR mechanism? I've made a few edits myself.
I've thought of that briefly. Erica's documents should be easy to track, and we could add a column to the metadata sheet that's checked if we have hand-corrected text for the document, so that text doesn't get overwritten when the OCR is rerun. We could maybe even store the hand-corrected OCR in the Google Sheet. Your edits would be harder to track because (I think) they were more dispersed.
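The metadata-column idea could be sketched roughly like this. The column name `hand_corrected`, the `doc_id` key, and the set of values that count as "checked" are all assumptions for illustration, not what the sheet currently contains.

```python
# Values that count as a checked box once the sheet is exported (assumed).
CHECKED_VALUES = {"true", "yes", "x", "1"}


def is_hand_corrected(row):
    """True if the metadata row is flagged as having hand-corrected text."""
    value = str(row.get("hand_corrected", "")).strip().lower()
    return value in CHECKED_VALUES


def documents_to_reocr(metadata_rows):
    """Return the ids of documents whose OCR text may be regenerated."""
    return [row["doc_id"] for row in metadata_rows if not is_hand_corrected(row)]


rows = [
    {"doc_id": "doc_a", "hand_corrected": "TRUE"},
    {"doc_id": "doc_b", "hand_corrected": ""},
    {"doc_id": "doc_c"},  # column missing: treated as not hand-corrected
]
safe_to_overwrite = documents_to_reocr(rows)
```

With a check like this in front of the OCR step, flagged documents would simply be skipped on every rerun.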
Example: search for "meme". http://127.0.0.1:8000/archives/doc/3_19_pmm_memo_re_709_1960_04_29_1_19 is the first result. The PDF preview does not have "meme" in the text, only "memo". Highlighting the sentence "Status of programming memo and revision of machine shut-down date to late July." and copy-pasting it elsewhere gives the correct text. The corresponding file in the data/processed_pdfs folder says "Status of programming meme", probably due to an OCR error.

Seems like the PDF preview and the search have different opinions on the OCR?
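One way to find other documents with this kind of mismatch would be a word-level diff between the text copied from the PDF and the text used for search. This is only a sketch using the standard library's `difflib`. The function name is hypothetical, and how the two strings are obtained is left open.

```python
import difflib


def word_differences(pdf_text, search_text):
    """Return (pdf_words, search_words) pairs where the two texts disagree."""
    a = pdf_text.split()
    b = search_text.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Join each differing run of words so every pair is easy to read.
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


pdf_text = "Status of programming memo and revision of machine shut-down date"
search_text = "Status of programming meme and revision of machine shut-down date"
mismatches = word_differences(pdf_text, search_text)
```

For the example above, `mismatches` is `[("memo", "meme")]`. An empty list would mean the two texts agree word for word.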