Open PTrottier opened 6 years ago
Under some OAI metadata, there is also a .pdf.txt file; it seems to contain an OCR of the .pdf. Perhaps this could help in detecting OCR as well.
See: https://docs.google.com/spreadsheets/d/1LYsmfnEH7F_l98UDMr2DgXIKqMH1_wVXFudORFqTrFk
This seems like a good guide for tesseract-ocr: http://guides.library.illinois.edu/c.php?g=347520&p=4121426
The default tesseract-ocr out of the box is fine for our immediate purposes. So I'll shortcut this by pointing at https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-to-process-multiple-images-in-a-single-run and saying "do that".
So you use pdfimages to extract the images; convert each image via ImageMagick to .pgm or .ppm format (for greyscale or colour, respectively); use unpaper to straighten the images; then use tesseract to OCR the list of files and join them all back into a single PDF.
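The four steps above could be sketched roughly as follows. This assumes pdfimages (poppler-utils), ImageMagick's `convert`, unpaper, and tesseract are all on PATH; the function name and intermediate file naming are illustrative, not part of any existing tool.

```python
import subprocess
import tempfile
from pathlib import Path

def ocr_pdf(pdf_path: str, out_base: str) -> None:
    """Sketch of the pipeline: pdfimages -> convert -> unpaper -> tesseract."""
    with tempfile.TemporaryDirectory() as tmpdir:
        tmp = Path(tmpdir)
        # 1. Extract the embedded page images from the PDF.
        subprocess.run(["pdfimages", pdf_path, str(tmp / "page")], check=True)
        cleaned = []
        for img in sorted(tmp.glob("page-*")):
            # 2. Convert to greyscale .pgm (use .ppm to keep colour).
            pgm = img.with_suffix(".pgm")
            subprocess.run(["convert", str(img), str(pgm)], check=True)
            # 3. Deskew/clean the page with unpaper.
            clean = tmp / f"clean-{pgm.name}"
            subprocess.run(["unpaper", str(pgm), str(clean)], check=True)
            cleaned.append(str(clean))
        # 4. One tesseract run over the whole list of images, with the
        #    "pdf" config, producing a single searchable "{out_base}.pdf".
        listing = tmp / "files.txt"
        listing.write_text("\n".join(cleaned) + "\n")
        subprocess.run(["tesseract", str(listing), out_base, "pdf"], check=True)
```

Passing tesseract a text file that lists one image per line is how it processes multiple images in a single run, per the FAQ linked above.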
@dbs Do you think it's fine if the detection of a searchable PDF with pdftotext is simply a check for whether pdftotext's output is an empty string?
Yes, I think the empty string is a reliable enough indication that there is no searchable text. (You might want to check the number of characters too, to ensure it's reasonable; for example, there should be more than 1,000 characters, or some threshold like that.)
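A minimal sketch of that heuristic, split so the threshold check is separate from the pdftotext call. The function names and the 1,000-character default are just the suggestion above, not anything standardized; `pdftotext file -` (poppler-utils) writes the extracted text to stdout.

```python
import subprocess

def looks_searchable(text: str, min_chars: int = 1000) -> bool:
    """Empty-string / character-count heuristic for extracted PDF text."""
    # Strip whitespace and form feeds that pdftotext emits for blank pages.
    return len(text.strip()) >= min_chars

def pdf_has_text(pdf_path: str, min_chars: int = 1000) -> bool:
    """Run pdftotext and apply the heuristic. Requires pdftotext on PATH."""
    out = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True, text=True, check=True,
    ).stdout
    return looks_searchable(out, min_chars)
```

Tune `min_chars` per corpus; image-heavy documents with short captions may need a lower threshold.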
Would we be better off getting pdfimages to generate .ppm files, thus saving us from having to use another dependency such as ImageMagick?
Sure, give it a shot. Theoretically the fewer tools in the chain, the better.
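In support of dropping that dependency: pdfimages already writes netpbm files (.ppm for colour, .pgm for greyscale, .pbm for monochrome) by default, so the conversion step can likely be skipped and the extracted files fed straight to unpaper/tesseract. A sketch of just the extraction step, with an illustrative function name:

```python
import subprocess
from pathlib import Path

def extract_page_images(pdf_path: str, out_dir: str) -> list:
    """Extract page images with pdfimages alone, no ImageMagick.

    poppler's pdfimages emits .ppm/.pgm/.pbm natively, which the rest
    of the pipeline (unpaper, tesseract) can consume directly.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(["pdfimages", pdf_path, str(out / "page")], check=True)
    return sorted(str(p) for p in out.glob("page-*"))
```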
Detection:
Using JHove and/or pdftotext
To OCR:
Using tesseract-ocr (pdfimages, unpaper, then tesseract, as described above)