PTrottier / Boojum-Web

Boojum Technical Reports Website in Flask

Automate Workflow for Verifying PDF OCR Status #4

Open PTrottier opened 6 years ago

PTrottier commented 6 years ago

Detection:

Using JHOVE and/or pdftotext

To OCR:

  1. pdfimages
  2. convert to .pbm, .pgm, or .ppm format
  3. unpaper
  4. tesseract

PTrottier commented 6 years ago

Under some OAI metadata there is also a .pdf.txt file that seems to contain an OCR of the .pdf. Perhaps this could help with detecting OCR status as well.

See: https://docs.google.com/spreadsheets/d/1LYsmfnEH7F_l98UDMr2DgXIKqMH1_wVXFudORFqTrFk

PTrottier commented 6 years ago

This seems like a good guide for tesseract-ocr: http://guides.library.illinois.edu/c.php?g=347520&p=4121426

dbs commented 6 years ago

The default tesseract-ocr out of the box is fine for our immediate purposes. So I'll shortcut this by pointing at https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-to-process-multiple-images-in-a-single-run and saying "do that".

So you use pdfimages to extract the images; convert each image via ImageMagick to .pgm or .ppm format (for greyscale or colour, respectively); use unpaper to straighten the images; then use tesseract to OCR the list of files and join them all back into a single PDF.
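
A minimal sketch of that chain in Python, in case it helps. It assumes `pdfimages`, ImageMagick's `convert`, `unpaper`, and `tesseract` are all on the PATH, and that the work directory starts out empty. The function names, the `img`/`conv`/`clean` file prefixes, and `pages.txt` are made up for illustration and are not from the repository.

```python
import subprocess
from pathlib import Path


def pdfimages_cmd(pdf, prefix):
    # pdfimages writes <prefix>-NNN.<ext>, one file per embedded image
    return ["pdfimages", str(pdf), str(prefix)]


def convert_cmd(src, dst):
    # ImageMagick picks the output format from the destination suffix
    # (newer ImageMagick installs may expose this as "magick" instead)
    return ["convert", str(src), str(dst)]


def unpaper_cmd(src, dst):
    # unpaper reads and writes netpbm files (.pbm / .pgm / .ppm)
    return ["unpaper", str(src), str(dst)]


def tesseract_cmd(list_file, out_base):
    # A text file of image paths is processed as one multi-page input;
    # the trailing "pdf" asks for <out_base>.pdf as output
    return ["tesseract", str(list_file), str(out_base), "pdf"]


def ocr_pdf(pdf, workdir, out_base, suffix=".pgm"):
    """Run the whole chain; suffix is .pgm for greyscale or .ppm for colour."""
    workdir = Path(workdir)
    subprocess.run(pdfimages_cmd(pdf, workdir / "img"), check=True)
    cleaned = []
    # sorted() materialises the listing before any new files are written
    for i, src in enumerate(sorted(workdir.glob("img-*"))):
        converted = workdir / f"conv-{i:03d}{suffix}"
        straightened = workdir / f"clean-{i:03d}{suffix}"
        subprocess.run(convert_cmd(src, converted), check=True)
        subprocess.run(unpaper_cmd(converted, straightened), check=True)
        cleaned.append(straightened)
    list_file = workdir / "pages.txt"
    list_file.write_text("".join(f"{p}\n" for p in cleaned))
    subprocess.run(tesseract_cmd(list_file, out_base), check=True)
    return Path(f"{out_base}.pdf")
```

Something like `ocr_pdf("report.pdf", "work", "report-ocr")` would then be expected to leave `report-ocr.pdf` behind. Note that pdfimages extracts embedded images, which is usually but not always one image per scanned page.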

PTrottier commented 6 years ago

@dbs Do you think it's fine if the detection of a searchable PDF with pdftotext simply checks whether the pdftotext output is an empty string?

dbs commented 6 years ago

Yes, I think the empty string is a reliable enough indication that there is no searchable text. (You might want to check the number of characters, too, to ensure it's reasonable - like, there should be more than 1,000 characters or some threshold like that)
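
A rough sketch of that check, assuming `pdftotext` is on the PATH. The function names and the `MIN_CHARS` constant are hypothetical, and 1,000 is only the ballpark threshold mentioned above, not a tuned value. Whitespace is stripped before counting because pdftotext can emit form feeds and newlines even when a page has no text, so a strict empty-string comparison might miss image-only PDFs.

```python
import subprocess

# Assumed threshold, taken from the "more than 1,000 characters" suggestion
MIN_CHARS = 1000


def looks_searchable(text, min_chars=MIN_CHARS):
    """Return True when text has more than min_chars non-whitespace characters."""
    # str.split() with no arguments drops spaces, newlines, and form feeds
    visible = "".join(text.split())
    return len(visible) > min_chars


def extract_text(pdf_path):
    # "-" as the output file makes pdftotext write the text to stdout
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def pdf_is_searchable(pdf_path, min_chars=MIN_CHARS):
    return looks_searchable(extract_text(pdf_path), min_chars)
```

Keeping `looks_searchable` separate from the subprocess call means the threshold logic can be tested without any PDFs on hand.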

PTrottier commented 6 years ago

Would we be better off getting pdfimages to generate .ppm files, thus saving us from having to use another dependency such as ImageMagick?

dbs commented 6 years ago

Sure, give it a shot. Theoretically the fewer tools in the chain, the better.
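
For what it's worth, Poppler's pdfimages writes .pbm (monochrome) or .ppm (everything else) by default when no format flag is given, so the ImageMagick step may be unnecessary. A small guard, sketched below with hypothetical names, could confirm that every extracted file is already in a format unpaper accepts before skipping the conversion.

```python
from pathlib import Path

# Netpbm formats that can go straight to unpaper without conversion
NETPBM_SUFFIXES = {".pbm", ".pgm", ".ppm"}


def needs_conversion(path):
    """Return True when the file is not already a netpbm image."""
    return Path(path).suffix.lower() not in NETPBM_SUFFIXES


def split_by_format(paths):
    """Partition extracted images into (ready, to_convert) lists."""
    ready = [p for p in paths if not needs_conversion(p)]
    to_convert = [p for p in paths if needs_conversion(p)]
    return ready, to_convert
```

If `to_convert` comes back empty for the whole collection, ImageMagick can be dropped from the chain.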