Islandora / documentation

Contains Islandora's documentation and main issue queue.

OCR image-only PDFs #1583

Open seth-shaw-unlv opened 4 years ago

seth-shaw-unlv commented 4 years ago

Stemming from an email discussion:

Hypercube currently uses pdftotext to extract text embedded in a PDF, or tesseract to perform OCR on images. However, if a user uploads a scanned document as a PDF, it won't perform OCR on the scanned document, so no text is produced.

Tesseract can't process PDFs natively (ergo the pdftotext), but we can use pdfimages to extract the images into a temporary directory and loop tesseract over those to produce our extracted OCR text.
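
A rough sketch of that pdfimages-then-tesseract loop, in Python rather than in Hypercube's own code. It assumes the `pdfimages` (poppler-utils) and `tesseract` binaries are on the PATH; the function names and the `page` file prefix are made up for illustration and are not part of Hypercube.

```python
import subprocess
import tempfile
from pathlib import Path


def pdfimages_command(pdf_path, prefix):
    # -png asks pdfimages to write the extracted images as PNG files
    # named <prefix>-000.png, <prefix>-001.png, ...
    return ["pdfimages", "-png", str(pdf_path), str(prefix)]


def tesseract_command(image_path):
    # Passing "stdout" as the output base makes tesseract print the
    # recognized text instead of writing a .txt file.
    return ["tesseract", str(image_path), "stdout"]


def ocr_image_only_pdf(pdf_path, run=subprocess.run):
    """Extract the images from a PDF and OCR each one (hypothetical helper).

    `run` defaults to subprocess.run; it is a parameter only so the
    external tools can be swapped out when experimenting.
    """
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        run(pdfimages_command(pdf_path, prefix), check=True)

        pages = []
        # The zero-padded numbering keeps the images in extraction order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = run(
                tesseract_command(image),
                check=True,
                capture_output=True,
                text=True,
            )
            pages.append(result.stdout)
        return "\n".join(pages)
```

Something like `ocr_image_only_pdf("scan.pdf")` would then return the concatenated text. The temporary directory is cleaned up automatically, which is the main reason for using a context manager here.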

DonRichards commented 3 years ago

Is this still an issue?

seth-shaw-unlv commented 3 years ago

I think so. Tagging as a feature request until someone can confirm we have PDF OCR.

DonRichards commented 2 years ago

It should be working. Our installation has it working.

seth-shaw-unlv commented 2 years ago

@DonRichards, and it isn't a Born-Digital specific feature? Can you test on a vanilla Islandora install to confirm?

DonRichards commented 2 years ago

I think this is still a problem after testing on my local install.

rosiel commented 2 years ago

This is still an issue. When Hypercube gets a PDF, it uses pdftotext instead. This was done as part of the RDM work.

We did not realize that you can get text out of most text-containing files (and, if you want, into the Solr index) with https://www.drupal.org/project/file_extractor.

I have a PR coming after I run the tests.

rosiel commented 2 years ago

I do not have a PR coming. It turns out Hypercube on its own does not accept PDFs; you need a wrapper like ocrmypdf.
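
For anyone curious what the wrapper route might look like, here is a minimal sketch that calls the `ocrmypdf` command line and reads the plain text back from its sidecar file. It assumes `ocrmypdf` is installed and on the PATH; the function names and file arguments are illustrative only and do not reflect how Hypercube or any Islandora module would wire this up.

```python
import subprocess


def ocrmypdf_command(input_pdf, output_pdf, sidecar_txt):
    # --sidecar writes the recognized text to a separate .txt file.
    # --skip-text leaves pages that already have a text layer alone,
    # so PDFs that already contain text are not rejected.
    return [
        "ocrmypdf",
        "--skip-text",
        "--sidecar",
        str(sidecar_txt),
        str(input_pdf),
        str(output_pdf),
    ]


def ocr_pdf_to_text(input_pdf, output_pdf, sidecar_txt, run=subprocess.run):
    """Run ocrmypdf and return the sidecar text (hypothetical helper)."""
    run(ocrmypdf_command(input_pdf, output_pdf, sidecar_txt), check=True)
    with open(sidecar_txt, encoding="utf-8") as handle:
        return handle.read()
```

The searchable PDF lands at `output_pdf`, so the same run could feed both a text derivative and a replacement service file, if that is wanted.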

DonRichards commented 2 years ago

Linked to this as well: https://github.com/Islandora/documentation/issues/1012

adam-vessey commented 1 year ago

Popped up in the islandora/islandora queue: https://github.com/Islandora/islandora/issues/910