Open seth-shaw-unlv opened 4 years ago
Is this still an issue?
I think so. Tagging as a feature request until someone can confirm we have PDF OCR.
It should be working. Our installation of it has it working
@DonRichards, and it isn't a Born-Digital specific feature? Can you test on a vanilla Islandora install to confirm?
I think this is still a problem after testing on my local install.
This is still an issue. When Hypercube gets a PDF, it uses pdftotext instead of running OCR. This was done as part of the RDM work.
We did not realize that you can get text out of most text-containing files (and, if you want, into the Solr index) with https://www.drupal.org/project/file_extractor.
I have a PR coming after I run the tests.
I do not have a PR coming. It turns out Hypercube on its own does not accept PDFs; you need a wrapper like ocrmypdf.
Linked to this as well https://github.com/Islandora/documentation/issues/1012
Popped up in the islandora/islandora queue: https://github.com/Islandora/islandora/issues/910
Stemming from an email discussion:
Hypercube currently uses pdftotext to extract text embedded in a PDF OR tesseract to perform OCR on images. However, if a user uploads a scanned document as a PDF, it won't perform OCR on the scanned document resulting in no output.
Tesseract can't process PDFs natively (hence pdftotext), but we can use pdfimages to extract the images into a temporary directory and loop tesseract over them to produce the OCR text.
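
To make the proposal concrete, here is a rough sketch of the pdfimages-then-tesseract loop in Python. It is not Hypercube's code, only an illustration. It assumes the poppler `pdfimages` and `tesseract` binaries are on the PATH. The function names and the `run` parameter are made up for this example. Note that pdfimages extracts embedded images, not pages, so a page with several images would be OCR'd piece by piece.

```python
import subprocess
import tempfile
from pathlib import Path


def pdfimages_command(pdf_path, prefix):
    # -png asks poppler's pdfimages to write files named <prefix>-NNN.png
    return ["pdfimages", "-png", str(pdf_path), str(prefix)]


def tesseract_command(image_path):
    # Using "stdout" as the output base makes tesseract print the text
    # instead of writing a .txt file
    return ["tesseract", str(image_path), "stdout"]


def ocr_scanned_pdf(pdf_path, run=subprocess.run):
    """Extract images from a PDF into a temp dir and OCR each one.

    `run` is a hypothetical hook (defaults to subprocess.run) so the
    loop can be exercised without the real binaries installed.
    """
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        run(pdfimages_command(pdf_path, prefix), check=True)

        texts = []
        # Sort so the text comes out in extraction order (000, 001, ...)
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = run(
                tesseract_command(image),
                check=True,
                capture_output=True,
                text=True,
            )
            texts.append(result.stdout)
        return "\n".join(texts)
```

One way this could slot in: try pdftotext first and only fall back to `ocr_scanned_pdf("scan.pdf")` when it returns no text, so born-digital PDFs keep the faster path.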