Add feature to OCR the pages of each PDF without a text layer, so text appears in Alfresco search results

chris001 commented 9 years ago

Would it be possible to add a feature to the Alfresco PDF Toolkit, "Add OCR'ed Searchable Text Layer to Existing PDF Documents that have no searchable text layer," and make it compatible with the new Alfresco 5.0?

Details of a sensible open source implementation below:

1. Identify all PDFs in the repository that have no text layer (in Alfresco it is easy to tell a PDF with a text layer from one without, using the PDFBox library included in Alfresco) and run the following steps, through a custom ContentTransformer, on each of these PDFs:

- Split each PDF document into multiple images, one for each page. The open source tool to use is PDFtk.
- Run an OCR engine on each image (page) in order to extract the text (and layout) from the image. The input is a page image, the output is an hOCR file. The open source tool to use is Tesseract-ocr.
- Merge each page image and its corresponding hOCR file into a PDF. The result will contain the visual content from the input image with a hidden text layer from the hOCR file. The open source tool to use is hOcr2Pdf.
- Merge all the PDFs created for each page back into a single PDF. The open source tool to use is PDFJoin.
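The four steps above could be wired together roughly as follows. This is only a sketch under assumptions, not a tested transformer: the function names and the `pg_%04d` file naming are made up for illustration, and `pdftoppm` (Poppler) is an extra tool assumed here for rendering each page to an image, since PDFtk's `burst` produces single-page PDFs rather than images. It also assumes a fresh, empty working directory and that `pdfjoin` (part of older PDFjam releases) is installed.

```python
import subprocess
from pathlib import Path


def split_command(src_pdf, workdir):
    # pdftk "burst" writes one single-page PDF per page of the input.
    return ["pdftk", str(src_pdf), "burst", "output",
            str(Path(workdir) / "pg_%04d.pdf")]


def page_commands(page_pdf, workdir):
    """Return (argv, stdin_file) pairs that OCR one single-page PDF."""
    base = str(Path(workdir) / Path(page_pdf).stem)
    image, hocr, out_pdf = base + ".png", base + ".hocr", base + "_ocr.pdf"
    return [
        # Assumption: pdftoppm renders the page image (not named in the thread).
        (["pdftoppm", "-r", "300", "-png", "-singlefile", str(page_pdf), base], None),
        # The "hocr" config makes Tesseract write <base>.hocr
        # (older Tesseract releases used a .html extension instead).
        (["tesseract", image, base, "hocr"], None),
        # hocr2pdf reads the hOCR markup from standard input.
        (["hocr2pdf", "-i", image, "-o", out_pdf], hocr),
    ]


def join_command(page_pdfs, out_pdf):
    return ["pdfjoin", "--outfile", str(out_pdf), *map(str, page_pdfs)]


def run(argv, stdin_file=None):
    # check=True aborts the pipeline as soon as one tool fails.
    if stdin_file is None:
        subprocess.run(argv, check=True)
    else:
        with open(stdin_file, "rb") as fh:
            subprocess.run(argv, stdin=fh, check=True)


def ocr_pdf(src_pdf, out_pdf, workdir):
    """Split, OCR each page, and join the pages back into one PDF."""
    run(split_command(src_pdf, workdir))
    pages = sorted(Path(workdir).glob("pg_[0-9][0-9][0-9][0-9].pdf"))
    ocr_pages = []
    for page in pages:
        commands = page_commands(page, workdir)
        for argv, stdin_file in commands:
            run(argv, stdin_file)
        ocr_pages.append(commands[-1][0][-1])  # path given to hocr2pdf -o
    run(join_command(ocr_pages, out_pdf))
```

Calling `ocr_pdf("scan.pdf", "scan_ocr.pdf", "/tmp/ocr-work")` would then run the whole chain. The commands are built separately from where they are executed so the plan can be inspected or logged before anything runs.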

A Linux script would run the whole process, called from Alfresco through a custom ContentTransformer. This is a special ContentTransformer because its source and target MIME types are identical! We don’t want Alfresco to use this ContentTransformer in an uncontrolled way, so we created it as “unregistered”, which means that it cannot be found through the Transform service and can be called only by direct reference. Furthermore, as the OCR processing can be quite demanding on the server's processor, it is best run at night.

Every night, a job uses PDFBox to find the new PDF documents in the repository with no text layer, and it automatically calls the custom ContentTransformer on each one of them. Then, the job creates a new version of the PDF document in the repository from the ContentTransformer output.
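The selection step of that nightly job might look something like the sketch below. Note the swap: the proposal uses PDFBox inside Alfresco, whereas this standalone illustration shells out to Poppler's `pdftotext` instead. The function names and the `min_chars` threshold are assumptions chosen for the example, not values from this thread.

```python
import subprocess


def extract_text(pdf_path):
    # pdftotext prints to standard output when the output file is "-".
    result = subprocess.run(["pdftotext", str(pdf_path), "-"],
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def has_text_layer(text, min_chars=20):
    """Treat a PDF as searchable if enough non-whitespace text was extracted.

    min_chars is an arbitrary illustrative threshold; form feeds between
    pages count as whitespace and are ignored.
    """
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def select_for_ocr(pdf_paths, extract=extract_text):
    # Keep only the documents whose extracted text is (nearly) empty.
    return [path for path in pdf_paths if not has_text_layer(extract(path))]
```

Passing the extractor in as a parameter keeps the decision logic separate from the tool that reads the PDF, so the same filter could sit on top of PDFBox output in the real job.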

To summarize, we take a multiple-page PDF with only an image layer and transform it into another multiple-page PDF that looks identical but has an added hidden text layer, behind the scanned images, containing the OCR output. hOCR is an open format based on HTML that represents OCR output by combining layout and style with the recognized text itself.
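To make the hOCR format a little more concrete, here is a small sketch that pulls the recognized words and their bounding boxes out of an hOCR fragment using only the Python standard library. The sample markup is hand-written for illustration (it is not output from a real scan), and the helper names are made up. Words without a parsable `bbox` property are simply skipped.

```python
from html.parser import HTMLParser

# Hand-written hOCR fragment: elements carry a class naming the layout
# level and a title holding "bbox x0 y0 x1 y1" plus optional properties.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 2480 3508'>
 <span class='ocr_line' title='bbox 36 92 580 116'>
  <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Invoice</span>
  <span class='ocrx_word' title='bbox 109 92 171 116; x_wconf 93'>2019</span>
 </span>
</div>"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(value) for value in parts[1:])
    return None


class HocrWords(HTMLParser):
    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word element currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._bbox = parse_bbox(attrs.get("title") or "")

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None


def hocr_words(hocr):
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

`hocr_words(SAMPLE)` yields each word paired with its pixel coordinates on the page image, which is the information a tool needs to place invisible text behind the scanned picture.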

ntmcminn commented 8 years ago

Yeah, I like the idea! Been a bit silent on this project, moving it to the new SDK and Alfresco 5.x at the moment. Once the 5.x conversion is done, let's take a look.

douglascrp commented 8 years ago

Hello. I understand the request, but I think the OCR processing part is already done here: https://github.com/keensoft/alfresco-simple-ocr. The transformer part can be reused, leaving pdf-toolkit with only the job of finding the PDFs and executing the action that adds the OCR layer.

douglascrp commented 5 years ago

@OrderOfTheBee/order Should we close this one? It is old, and nowadays we have other options able to do what is being asked here.

Funies commented 5 years ago

Hi,

You can close it.

On Tue, Mar 5, 2019 at 9:42 PM Douglas C. R. Paes notifications@github.com wrote:

@OrderOfTheBee/order https://github.com/orgs/OrderOfTheBee/teams/order Should we close this one? It is old, and nowadays we have other options able to do what is being asked here.

DevilBit commented 2 years ago

Please, how can I add the document-scanning feature to Alfresco Share?

OrderOfTheBee / alfresco-pdf-toolkit

Add feature to OCR the pages of each PDF without a text layer, so text appears in Alfresco search results #10