Configurable Tika OCR strategies for PDFs

LodestoneHQ / lodestone

Personal Document Archiving (DMS, EDMS for Personal/Home Office use)


GNU General Public License v3.0


Configurable Tika OCR strategies for PDFs #100

Open wombat94 opened 4 years ago

wombat94 commented 4 years ago

OCR of PDFs in Tika can take a long time. This is unnecessary if the PDF has already been OCRed.

I would like to see an option to define the OCR strategy used by Tika in the lodestone front end.

Ideally, this would be multi-pass: a first pass with no_ocr and, if the size of the returned text is below a threshold (perhaps 500 bytes), a second pass with text_and_ocr to recognize the document.
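To make the idea concrete, here is a rough sketch of the two-pass fallback in Python. It is only an illustration, not how lodestone or Tika currently behaves: `extract_with_fallback` and the `extract` callable are hypothetical names, the strategy strings are the ones used in this issue (the exact names Tika accepts may differ by version), and the 500-byte threshold is just the guess from above.

```python
from typing import Callable, Tuple

# Hypothetical extractor signature: (pdf_bytes, ocr_strategy) -> extracted text.
# In practice this would wrap whatever call is made to Tika.
Extractor = Callable[[bytes, str], str]


def extract_with_fallback(
    pdf: bytes,
    extract: Extractor,
    min_bytes: int = 500,            # assumed threshold, taken from this issue
    first_pass: str = "no_ocr",      # strategy names as written in this issue
    fallback: str = "text_and_ocr",
) -> Tuple[str, str]:
    """Return (text, strategy_used) using a cheap pass first, OCR only if needed."""
    text = extract(pdf, first_pass)
    # Measure the stripped text in UTF-8 bytes so whitespace-only output
    # does not count as a successful extraction.
    if len(text.strip().encode("utf-8")) >= min_bytes:
        return text, first_pass
    # Too little text came back: assume a scanned PDF and re-process with OCR.
    return extract(pdf, fallback), fallback


if __name__ == "__main__":
    # Stand-in extractor for demonstration: pretends the PDF has no text layer.
    def fake_extract(pdf: bytes, strategy: str) -> str:
        return "" if strategy == "no_ocr" else "recognized text " * 50

    text, used = extract_with_fallback(b"%PDF-1.4 ...", fake_extract)
    print(used, len(text))
```

Passing the extractor in as a callable keeps the threshold logic separate from the Tika call itself, so the threshold and both strategy names could be exposed as settings in the front end.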