OCR of PDFs in Tika can take a long time. This is unnecessary if the PDF has already been OCRed.
I would like to see an option to define the OCR strategy used by Tika in the lodestone front end.
Ideally, this would be multi-pass: a first pass with no_ocr, and if the size of the returned text is below a threshold (perhaps 500 bytes), a second pass with text_and_ocr to recognize the document.
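A rough sketch of the multi-pass idea, in case it helps the discussion. It is not lodestone code: `extract_with_fallback` and `tika_extractor` are hypothetical names, the Tika URL is an assumed local tika-server, and the sketch assumes the OCR strategy can be set per request through the `X-Tika-PDFOcrStrategy` header. As far as I can tell, Tika's own name for the combined strategy is `ocr_and_text_extraction` rather than text_and_ocr, so that is used as the default here; please check it against the Tika version in use.

```python
import urllib.request
from typing import Callable, Tuple


def extract_with_fallback(
    extract: Callable[[str], str],
    min_text_bytes: int = 500,  # threshold suggested above; tune as needed
    first_pass: str = "no_ocr",
    second_pass: str = "ocr_and_text_extraction",  # assumed Tika strategy name
) -> Tuple[str, str]:
    """Run a cheap pass first and only fall back to OCR if little text came back.

    Returns (text, strategy_used).
    """
    text = extract(first_pass)
    # Measure the stripped text in bytes so whitespace-only output does not count.
    if len(text.strip().encode("utf-8")) >= min_text_bytes:
        return text, first_pass
    # Too little text: the PDF is probably a scan, so pay for OCR.
    return extract(second_pass), second_pass


def tika_extractor(
    pdf_bytes: bytes,
    tika_url: str = "http://localhost:9998/tika",  # assumed tika-server endpoint
) -> Callable[[str], str]:
    """Build an extract(strategy) function that sends the PDF to tika-server."""

    def extract(strategy: str) -> str:
        request = urllib.request.Request(
            tika_url,
            data=pdf_bytes,
            method="PUT",
            headers={
                "Content-Type": "application/pdf",
                "Accept": "text/plain",
                "X-Tika-PDFOcrStrategy": strategy,
            },
        )
        with urllib.request.urlopen(request) as response:
            return response.read().decode("utf-8", errors="replace")

    return extract
```

Usage would look something like `text, used = extract_with_fallback(tika_extractor(pdf_bytes))`. Keeping the threshold and both strategy names as parameters is deliberate, since those are the values a front end option would need to expose.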