yatrik-cloud closed 1 year ago
Can you please share the original PDF for debugging? Also, please try the "--ignore-existing-text" flag. pdf2pdfocr may be running OCR on a PDF that already contains text.
Yes, I checked: the PDF already contained text. Is there an option that either ignores the existing text or removes it, and then applies OCR so that none of the original text is left in the PDF and nothing is duplicated?
"So is there any option such that, it ignores the existing text or removes the existing text." To ignore existing text, use "--ignore-existing-text". To remove the existing text and force OCR text only, try "-f -g smart". In this case the OCRed PDF is rebuilt, so its size can change a lot.
It worked with "-f -g smart". Thank you so much.
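For reference, the two options from the reply above can be kept side by side in a small lookup. This is only a sketch: the flag strings are copied from this thread, while the mode names and the `ocr_flags` helper are hypothetical and not part of pdf2pdfocr. How the input file is passed to the tool is deliberately left out, since the thread does not show it.

```python
# Flag sets quoted from this thread. The dictionary keys are hypothetical
# labels invented for this sketch, not pdf2pdfocr option names.
OCR_MODES = {
    # Ignore text that is already present in the PDF.
    "ignore_existing_text": ["--ignore-existing-text"],
    # Remove the existing text and force OCR text only (the PDF is rebuilt).
    "ocr_text_only": ["-f", "-g", "smart"],
}


def ocr_flags(mode: str) -> list[str]:
    """Return a copy of the flag list for one of the hypothetical mode names."""
    if mode not in OCR_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return list(OCR_MODES[mode])
```

The returned list could be appended to whatever command line is used to start the tool, for example as part of an argument list given to `subprocess.run`.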
After OCR processing of these PDFs, extracting text from the output, whether through code or by copying directly from the browser-rendered PDF, returns all of the text twice.
For instance, if the original text contains 5 characters, extraction after OCR returns 10 characters, so the content is effectively duplicated.
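A quick way to confirm this symptom is to compare character counts before and after OCR. The sketch below assumes the text has already been extracted into plain strings by whatever tool is in use. The function names and the 10% tolerance are illustrative choices, not anything taken from pdf2pdfocr.

```python
def visible_char_count(text: str) -> int:
    """Count non-whitespace characters; extractors often change spacing."""
    return sum(1 for ch in text if not ch.isspace())


def duplication_ratio(expected_text: str, extracted_text: str) -> float:
    """Return how many times larger the extracted text is than expected."""
    expected = visible_char_count(expected_text)
    if expected == 0:
        raise ValueError("expected_text has no visible characters")
    return visible_char_count(extracted_text) / expected


def looks_duplicated(expected_text: str, extracted_text: str,
                     tolerance: float = 0.1) -> bool:
    """True when the extracted text is roughly twice the expected length."""
    ratio = duplication_ratio(expected_text, extracted_text)
    return abs(ratio - 2.0) <= tolerance
```

With the 5-character example from the report, `duplication_ratio("hello", "hello hello")` gives 2.0, so `looks_duplicated` returns `True`. A length check like this only flags the symptom. It does not prove that the extra text is an exact copy.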