Characters are recognized twice when extracting text from an OCRed PDF

LeoFCardoso / pdf2pdfocr

A free tool to OCR a PDF and add a text "layer" to the original file, making a searchable PDF. Uses only open source tools. Please tip!

Apache License 2.0

273 stars, 34 forks

Characters are recognized twice when extracting text from an OCRed PDF #46

Closed. yatrik-cloud closed this issue 1 year ago

yatrik-cloud commented 1 year ago

After OCR processing of these PDFs, extracting text from the result by any method, such as code-based extraction or copying directly from the browser-rendered PDF, returns the entire text duplicated: every piece of text comes out twice, although it appears only once in the PDF.

For instance, if the original text contains 5 characters, post-OCR, it recognizes and extracts 10 characters, effectively causing duplication of the content.
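
The symptom described above can be checked by comparing visible character counts before and after OCR. This is only a minimal sketch: it assumes both strings were already extracted with whatever tool is at hand, and the function name `duplication_ratio` is hypothetical, not part of pdf2pdfocr.

```python
def duplication_ratio(original_text, extracted_text):
    """Return how many times longer the extracted text is, ignoring whitespace."""

    def visible(text):
        # Count only non-whitespace characters so line breaks do not skew the ratio.
        return sum(1 for ch in text if not ch.isspace())

    original_count = visible(original_text)
    if original_count == 0:
        raise ValueError("original text has no visible characters")
    return visible(extracted_text) / original_count


# Hypothetical sample data: 5 characters before OCR, 10 after.
before = "Hello"
after = "Hello\nHello"
print(duplication_ratio(before, after))
```

A ratio close to 2 would be consistent with the OCR layer having been added on top of text that was already in the PDF.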

LeoFCardoso commented 1 year ago

Can you please share the original PDF for debugging? Also, please try the "--ignore-existing-text" flag. Maybe pdf2pdfocr is OCRing a PDF that already contains text.
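
Whether a PDF already carries text can be checked before running OCR. The sketch below assumes the per-page text has already been extracted with some tool and passed in as a list of strings; the name `pages_with_text` and the `min_chars` threshold are hypothetical, chosen only for illustration.

```python
def pages_with_text(page_texts, min_chars=1):
    """Return the 0-based indices of pages that already contain extractable text."""
    found = []
    for index, text in enumerate(page_texts):
        # Ignore whitespace so that empty or blank pages are not counted.
        visible = sum(1 for ch in text if not ch.isspace())
        if visible >= min_chars:
            found.append(index)
    return found


# Hypothetical extraction result: page 0 has text, page 1 is a pure scan.
sample_pages = ["Invoice 2023", "   \n"]
print(pages_with_text(sample_pages))
```

If any page index is returned, running OCR without handling the existing text is likely to produce a second copy of that text.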

yatrik-cloud commented 1 year ago

Yes, I checked, and the PDF already contained text. Is there an option that ignores or removes the existing text, so that OCR is applied as if the PDF contained no text and the text is not duplicated?

LeoFCardoso commented 1 year ago

"So is there any option such that, it ignores the existing text or removes the existing text."

To ignore existing text, use "--ignore-existing-text". To remove existing text and force OCR text only, try "-f -g smart". In this case, the OCRed PDF will be rebuilt and its size can change a lot.
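
The two choices above can be sketched as a small command builder. The flags are the ones quoted in this thread; the script name `pdf2pdfocr.py`, the `-i` input option, and the helper `build_command` with its mode names are assumptions made for illustration and should be checked against the tool's own help output.

```python
import shlex

# Flags as quoted in this thread. The mode names are hypothetical labels.
MODES = {
    "ignore": ["--ignore-existing-text"],  # ignore existing text
    "rebuild": ["-f", "-g", "smart"],      # remove existing text, force OCR text only
}


def build_command(input_pdf, mode, script="pdf2pdfocr.py"):
    """Build an argument list; the script name and '-i' are assumptions."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return [script, "-i", input_pdf, *MODES[mode]]


cmd = build_command("scanned report.pdf", "rebuild")
# shlex.join quotes the file name so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` to actually run the tool, assuming it is installed and on the PATH.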

yatrik-cloud commented 1 year ago

It worked with "-f -g smart". Thank you so much.