LeoFCardoso / pdf2pdfocr

A free tool to OCR a PDF and add a text "layer" to the original file, making a searchable PDF. Uses only open source tools. Please tip!
Apache License 2.0

the right edge of the text is not fully highlighted #23

Closed derrikF closed 3 years ago

derrikF commented 3 years ago

Hello

In the PDF, after recognition with pdf2pdfocr_gui.py, the right edge of the text is not fully highlighted if you select 'tesseract' for the '-e' option. If you select 'native', the entire text is highlighted correctly, but there are no Cyrillic characters. See the screenshots...

If you say that this is a tesseract problem, that is incorrect, because I recognize DjVu files with 'ocrodjvu' and the text is highlighted correctly after recognition.

Can this be corrected?

LeoFCardoso commented 3 years ago

Can you attach the input PDF?

derrikF commented 3 years ago

> Can you attach the input PDF?

2020-11-07.pdf

LeoFCardoso commented 3 years ago

I had some success with this command:

pdf2pdfocr.py -i 2020-11-07.pdf -l rus -v -f -g best

But this rebuilds the original PDF, because without "-f" I get an exception.

I will check in more depth. On which page number do you see the issue?

derrikF commented 3 years ago

I apologize, but this seems to be a problem with the programs used to open the PDF. Other programs highlight the text correctly, so the fix is to change the default PDF viewer.

The only thing missing in your tool is a way to specify a page range instead of running through the entire document. Otherwise, it is a great tool with a GUI.

Or can "--page=" be used with "-x"?

On Linux, Evince and FoxitReader highlight the text incorrectly; WPS Reader and Master PDF Editor 5 highlight it correctly.

LeoFCardoso commented 3 years ago

I will check this suggestion and try to implement a page range flag. "-x" will not work, as it would be passed to all pages. Thank you for your issue.

LeoFCardoso commented 3 years ago

Issue fixed in the last commit. Page range added to the roadmap. Thank you!
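
For readers wondering what the requested page range option might involve, here is a minimal sketch of parsing a page range string. It is not taken from pdf2pdfocr: the function name parse_page_range and the "1-3,7" syntax are assumptions made purely for illustration, and the thread does not say how the feature will be implemented.

```python
from typing import List


def parse_page_range(spec: str, total_pages: int) -> List[int]:
    """Expand a spec such as "1-3,7" into sorted, unique 1-based page numbers.

    Hypothetical helper for illustration only; not part of pdf2pdfocr.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Reject ranges that are reversed or fall outside the document.
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

With a list like this, a tool could run OCR only on the selected pages instead of the whole document, which is the behavior requested above.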