LeoFCardoso / pdf2pdfocr

A free tool to OCR a PDF and add a text "layer" to the original file, making a searchable PDF. Uses only open source tools. Please tip!
Apache License 2.0

Missing space #3

Closed: ericmoret closed this issue 6 years ago

ericmoret commented 6 years ago

When I use pdf2pdfocr, the text generated includes no space between the words recognized. As a result when I copy/paste the resulting text it is difficult to use as I have to manually reintroduce all missing spaces.

LeoFCardoso commented 6 years ago

Hi Eric, thanks for the message. Please post the original PDF / image that shows this behavior. Please also tell me your operating system, Python version, and PDF reader. Does the "-w" flag generate a correct TXT file?

ericmoret commented 6 years ago

Hello Leo,

I cannot post the original PDF file. However, here is what I observed on macOS High Sierra 10.13.2 with Python 3.4.7. When I use the native Preview.app, Version 10.0 (944.4), I see missing spaces after copy/paste. When I use Adobe Acrobat Reader DC 2018.0009.20050, I see the expected spaces after copy/paste. The -w option also shows proper spacing in the output TXT file.

LeoFCardoso commented 6 years ago

Hi Eric, thanks! Please try running with "-e native" and with "-e tesseract", each with and without the "-p" flag. I can copy / paste from Preview.app correctly in some (rare) cases. Let me know the results! Leo
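
For anyone repeating this test, the four combinations suggested above (two values for "-e", each with and without "-p") can be listed with a short script. This is only a sketch: the "-e" and "-p" flags come from this thread, but the script name `pdf2pdfocr.py`, the `-i` input flag, the file name `scan.pdf`, and the helper `build_commands` are assumptions made for illustration, so check them against your installed version before running anything.

```python
import itertools
import shlex


def build_commands(input_pdf, script="pdf2pdfocr.py"):
    """Return one argument list per flag combination to try.

    Assumptions (not confirmed in this thread): the script name and the
    "-i" input flag. The "-e" values and "-p" come from the thread.
    """
    engines = ["native", "tesseract"]  # values suggested for -e
    p_choices = [False, True]          # without -p, then with -p
    commands = []
    for engine, use_p in itertools.product(engines, p_choices):
        cmd = [script, "-i", input_pdf, "-e", engine]
        if use_p:
            cmd.append("-p")
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so each one can be
    # reviewed and executed by hand.
    for cmd in build_commands("scan.pdf"):
        print(shlex.join(cmd))
```

After each run, open the result in Preview.app, copy a paragraph, and note whether the spaces survive. That makes it easy to report which combination, if any, behaves correctly.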

LeoFCardoso commented 6 years ago

Closing this as it seems to be a "Preview.app" issue and we have a workaround.