dlareklami / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Words concatenated together in pdf output #1125

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
 - I used Tesseract to process a PNG file and produce both PDF output and hOCR output. The resulting files are both attached.

What is the expected output? What do you see instead?
- If you open the attached PDF in a viewer (I'm using Preview on OS X), the
search within the PDF works really well and highlights individual words.
When I use the mouse to select text and copy it to the clipboard, there are
cases where the copied text runs words together with no spaces.
In the hOCR output those words are distinct (i.e. not concatenated).
For example, try copying the very first sentence in the attached PDF; what ends
up on your clipboard is:
"Theseandrelatedgeographicquestionsarefrequentlyaskedbypeoplefromvariousareasofexpertise".
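
For anyone who wants to compare a clipboard string against the hOCR output, here is a minimal sketch. It assumes word elements carry the `ocrx_word` class and that no unclosed void tags appear inside a word element. The helper names (`hocr_words`, `looks_concatenated`) and the sample markup are made up for illustration; the sample is not taken from the attached files.

```python
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect the text of every element whose class list includes 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0  # nesting depth inside the current word element

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Nested markup (e.g. <strong>) inside a word element.
            self._depth += 1
        elif "ocrx_word" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self.words.append("")

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.words[-1] += data


def hocr_words(hocr_text):
    """Return the non-empty word strings found in an hOCR document."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return [w.strip() for w in parser.words if w.strip()]


def looks_concatenated(copied, words):
    """True if the copied text equals the hOCR words joined with no spaces."""
    return len(words) > 1 and copied.strip() == "".join(words)


# Hypothetical hOCR fragment, for illustration only.
sample = """<span class='ocr_line' title='bbox 0 0 100 20'>
<span class='ocrx_word' id='word_1' title='bbox 0 0 40 20'>These</span>
<span class='ocrx_word' id='word_2' title='bbox 45 0 70 20'><strong>and</strong></span>
<span class='ocrx_word' id='word_3' title='bbox 75 0 100 20'>related</span>
</span>"""

words = hocr_words(sample)
print(" ".join(words))
print(looks_concatenated("Theseandrelated", words))
```

If `looks_concatenated` returns `True` for text copied from a viewer while the hOCR words are separate, the spaces are being lost somewhere between the PDF text layer and the clipboard. The check does not say which side is responsible.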

What version of the product are you using? On what operating system?
- Using Tesseract 3.03 with Leptonica 1.70, all compiled from source on Ubuntu
12.04.

Original issue reported on code.google.com by n...@talis.com on 26 Feb 2014 at 8:01


GoogleCodeExporter commented 9 years ago
I've just tested this using a different PDF viewer and the problem does not
occur there. This might be a defect in Preview on OS X as opposed to a defect in
Tesseract.

Original comment by n...@talis.com on 27 Feb 2014 at 10:22

GoogleCodeExporter commented 9 years ago
I tested 12.png-ocr.pdf on Windows in Adobe Reader XI, and the words are not
concatenated. I copied the text to the clipboard and pasted it into a document,
and the words were separated.

So I closed this issue: it is a PDF reader issue, not a Tesseract issue.

Original comment by zde...@gmail.com on 27 Feb 2014 at 2:15

GoogleCodeExporter commented 9 years ago
Thank you, that's perfect.

Original comment by n...@talis.com on 27 Feb 2014 at 2:22