AiPacino / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Half lines with characters chopped incorrectly #1104

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Using Tesseract from SVN, r1050, on OS X.

What steps will reproduce the problem?
1. tesseract test-1391708318482.tmp.jpeg test1 -l fra segdemo inter
2. (also tried with TIFF to try to rule out source format problems)
3. The output is correct except for three lines whose characters are chopped 
in the middle
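
For anyone repeating the JPEG vs. TIFF comparison from step 2, here is a minimal sketch that builds both command lines in the same `tesseract IMAGE OUTPUTBASE -l LANG CONFIG...` form used in step 1. The helper name `build_command` and the TIFF file name are assumptions for illustration only; the actual TIFF name is not given in this report.

```python
import shlex


def build_command(image, outbase, lang="fra", configs=("segdemo", "inter")):
    """Build the argv list for: tesseract IMAGE OUTBASE -l LANG CONFIG..."""
    return ["tesseract", image, outbase, "-l", lang, *configs]


# Command from step 1 of the report.
jpeg_cmd = build_command("test-1391708318482.tmp.jpeg", "test1")

# Hypothetical TIFF copy of the same scan (file name is an assumption).
tiff_cmd = build_command("test.tiff", "test1-tiff")

for cmd in (jpeg_cmd, tiff_cmd):
    # Print a shell-ready command; to execute it instead, pass the list
    # to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Running both with identical language and config arguments keeps the source format as the only variable between the two runs.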

What is the expected output? What do you see instead?
The faulty lines are:
1. Line 2: "7 , PLACE FRANZ LIST"
2. Line 4: "TEL 01 48 78 77 90"
3. Line 11, at the end: "€ 16.00"

Tesseract seems to treat these as fixed-pitch lines (which they are, at least 
lines 2 and 4), but it places the cuts between characters in the wrong positions.
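
To make the symptom concrete, here is a toy model of fixed-pitch chopping. It is only an illustration under simplifying assumptions, not Tesseract's actual algorithm: the line is reduced to a per-column ink profile, cuts are placed at equal intervals, and the names `fixed_pitch_cuts` and `cuts_through_ink` are hypothetical.

```python
def fixed_pitch_cuts(width, pitch, offset=0.0):
    """Return the x positions where a fixed-pitch chopper would cut a line."""
    cuts = []
    x = offset
    while x < width:
        cuts.append(round(x))
        x += pitch
    return cuts


def cuts_through_ink(profile, cuts):
    """Count the cuts that land on a column containing ink."""
    return sum(1 for x in cuts if 0 <= x < len(profile) and profile[x] > 0)


# Synthetic line: 6 glyphs, each 8 inked columns followed by 2 blank columns,
# so the true pitch is 10 columns.
GLYPH, GAP, COUNT = 8, 2, 6
profile = ([1] * GLYPH + [0] * GAP) * COUNT

good = fixed_pitch_cuts(len(profile), pitch=10, offset=9)  # correct pitch
bad = fixed_pitch_cuts(len(profile), pitch=12, offset=9)   # overestimated pitch

print(cuts_through_ink(profile, good))  # 0: every cut falls in a gap
print(cuts_through_ink(profile, bad))   # 4: most cuts slice through a glyph
```

In this model a pitch estimate that is off by only two columns is enough to push almost every cut into the middle of a character, which matches the kind of output seen on lines 2 and 4.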

I haven't uploaded a screenshot of the ScrollView tool, to avoid adding too 
many files. If you want one, I can certainly provide it.

And thanks for this wonderful software.

Original issue reported on code.google.com by pierre.q...@gmail.com on 10 Feb 2014 at 1:18

GoogleCodeExporter commented 9 years ago
Hi guys,

I've been running more tests with the same paper, using different pictures of 
it. Most of the time recognition gets to ~100% for about half of the lines, but 
not always the same half: sometimes the even-numbered lines, sometimes the 
odd-numbered ones. I'm not sure whether that makes sense.

I'll have more time in the coming weeks to dig through the source. Could you 
tell me which source file decides which lines are considered part of the same 
block and therefore get the same fixed pitch applied?

In any case, I'll report what I find in this issue.

Many thanks!

Original comment by pierre.q...@gmail.com on 28 Feb 2014 at 12:53