Open GoogleCodeExporter opened 9 years ago
Here are my tests (on Linux) with the latest code:
- psm 0 to 4 produce no output
- psm 5 output:
2
x
- psm 6 output:
2
x
- psm 7 output:
x2
- psm 8 output:
x2
- psm 9 output:
x2
- psm 10 output:
fi
I ran it from the command line like this: 'tesseract superscript.jpg - -psm 8'
I also tried it on Windows 7 Pro and got the same results, so I cannot
reproduce the problem.
Original comment by zde...@gmail.com
on 17 Apr 2015 at 7:55
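The per-mode test above can be scripted. This is a minimal sketch, not part of the original report: the helper names `build_psm_commands` and `run_all` are made up for illustration, and it assumes the `-psm` flag spelling used in this thread (newer tesseract releases spell it `--psm`).

```python
import subprocess


def build_psm_commands(image, modes=range(0, 11)):
    """Return one tesseract argv list per page segmentation mode (hypothetical helper)."""
    # "-" as the output base sends the recognized text to stdout;
    # "-psm" is the flag spelling used in this thread.
    return [["tesseract", image, "-", "-psm", str(mode)] for mode in modes]


def run_all(image):
    """Run every command and map psm -> stripped stdout. Requires tesseract on PATH."""
    results = {}
    for cmd in build_psm_commands(image):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
        results[int(cmd[-1])] = proc.stdout.strip()
    return results
```

Calling `run_all("superscript.jpg")` on a machine with tesseract installed gives a dictionary that can be compared mode by mode against the list above.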
PSM_SINGLE_BLOCK (psm 6) is the problem. psm 7 and higher force horizontal text
no matter what, but they are not applicable when scanning pages.
Original comment by Jimma...@gmail.com
on 17 Apr 2015 at 9:50
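For reference, the psm numbers used in this thread map to names in tesseract's PageSegMode enum. The table below is a sketch written from memory of the 3.x sources, so treat it as an assumption and check it against your version; `PSM_NAMES` and `describe_psm` are illustrative names only.

```python
# psm number -> PageSegMode name (as recalled from tesseract 3.x; verify for your version)
PSM_NAMES = {
    0: "PSM_OSD_ONLY",
    1: "PSM_AUTO_OSD",
    2: "PSM_AUTO_ONLY",
    3: "PSM_AUTO",
    4: "PSM_SINGLE_COLUMN",
    5: "PSM_SINGLE_BLOCK_VERT_TEXT",
    6: "PSM_SINGLE_BLOCK",
    7: "PSM_SINGLE_LINE",
    8: "PSM_SINGLE_WORD",
    9: "PSM_CIRCLE_WORD",
    10: "PSM_SINGLE_CHAR",
}


def describe_psm(psm):
    """Return the enum name for a psm number, or 'unknown' if it is not in the table."""
    return PSM_NAMES.get(psm, "unknown")
```

Read this way, psm 5 treats the image as vertical text and psm 6 as a block of several lines, while psm 7 to 9 assume a single line or word.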
superscript.jpg is neither a page nor a block (paragraph)!
If you instruct tesseract to analyze this image as several lines of text, that
is your request and not a tesseract failure.
Original comment by zde...@gmail.com
on 18 Apr 2015 at 5:11
A better example is attached. Using the command
"tesseract superscript.jpg - -psm 6" outputs:
"
aaaaa
2
ax
CCCCCC
"
Original comment by Jimma...@gmail.com
on 18 Apr 2015 at 5:37
I do not think that is correct typesetting of a superscript. You place the 2
above the x-height, which is IMO wrong. If you do line segmentation, it could
end up on a separate line.
Have a look, e.g. at Wikipedia, at how a superscript should be typeset:
http://en.wikipedia.org/wiki/Subscript_and_superscript
If I correct the typesetting (see attachment), I get the correct result:
aaaaa
ax2
cccccc
Original comment by zde...@gmail.com
on 18 Apr 2015 at 7:16
Thanks for looking into this. Many fonts place the superscript 1-2px above the
lowercase letters, as there is no commonly followed standard for how high it
should sit. Is there maybe an option to modify the superscript detection
parameters in tesseract?
Original comment by Jimma...@gmail.com
on 19 Apr 2015 at 5:21
I remember that somebody on the tesseract forum had a problem with diacritical
marks (usually placed above a letter, e.g. á): tesseract placed them on a
separate line. The solution was to modify some parameter. Unfortunately I cannot
find that conversation.
I will try to have a look at this later, so I am changing the status of the
issue to open...
Original comment by zde...@gmail.com
on 19 Apr 2015 at 8:36
I tried searching and found this:
https://code.google.com/p/tesseract-ocr/issues/detail?id=877
Setting textord_min_linesize seems to work, but it misrecognizes the letter "a"
even though it is a perfectly clean character. Any reason why?
Using the command "tesseract superscript.png - -psm 6 config.txt" with
config.txt having the contents "textord_min_linesize 2", it outputs:
"
aaaaa
3X2
CCCCCC
"
Original comment by Jimma...@gmail.com
on 20 Apr 2015 at 12:23
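The config-file experiment above can be reproduced with a small script. This is a hedged sketch: `write_config` and `build_command` are hypothetical helper names, and the only things taken from the thread are the "name value" config line, the parameter textord_min_linesize, and the command layout with the config file as the last argument.

```python
def write_config(path, params):
    """Write tesseract config variables, one 'name value' pair per line."""
    with open(path, "w") as handle:
        for name, value in params.items():
            handle.write("{} {}\n".format(name, value))


def build_command(image, psm, config_path):
    """Build the argv used in this thread: image, '-' for stdout, psm, then the config file."""
    # "-psm" follows the spelling in this thread; newer releases use "--psm".
    return ["tesseract", image, "-", "-psm", str(psm), config_path]
```

For example, `write_config("config.txt", {"textord_min_linesize": 2})` followed by `build_command("superscript.png", 6, "config.txt")` gives the invocation quoted above, which makes it easy to try several values of the parameter and compare the output.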
Original issue reported on code.google.com by
Jimma...@gmail.com
on 29 Jan 2015 at 6:52