baopham1340 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

text2image issue with ligatures #1335

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi there, I see a number of ligatures in the output of text2image. Even when
passing the --ligatures option (with which ff etc. are correctly converted to a
single UTF-8 character), I'm still seeing a number of composite characters in
the unicharset_extractor output, listed below:

ft 3 0,68,206,255,72,373,0,42,114,373 Latin 79 0 47 ft  # ft [66 74 ]a
ti 3 58,69,206,255,39,353,0,47,86,353 Latin 26 0 22 ti  # ti [74 69 ]a
tı 3 58,69,174,254,60,473,0,47,107,473 Latin 26 0 22 tı # tı [74 131 ]a
tt 3 58,66,206,254,71,360,0,47,118,360 Latin 26 0 22 tt # tt [74 74 ]a
ttı 3 58,69,174,254,119,653,0,47,166,653 Latin 26 0 22 ttı  # ttı [74 74 131 ]a
tf 3 0,68,206,255,67,373,0,47,114,373 Latin 26 0 22 tf  # tf [74 66 ]a
tti 3 58,69,206,255,98,533,0,47,145,533 Latin 26 0 22 tti   # tti [74 74 69 ]a
fk 3 0,68,216,255,106,391,0,42,148,391 Latin 79 0 47 fk # fk [66 6b ]a
fı 3 0,69,174,255,61,486,0,42,103,486 Latin 79 0 47 fı  # fı [66 131 ]a
fî 3 0,69,216,255,60,392,0,42,102,392 Latin 79 0 47 fî  # fî [66 ee ]a
tî 3 58,69,206,255,59,379,0,47,106,379 Latin 26 0 22 tî # tî [74 ee ]a
fh 3 0,68,216,255,114,401,0,42,156,401 Latin 79 0 47 fh # fh [66 68 ]a

Is there an issue with this (i.e. should we strip them out, or figure out how
to box them properly in the text2image program), or is it OK for them to stay
there?
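
In case it helps anyone reproduce this, here is a rough Python sketch for listing such entries from a unicharset file. It is not part of Tesseract: the function names are made up for illustration, and it assumes that each entry line starts with the unichar string followed by whitespace-separated properties, that placeholder entries such as NULL, Joined and |Broken|0|1 should be skipped, and that the trailing comment lists the code points in hex inside square brackets, as in the lines above.

```python
import re

# Assumption: placeholder entries that do not represent real glyphs.
SPECIAL_TOKENS = {"NULL", "Joined", "|Broken|0|1"}


def multi_codepoint_entries(lines):
    """Return the unichar field of every entry longer than one code point."""
    found = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and the leading entry-count line
        unichar = fields[0]
        if unichar in SPECIAL_TOKENS:
            continue
        if len(unichar) > 1:
            found.append(unichar)
    return found


def codepoints_from_comment(line):
    """Decode the hex code points listed in the trailing '[.. .. ]' comment."""
    match = re.search(r"\[([0-9a-fA-F ]+)\]", line)
    if match is None:
        return ""
    return "".join(chr(int(h, 16)) for h in match.group(1).split())
```

Running multi_codepoint_entries over the lines of the unicharset file should list the composite entries, and codepoints_from_comment can be used to cross-check each one against its hex comment.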

Thanks,

Mark

Original issue reported on code.google.com by zea...@gmail.com on 10 Oct 2014 at 6:56

GoogleCodeExporter commented 9 years ago
The issue tracker is not a support forum (use the user forum for asking
questions). If you find a bug, please submit an issue with a test case so that
it can be replicated with the current code. Otherwise your report is useless.

Original comment by zde...@gmail.com on 1 May 2015 at 11:17