No space in output text for Hindi language

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Run tesseract on an image using my own hin.traineddata file.

What is the expected output? What do you see instead?
The words in the image are recognized (not accurately yet; I'm still working on 
that), but the output text has no spaces between words, unlike the image.

What version of the product are you using? On what operating system?
Tesseract 3.02

Please provide any additional information below.
I edited the boxes with jTessBoxEditor, then used Serak Tesseract Trainer to 
build the traineddata file from the box file.
The box file is attached below.

Original issue reported on code.google.com by sheekhaj...@gmail.com on 22 Feb 2014 at 9:22

Attachments:

hin.utsaah.exp0.box

GoogleCodeExporter commented 9 years ago

This is not a valid issue report:
1. We do not provide support for custom training.
2. If you use 3rd-party tools, you should contact their authors.
3. It is not clear how your box file is related to your issue (no space in 
output text for Hindi language).

Original comment by zde...@gmail.com on 2 Mar 2014 at 2:47

Changed state: Invalid

GoogleCodeExporter commented 9 years ago

I had tried earlier to run tesseract on some images from cmd, but it kept 
showing the errors 'Invalid TIFF directory; tags aren't sorted in ascending 
order.' and 'unknown field with tag 20624 encountered.'
For the one image that did produce output, there were no spaces in the output 
text.

Original comment by sheekhaj...@gmail.com on 2 Mar 2014 at 4:57

GoogleCodeExporter commented 9 years ago

Hi.

I'm encountering much the same problem with Urdu recognition.

Urdu uses ligatures, and a ligature completely changes a word's shape. Urdu 
text also has uneven gaps between words.

In the middle of the text, spaces are ignored almost every time. I assume there 
must be a flag that sets the minimum "space gap". If I'm correct, please let me 
know which flag that is. If there is some other way to get around this problem, 
please mention that as well.

Thank you in advance.
Muhammad Ali Shahzad
Assistant Software Architect
HPC & Computer Vision.

Original comment by sh.muham...@gmail.com on 6 Jun 2014 at 5:31
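
On the question above about a flag for the minimum "space gap": below is a minimal, untested sketch, assuming that the word-spacing parameters `tosp_min_sane_kn_sp` and `tosp_fuzzy_space_factor` from Tesseract's text-ordering code are the relevant ones. The values, file names, and helper functions are hypothetical placeholders, and nothing in this thread confirms that they fix spacing for Hindi or Urdu. The sketch only writes a config file and builds a command line that passes it as the trailing config-file argument.

```python
import os
import tempfile

# Assumed parameter values, for illustration only; not tuned for Hindi or Urdu.
SPACING_PARAMS = {
    "tosp_min_sane_kn_sp": "1.0",
    "tosp_fuzzy_space_factor": "0.5",
}


def write_spacing_config(path, params):
    """Write a Tesseract config file: one 'name value' pair per line."""
    with open(path, "w") as f:
        for name, value in sorted(params.items()):
            f.write("%s %s\n" % (name, value))
    return path


def build_command(image, output_base, lang, config_path):
    """Build the argument list: tesseract image outputbase -l lang configfile."""
    return ["tesseract", image, output_base, "-l", lang, config_path]


if __name__ == "__main__":
    # Hypothetical file names; replace them with a real image and output base.
    config = write_spacing_config(
        os.path.join(tempfile.mkdtemp(), "spacing"), SPACING_PARAMS
    )
    print(" ".join(build_command("page.tif", "out", "urd", config)))
```

The printed command can be run from cmd as-is. Comparing the output with and without the config file on the same image would show whether these parameters affect the missing spaces at all.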
