baopham1340 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

How to edit box file for paragraph image #1464

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Write a paragraph in a Notepad file
2. Make an image of that Notepad file
3. Generate a box file from that image

What is the expected output? What do you see instead?
If the paragraph contains 80 characters, there should be 80 characters in the 
box file so that I can edit them sequentially. However, the box file contains, 
for example, only 60 characters, as it does not recognize the spacing among characters.
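
One way to check where the difference comes from is to parse the box file and compare the number of boxes with the number of non-space characters in the source text. The sketch below is a minimal illustration, not part of Tesseract: it assumes the Tesseract 3.x box file layout of one glyph per line (`glyph left bottom right top page`), and it assumes that spaces do not get box entries of their own, which would explain a lower count than the raw character total. The names `Box`, `parse_box_lines`, and `count_mismatch` are hypothetical helpers. For Bangla, a single box may hold a multi-codepoint cluster, so counting codepoints is only a rough guide.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One entry of a box file: a glyph and its bounding box on a page."""
    glyph: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(lines: List[str]) -> List[Box]:
    """Parse box file lines, skipping blank ones."""
    boxes = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Split from the right so the glyph field is kept whole.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            raise ValueError(f"malformed box line: {raw!r}")
        glyph, *numbers = parts
        boxes.append(Box(glyph, *map(int, numbers)))
    return boxes


def count_mismatch(text: str, boxes: List[Box]) -> int:
    """Return non-space characters in `text` minus the number of boxes.

    Assumption: whitespace has no box entry, so it is excluded here.
    """
    expected = sum(1 for ch in text if not ch.isspace())
    return expected - len(boxes)
```

A result of zero from `count_mismatch` suggests the box file only differs from the text by its whitespace; a non-zero result points at merged or split glyphs that need manual box editing.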

Please use labels and text to provide additional information.
I am developing it for the Bangla language. I know there is existing trained data 
for Bangla, but I am doing this as a course requirement.

I have attached a screenshot of a single-line paragraph containing 38 
characters; I have separated them with spaces so that others can understand. 
I want Tesseract to generate 38 characters in the box file, but it does not 
recognize Bangla. What should I do?

Original issue reported on code.google.com by m.tawfi...@gmail.com on 26 Apr 2015 at 5:28

GoogleCodeExporter commented 8 years ago
Read the wikis and forums before posting issues!

Original comment by zde...@gmail.com on 27 Apr 2015 at 6:45

GoogleCodeExporter commented 8 years ago
Issue 1441 has been merged into this issue.

Original comment by zde...@gmail.com on 27 Apr 2015 at 6:46

GoogleCodeExporter commented 8 years ago
Hey, thanks for your advice. By the way, can you please tell me how to train 
Tesseract so that it understands word spacing? My trained data cannot recognize 
word spacing, so there are no spaces between words in the output 
file, something like:

The sky is cloudy
Theskyiscloudy.

Thanks again.

Original comment by m.tawfi...@gmail.com on 28 Apr 2015 at 2:51
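
The symptom in the last comment (words run together in the output) can be told apart from ordinary misrecognition with a small check. The sketch below is only an illustration under simple assumptions: it compares the expected text and the OCR output with whitespace removed, and it does not handle extra punctuation such as the trailing period in the example above. The function names are hypothetical.

```python
def strip_whitespace(text: str) -> str:
    """Remove all whitespace so only the recognized glyphs remain."""
    return "".join(text.split())


def lost_word_spacing(expected: str, actual: str) -> bool:
    """Return True when the glyphs match but the output has fewer words.

    That combination suggests the characters were recognized correctly
    and only the spaces between words were dropped.
    """
    same_glyphs = strip_whitespace(expected) == strip_whitespace(actual)
    fewer_words = len(actual.split()) < len(expected.split())
    return same_glyphs and fewer_words
```

For instance, `lost_word_spacing("The sky is cloudy", "Theskyiscloudy")` reports a spacing-only problem, while output with wrong letters does not.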