unicharset_extractor generates a unicharset that needs modification to train the Farsi language

akorentlab / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

unicharset_extractor generates a unicharset that needs modification to train the Farsi language #811

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Generate tif and box pairs with BoxMaker.
2. tesseract per.BZar.exp0.tif per.BZar.exp0 nobatch box.train.stderr
3. unicharset_extractor per.BZar.exp0.box
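
The two command-line steps above can be scripted. The following is only a minimal sketch: the helper name `training_commands` is hypothetical, the file names are the ones from this report, and it assumes `tesseract` and `unicharset_extractor` are on the PATH if the commands are actually run.

```python
import shlex


def training_commands(base="per.BZar.exp0"):
    """Return the command lines from steps 2 and 3 as argument lists."""
    return [
        # Step 2: run tesseract in box-training mode on the tif/box pair.
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train.stderr"],
        # Step 3: extract the unicharset from the box file.
        ["unicharset_extractor", f"{base}.box"],
    ]


if __name__ == "__main__":
    # Only print the commands; to execute one, pass the list to
    # subprocess.run(cmd, check=True).
    for cmd in training_commands():
        print(shlex.join(cmd))
```

Building the commands as argument lists avoids shell quoting problems if the base name ever contains spaces.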

What is the expected output? What do you see instead?
The resulting unicharset file is not suitable for continuing the training process,
and we need to modify it with this Python script:
https://github.com/reza1615/PersianOcr/blob/master/Convertor%20unicharset%20to%20RTL.py
(also attached, with slight differences)

What version of the product are you using? On what operating system?
tesseract-ocr-3.02-win32
Win7 64bit

Please provide any additional information below.
I think we have the same issue when training Arabic.

Original issue reported on code.google.com by abidiash...@gmail.com on 21 Dec 2012 at 8:48

Attachments:

convertor.py

GoogleCodeExporter commented 9 years ago

Excuse me, I should add under "What steps will reproduce the problem?" that you
should have Farsi (Persian), i.e. Arabic-script, characters in the tif and box files.
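
To confirm that a box file really contains right-to-left characters before trying to reproduce the problem, something like the following could be used. This is a sketch, not part of the original report: the function name `rtl_symbols` is hypothetical, and it assumes the usual Tesseract 3.x box line layout of `<symbol> <left> <bottom> <right> <top> <page>`. It relies on the Unicode bidirectional classes `R` and `AL` as reported by Python's `unicodedata` module.

```python
import unicodedata

# Unicode bidi classes used for right-to-left letters (Hebrew: R, Arabic: AL).
RTL_CLASSES = {"R", "AL"}


def rtl_symbols(box_lines):
    """Collect the symbols in box-file lines that contain an RTL character."""
    found = set()
    for line in box_lines:
        parts = line.split()
        if len(parts) < 5:
            # Not a "<symbol> left bottom right top [page]" line; skip it.
            continue
        symbol = parts[0]
        if any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in symbol):
            found.add(symbol)
    return found


# Two made-up box lines: ARABIC LETTER BEH and a Latin "A".
sample = ["\u0628 10 20 30 40 0", "A 35 20 50 40 0"]
print(rtl_symbols(sample))
```

For a real file, the lines can be read with `open("per.BZar.exp0.box", encoding="utf-8")`.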

Original comment by abidiash...@gmail.com on 22 Dec 2012 at 4:02

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

Can you provide the per.BZar.exp0.tif and the correct per.BZar.exp0.box files?

Original comment by zde...@gmail.com on 3 Jan 2013 at 10:40

GoogleCodeExporter commented 9 years ago

Hello guys,

I'm trying to train Tesseract for Kurdish, which should be useful for Persian
too. Kurdish has a few additional letters, but it is written the same way as
Arabic or Farsi. The problem is that the final OCR result runs from left to
right instead of right to left, so the text is unreadable even though the
individual letters are correct. I use qt-box-editor to edit the box files, then
Serak Tesseract Trainer V0.4 to train the OCR, and finally I put the
traineddata file in the Tesseract directory. Everything goes well except that
the right-to-left writing direction of Arabic script is missing.

Does anybody know this problem?

You can find the traineddata file I generated in the attachment.

Thanks a lot

Original comment by karo0...@gmail.com on 18 Oct 2013 at 7:41

Attachments:

ara1.traineddata

GoogleCodeExporter commented 9 years ago

Hello

Do as we did.

Original comment by abidiash...@gmail.com on 19 Oct 2013 at 9:43

GoogleCodeExporter commented 9 years ago

In 3.04, use set_unicharset_properties to do this.
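
A minimal sketch of how that tool might be invoked from a script. The `-U`, `-O` and `--script_dir` options are the ones shown in the Tesseract 3.03 to 3.05 training guide, so check them against your build. The helper name and all paths below are assumptions for illustration only.

```python
def set_properties_command(unicharset_in, unicharset_out, script_dir):
    """Build the argument list for a set_unicharset_properties run."""
    return [
        "set_unicharset_properties",
        "-U", unicharset_in,             # unicharset from unicharset_extractor
        "-O", unicharset_out,            # unicharset with properties filled in
        f"--script_dir={script_dir}",    # directory holding the langdata files
    ]


if __name__ == "__main__":
    # Hypothetical paths; adjust them to the local training setup.
    cmd = set_properties_command("unicharset", "unicharset.fixed", "langdata")
    # Printed only; run it with subprocess.run(cmd, check=True) if desired.
    print(" ".join(cmd))
```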

Original comment by theraysm...@gmail.com on 4 Nov 2014 at 7:06