"umlaute" are not properly recognized

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Start ocrfeeder
2. Try to convert something with umlauts (ä, ö, ü etc.) or ß (very common in
German, but also in many other languages)
3. You get "unreadable" character clusters...

What is the expected output? What do you see instead?
I'd like to see ü,ä,ö etc. Right now it is unusable for me, which is a
pity, because I really like the idea!

What version of the product are you using? On what operating system?
ocrfeeder 0.4, on Ubuntu 9.10, 32-bit

Please provide any additional information below.

Original issue reported on code.google.com by hanksch...@googlemail.com on 6 Dec 2009 at 5:44

GoogleCodeExporter commented 9 years ago

Hi,

The recognition of the text in OCRFeeder depends on the engine you use.

If you use Tesseract, edit it in Tools->OCR Engines and add this to the end of
the Arguments field:

-l de

That should do the job.

Original comment by joaquimr...@gmail.com on 25 Jan 2010 at 7:51

Changed state: WontFix

GoogleCodeExporter commented 9 years ago

Hi!

Thanks for your answer, but it doesn't work. The actual option for German in
tesseract-ocr is

-l deu

but the umlaute aren't recognized that way. gocr doesn't show umlaute either.

I'll check with the new version asap.

Any ideas how cuneiform-linux could be used? (But that's probably worth another
issue.)

so long
hank

Original comment by hanksch...@googlemail.com on 25 Jan 2010 at 10:24

GoogleCodeExporter commented 9 years ago

Hi again!

No change with version 0.6. An example:

Akzeptanz von jugendlichen im Ã¶ffentlichen Raum geworben. Ãber den
FÃ¶rderfonds werden gezielt jugendprojekte entwickelt und unterstÃ¼tzt.

It should read:

Akzeptanz von jugendlichen im öffentlichen Raum geworben. Über den
Förderfonds werden gezielt jugendprojekte entwickelt und unterstützt.

This is what I used in the arguments field:

$IMAGE $FILE -l deu ; cat $FILE.txt
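
For readers unfamiliar with the arguments line: `$IMAGE` and `$FILE` are placeholders for the input image and the output base name. Tesseract writes its result to `$FILE.txt`, and the trailing `cat` prints that file to standard output. The sketch below only illustrates the substitution with Python's `string.Template`; how OCRFeeder actually expands the placeholders is an assumption, and `expand_arguments` and the `/tmp` paths are hypothetical.

```python
from string import Template

# The arguments line quoted above; $IMAGE and $FILE are its placeholders.
ARGUMENTS = "$IMAGE $FILE -l deu ; cat $FILE.txt"


def expand_arguments(arguments, image_path, output_base):
    """Fill in the placeholders (illustrative helper, not OCRFeeder code)."""
    return Template(arguments).substitute(IMAGE=image_path, FILE=output_base)


# Hypothetical paths, only to show the resulting shell command line.
print("tesseract " + expand_arguments(ARGUMENTS, "/tmp/page.png", "/tmp/page"))
```

The printed line is the shell command that would run, so the text the application sees is whatever bytes `cat` emits.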

Any chance to fix that? OCRFeeder is using UTF-8, if I read the code correctly.
Is there a way to use something else instead?

so long
hank

Original comment by hanksch...@googlemail.com on 26 Jan 2010 at 4:40

GoogleCodeExporter commented 9 years ago

Hi,

I think you may be right. It might be a problem with the encoding when reading
the file or displaying the contents. UTF-8 itself, though, should not be the
problem.
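
For reference, the garbled sample in the previous comment (Ã¶ for ö, Ã¼ for ü) is the pattern produced when UTF-8 bytes are decoded as Latin-1. The Python sketch below reproduces it; it is an illustration of the symptom, not OCRFeeder's code, and the names `garble` and `repair` are made up for the example.

```python
def garble(text):
    """Simulate the symptom: encode as UTF-8, then wrongly decode as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled):
    """Undo the wrong decoding step (only works if no bytes were lost)."""
    return garbled.encode("latin-1").decode("utf-8")


sample = "Förderfonds unterstützt"
broken = garble(sample)
print(broken)           # FÃ¶rderfonds unterstÃ¼tzt
print(repair(broken))   # Förderfonds unterstützt
```

This also suggests why UTF-8 itself is not at fault: the engine's bytes are fine, and only the decoding step picks the wrong character set.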

I'll address that as soon as I can.

Thank you,

Original comment by joaquimr...@gmail.com on 26 Jan 2010 at 4:44

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Hi hank,

Could you please attach a file with an example of the text with umlauts that is
failing for you so I can focus on a real example and fix it?

Thank you,

Original comment by joaquimr...@gmail.com on 3 Mar 2010 at 2:07

GoogleCodeExporter commented 9 years ago

Hi Hank,

Even though you haven't sent the file I asked for in my previous message, I
think I have fixed the issue. It turns out that the text encoding was handled
correctly only for Ocrad and not for any other engine.

I fixed this, and the engines' output is now expected to be in UTF-8 (many
engines allow a parameter to set this).
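
A minimal Python sketch of the idea, assuming the engine prints its text to standard output: capture the raw bytes and decode them explicitly as UTF-8 instead of relying on a default. This is not the actual OCRFeeder change; `run_engine` is a hypothetical helper, and the stand-in "engine" is just a Python one-liner that emits UTF-8 bytes.

```python
import subprocess
import sys


def run_engine(command):
    """Run an OCR engine command (a list of arguments) and return its text.

    The output bytes are decoded explicitly as UTF-8; undecodable bytes are
    replaced rather than raising, so one bad byte does not lose the page.
    """
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


# Stand-in for a real engine: writes the UTF-8 bytes for "über" to stdout.
fake_engine = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write('\\u00fcber'.encode('utf-8'))",
]
print(run_engine(fake_engine))
```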

This will be available on the next release.

Original comment by joaquimr...@gmail.com on 4 Mar 2010 at 2:42

GoogleCodeExporter commented 9 years ago

Hi!

Sorry, didn't get to it yesterday - do you still need it? 

Would be great if you could fix that one! Do you have an svn version for
testing?

so long
hank

Original comment by hanksch...@googlemail.com on 4 Mar 2010 at 4:33

GoogleCodeExporter commented 9 years ago

Hi Hank,

I got a git version :)

http://git.gnome.org/browse/ocrfeeder/

Let me know if this version already works for you.

Original comment by joaquimr...@gmail.com on 4 Mar 2010 at 4:36

GoogleCodeExporter commented 9 years ago

Hi!
I've found it, but it doesn't work on my new machine (AMD quad-core, using
Ubuntu 9.10, 32-bit version).

I get

Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.6/studio/widgetModeler.py", line 370, in performBoxDetection
    self.performBoxDetectionForReviewer(image_reviewer)
  File "/usr/lib/pymodules/python2.6/studio/widgetModeler.py", line 384, in performBoxDetectionForReviewer
    image_processor = ImageProcessor(image_reviewer.path_to_image, window_size)
  File "/usr/lib/pymodules/python2.6/feeder/imageManipulation.py", line 40, in __init__
    raise ImageManipulationError
feeder.imageManipulation.ImageManipulationError

with both 0.6.0 and the git version. Is something missing? I'll try on my "old"
machine, where at least 0.6 was working (well, kind of ;-) )

so long
hank

Original comment by hanksch...@googlemail.com on 4 Mar 2010 at 5:37

GoogleCodeExporter commented 9 years ago

Hi,

Could you please attach the file you are trying to recognize? (So I can check
if I get the same error.)

Original comment by joaquimr...@gmail.com on 4 Mar 2010 at 6:04

GoogleCodeExporter commented 9 years ago

Hi!

This is rather weird...
I gave it a try on the old machine, and it looks like ocrfeeder doesn't like my
standard .tif scans produced by xsane... PNG pictures work out just fine!
Great! All umlauts are recognised!

I attach the non-working .tiff, and the same in png.

so long
hank

Original comment by hanksch...@googlemail.com on 4 Mar 2010 at 6:48

Attachments:

test0001.tiff

GoogleCodeExporter commented 9 years ago

Oops, attached the same file twice. Here's the png ...

Original comment by hanksch...@googlemail.com on 4 Mar 2010 at 6:50

Attachments:

test0002.png

GoogleCodeExporter commented 9 years ago

Hi Hank,

So, the problem was that the TIFF image you're using is encoded in a way that
is not supported by the Python Imaging Library, and when OCRFeeder attempts to
open it, it gives that error.

I tried converting it to "tiff" (I know...) using ImageMagick and it then
works; IM might use a different compression algorithm.
To convert this image the way I did, simply enter:
$ convert test0001.tiff right_test.tiff
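
One way to check the compression guess without extra libraries is to read the Compression tag (tag 259) from the TIFF's first image directory and compare the value before and after the conversion. The sketch below assumes a classic (non-BigTIFF) file; `tiff_compression` is a hypothetical helper, not part of OCRFeeder or PIL. In the TIFF specification, 1 means uncompressed and 5 means LZW.

```python
import struct

COMPRESSION_TAG = 259  # TIFF tag number for "Compression"


def tiff_compression(data):
    """Return the Compression tag value from the first IFD of a TIFF file."""
    if data[:2] == b"II":
        endian = "<"   # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"   # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    for index in range(entry_count):
        start = ifd_offset + 2 + 12 * index   # each IFD entry is 12 bytes
        (tag,) = struct.unpack(endian + "H", data[start:start + 2])
        if tag == COMPRESSION_TAG:
            # A single SHORT value sits in the first 2 bytes of the value field.
            (value,) = struct.unpack(endian + "H", data[start + 8:start + 10])
            return value
    return 1  # tag absent: the TIFF default is "no compression"
```

Usage would look like `tiff_compression(open("test0001.tiff", "rb").read())`, run once on the original scan and once on `right_test.tiff`.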

It's amazing that, with all the images I have tried with OCRFeeder and all its
users considered, no such error has ever been reported. I wonder how you are
creating that image.

I'll close this as fixed because I have fixed the umlaut cases, and the image
format problem is not a common use case; also, one can always convert the
images.
Nonetheless, when this occurs now, a warning dialog will pop up telling the
user that an error occurred and that the image should be converted to an
appropriate format.

Cheers,

Original comment by joaquimr...@gmail.com on 5 Mar 2010 at 2:11

Changed state: WontFix

GoogleCodeExporter commented 9 years ago

I mistakenly set it as won't fix...

Setting as fixed now..

Original comment by joaquimr...@gmail.com on 5 Mar 2010 at 2:12

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Hi!

The tiff was a standard scan from xsane; actually I had problems with that
format before, using tesseract, and had to convert those files, too (I trained
tesseract to recognise an old Latin-German dictionary).

I thought that was a tesseract-only problem, but the same message appeared when
trying to use ocrad as the engine for ocrfeeder.

thanks!

so long
hank

Original comment by hanksch...@googlemail.com on 5 Mar 2010 at 4:09

ZhangXinNan / ocrfeeder

"umlaute" are not properly recognized #8