Wamae / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

3.01 gives poor results under CentOS 6. Tesseract 3.00 does not have this issue. #596

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Install Tesseract 3.01 on CentOS 6
2. Use the standard English trained data (3.01 or 3.00)
3. Execute tesseract "test.tif" "testout" -l eng

** This occurs on ALL images tested, it is not an issue with the test image. **

What is the expected output? What do you see instead?

Tesseract 3.01 example:

LTARMER MEANWELL was at one time a very rich 
man. He owned large ï¬elds, and had ï¬ne ï¬ocks of 
sheep, and plenty of money.

[Note: exactly the same using either 3.00 or 3.01 trained data.]

Tesseract 3.00 example:

LTARMER MEANWELL was at one time a very rich 
man. He owned large fields, and had fine flocks of 
sheep, and plenty of money.

[Note: using 3.00 trained data.]
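One possible reading of the two examples: the "ï¬" pairs in the 3.01 output are what the first two bytes of the UTF-8 "fi" and "fl" ligatures (U+FB01, U+FB02) look like when the text is viewed as Latin-1. The sketch below only illustrates that reading; the sample string is hypothetical, and whether the reporter's 3.01 output really contained ligatures is an assumption, not something the thread confirms.

```python
import unicodedata

# Hypothetical OCR output containing the "fi" (U+FB01) and "fl" (U+FB02) ligatures.
ocr_text = "He owned large \ufb01elds, and had \ufb01ne \ufb02ocks of sheep"

# Viewing the UTF-8 bytes as Latin-1 turns each ligature into "ï¬" plus an
# invisible control character, which matches the debris in the 3.01 example.
misread = ocr_text.encode("utf-8").decode("latin-1")

# NFKC normalization folds the ligatures back into plain "fi" / "fl".
normalized = unicodedata.normalize("NFKC", ocr_text)

print(repr(misread))
print(normalized)
```

If this is what happened, the recognition itself would be correct and only the viewer's encoding (or a missing ligature-folding step) would differ between the two versions.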

What version of the product are you using? On what operating system?
Tesseract 3.01 & 3.00, CentOS 6, Leptonica 1.68

Please provide any additional information below.

Just to clarify, this is the same with all images tested. Images are 600dpi 
2500x4000, 1bit B&W with all noise removed. I have set the environment variable 
"export TESSDATA_PREFIX=/usr/local/share/tessdata ". Leptonica is installed 
successfully.
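Note that the quoted export above ends with a space before the closing quote. A minimal sketch for checking where the trained data is actually found under the prefix follows. The helper name `find_traineddata` is hypothetical, and whether TESSDATA_PREFIX should name the tessdata directory itself or its parent is an assumption to verify against the installed version, so the sketch checks both layouts.

```python
import os
from pathlib import Path
from typing import List


def find_traineddata(prefix: str, lang: str = "eng") -> List[Path]:
    """Return the candidate paths under `prefix` where <lang>.traineddata exists."""
    # Strip whitespace: a stray trailing space in the variable is easy to miss.
    base = Path(prefix.strip())
    candidates = [
        base / f"{lang}.traineddata",               # prefix names tessdata itself
        base / "tessdata" / f"{lang}.traineddata",  # prefix names the parent directory
    ]
    return [p for p in candidates if p.is_file()]


if __name__ == "__main__":
    prefix = os.environ.get("TESSDATA_PREFIX", "")
    if prefix:
        print(find_traineddata(prefix))
    else:
        print("TESSDATA_PREFIX is not set")
```

An empty list means neither layout contains the language file, which would be worth ruling out before comparing versions.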

The poor accuracy occurs only on Tesseract 3.01. Tesseract 3.00 performs as 
usual.

Alasdair

Original issue reported on code.google.com by 569...@gmail.com on 17 Dec 2011 at 6:44

GoogleCodeExporter commented 9 years ago
Please attach the input image, not just the result.

Original comment by zde...@gmail.com on 17 Dec 2011 at 7:45

GoogleCodeExporter commented 9 years ago
I have attached an image. As I said, this happens with ALL images I have 
tested, which is around 20, all different pages from different documents, and 
all in 1-bit B&W high-res with noise removed.

Original comment by 569...@gmail.com on 17 Dec 2011 at 7:54

Attachments:

GoogleCodeExporter commented 9 years ago
I cannot confirm this. See the attachments, which I created this way:
tesseract 3.png 3-30 -l eng (tesseract 3.00 on windows XP)
tesseract 3.png 3-301 -l eng -psm 6 (tesseract 3.01 on windows XP)

3-30.txt contains 2 mistakes: the em dash is recognized as "-".
3-301.txt contains 1 mistake: "e" is recognized as "c" (in the word "gentleman").

All other chars are recognized identically (tested with kdiff).
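A character-level comparison like the one described above can also be scripted. This is a minimal sketch using Python's standard `difflib` rather than kdiff; the two sample strings are hypothetical stand-ins for the contents of 3-30.txt and 3-301.txt, not the real attachments.

```python
import difflib


def char_differences(a: str, b: str):
    """Return (tag, a_fragment, b_fragment) for every span where a and b differ."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical outputs: one misreads the em dash (U+2014), the other misreads an "e".
out_300 = "a fine old gentleman - so they said"
out_301 = "a fine old gentlcman \u2014 so they said"

for tag, left, right in char_differences(out_300, out_301):
    print(tag, repr(left), repr(right))
```

Reading the real files with `open(path, encoding="utf-8").read()` and passing them to `char_differences` would list every differing span between two runs.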

I also ran a quick cross-check on Mandrivalinux 2011 (tesseract built from 
source) and got the same result for version 3.01 as on Windows. I do not have 
revision 3.00 on Linux, but I expect it to give the same result as on Windows.

Original comment by zde...@gmail.com on 17 Dec 2011 at 1:30

Attachments:

GoogleCodeExporter commented 9 years ago
Hello, thanks for your investigation of this issue.

I am aware that on other OS builds it appears to function normally. The issue 
occurs only on CentOS 6. Tesseract 3.01 appears to have a problem on CentOS 6.

Original comment by 569...@gmail.com on 18 Dec 2011 at 6:15

GoogleCodeExporter commented 9 years ago
Sorry, but then this is not a tesseract problem ;-)
I built tesseract 3.00 (r525) on Mandrivalinux and got the same results as 
on Windows... see attachment.

Original comment by zde...@gmail.com on 18 Dec 2011 at 6:02

Attachments:

GoogleCodeExporter commented 9 years ago
Hello, you will need to replicate it on CentOS 6. The fact that Tesseract 3.01 
runs without displaying errors but gives very poor results means that something 
is not working as it is supposed to. The fact that 3.00 does not have this 
issue shows that it IS something in the Tesseract 3.01 code which is causing 
this to happen.

I very much appreciate you trying to get to the bottom of this, but if you want 
to replicate it you can only do so on CentOS 6. I am not an idiot, this is a 
real issue and it is a Tesseract 3.01 issue which has been caused by some of 
the code changes between 3.00 & 3.01 somehow becoming incompatible with CentOS 
6.

Original comment by 569...@gmail.com on 19 Dec 2011 at 1:36

GoogleCodeExporter commented 9 years ago
I do not want to argue, but let's make an abstraction:
if in different environments (Windows, Mandrivalinux and iOS, as explained on 
the forum) the code replies "2" to the question "how much is 1 + 1", but in your 
environment (CentOS 6) you get "3", then I would blame the environment rather 
than the code ;-)... 

I have access to one CentOS 5.6 (or 5.5) server so I will try it later there, 
but I am not able to upgrade it to 6.0...

Original comment by zde...@gmail.com on 19 Dec 2011 at 7:41

GoogleCodeExporter commented 9 years ago
Don't get me wrong, I appreciate your help. Only I have a fairly good idea 
about the conditions that cause this issue and you appear to have been ignoring 
this, testing the image (which has no problem) on a different environment to 
mine, and then reporting that there is no problem... the issue does not appear 
except on CentOS 6 as far as I know (it would be worth testing the equivalent 
Red Hat version) and I explained this very clearly.

It is unlikely that this can be replicated on CentOS 5, as there have been 
significant changes since that release. I understand your abstraction, and I 
agree that something in CentOS 6 is not acting as predicted, but it is a code 
change that has caused Tesseract to not function correctly in some 
environments; I would call this a bug. GCC is version 4.4.4-13, if that helps.

Original comment by 569...@gmail.com on 19 Dec 2011 at 8:00

GoogleCodeExporter commented 9 years ago
I would really appreciate a little guidance on why the accuracy of my 
installation is so poor.  I have scanned the example above (3.png) and run both 
3.00 and 3.01 on it, both of which give awful results - please see attached.
I installed 3.00 after reading this issue in the hope that it would be better 
than 3.01, but to no avail.
I am using CentOS 5.4 and the default English trained data for both versions.
What am I doing wrong please?

Original comment by bamfords...@gmail.com on 4 Apr 2012 at 4:44

Attachments:

GoogleCodeExporter commented 9 years ago
Issue 671 has been merged into this issue.

Original comment by zde...@gmail.com on 5 Apr 2012 at 5:21

GoogleCodeExporter commented 9 years ago
I wonder why these two issues have been merged?  My concern is that they
are felt to be the same which they are certainly not.
The only reason I referenced 596 from 671 was to use his output in my
illustration!

Otherwise, they are from opposite ends of the spectrum - the original 596
was getting EXCELLENT results with 3.00 and suffered a degradation in
quality with 3.01.  I, on the other hand, am getting AWFUL results with both
and want to know what we are doing differently.

I hope someone can please shed some light.

Thanks for your attention on this matter.

- Chris

Original comment by bamfords...@gmail.com on 6 Apr 2012 at 9:25

GoogleCodeExporter commented 9 years ago
Chris, 
1. This kind of problem (poor quality of 3.01) is reported ONLY by CentOS 
user(s).
2. As you can see, on other systems I got almost the same result for versions 
3.00 and 3.01. So the result of user 569234 seems strange to me, and there could 
be something else going on...
3. If I get good results on Windows XP (32bit), Mandrivalinux 2011 64bit and 
openSUSE 12.1 64bit (as far as I remember, the same ones), why do CentOS users 
get wrong results (for the same image with the same tesseract version)?

To me this is a CentOS issue in both cases (or maybe hardware or something 
else). I believe that if we find the reason for issue 596, 671 will be solved too.

Original comment by zde...@gmail.com on 6 Apr 2012 at 10:14

GoogleCodeExporter commented 9 years ago
Thanks, that makes sense.  Sounds like it could be the underlying libraries 
perhaps - I will do some more investigations on Ubuntu today and see what comes 
up.
Chris

Original comment by bamfords...@gmail.com on 10 Apr 2012 at 9:17

GoogleCodeExporter commented 9 years ago
I have just tried Ubuntu 11.10 32-bit and it gives me identical results, so it's 
not just a CentOS issue any more!
I will try 64-bit next.

Original comment by bamfords...@gmail.com on 10 Apr 2012 at 11:42

GoogleCodeExporter commented 9 years ago
Confirmed on CentOS 6. 
I have 3.01.
Cuneiform also gives strange results :( but I am using good sample 
images.

Original comment by pm.essen...@gmail.com on 21 Jun 2012 at 7:50

GoogleCodeExporter commented 9 years ago
Issue 498 has been merged into this issue.

Original comment by zde...@gmail.com on 24 Jul 2012 at 8:00

GoogleCodeExporter commented 9 years ago
Since #14 also mentioned ubuntu, this might help:

I've tested tesseract v3.02 with Leptonica on ubuntu server 12.10 amd64 
(packages from ubuntu tesseract-ocr 3.02.01-6 with liblept3 1.69-3.1ubuntu1)

The results for the 3.png provided above look fine to me; the text file is attached. 
(The only error is the "c" in "gentleman", as in #3.)

Original comment by andreas....@gmail.com on 24 Jan 2013 at 1:26

Attachments:

GoogleCodeExporter commented 9 years ago
Any suggestions for CentOS users? I also get bad results on CentOS 6.4 with 
Leptonica 1.67 and Tesseract 3.0.0

Original comment by rodrigo...@gmail.com on 13 May 2013 at 2:36

GoogleCodeExporter commented 9 years ago
I've tested this png (3.png from #2) on CentOS 6.4 (on x86_64).

The result is the same as in #17, byte for byte.

Env: 
- Linux 2.6.32-358.18.1.el6.x86_64
- leptonica 1.69-1.el6.x86_64
- tesseract 3.02-1.el6.x86_64

P.S. I used RPMs built from my repo 
(https://github.com/grossws/tesseract-ocr-specs)
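Because the results in this thread depend on the environment, a report like the one above is easier to compare when it is collected the same way every time. A minimal sketch follows; the helper name `environment_report` is hypothetical, and it assumes the installed build accepts `--version` (older builds may need `-v` instead).

```python
import platform
import shutil
import subprocess


def environment_report(binary: str = "tesseract") -> dict:
    """Collect basic OS details and, if the binary is on PATH, its version text."""
    report = {
        "system": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
    }
    path = shutil.which(binary)
    report["binary"] = path or "not found"
    if path:
        try:
            # Assumption: this build understands --version.
            result = subprocess.run(
                [path, "--version"], capture_output=True, text=True, timeout=10
            )
            # Some builds print version details to stderr rather than stdout.
            report["version"] = (result.stdout or result.stderr).strip()
        except (OSError, subprocess.SubprocessError) as exc:
            report["version"] = f"unavailable ({exc})"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```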

Original comment by gros...@gmail.com on 26 Nov 2013 at 7:20

Attachments:

GoogleCodeExporter commented 9 years ago
@grossws: thanks.

I am closing this issue as "WorkForMe" (can't reproduce the bug), because it 
does not look like a tesseract-ocr issue (it is not possible to reproduce it, 
and there are reports with good results...)

I would also like to point to forum discussions[1][2] that explain that 
different compilers and optimizations can cause different OCR results.

[1] https://groups.google.com/d/msg/tesseract-ocr/zkk9Dl86aL8/tOa1lSBmYAsJ
[2] https://groups.google.com/d/msg/tesseract-ocr/Y4TGo_Qo8Vw/m76rM_JzFLkJ
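One general way that compilers and optimization levels can change numeric results is by reordering floating-point operations, since floating-point addition is not associative. Whether this is the mechanism discussed in the linked threads is an assumption; the sketch below only shows the arithmetic effect, with a hypothetical threshold standing in for any score-based decision.

```python
# The same three values summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# Hypothetical decision threshold, e.g. a cut-off on a classifier score.
threshold = 0.6

print(left_to_right == right_to_left)  # the two sums differ in the last bit
print(left_to_right > threshold, right_to_left > threshold)
```

A difference in the last bit is usually harmless, but when a value sits right at a cut-off it can flip a decision, which is one way two builds of the same source could disagree on a character.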

Original comment by zde...@gmail.com on 26 Nov 2013 at 8:13

GoogleCodeExporter commented 9 years ago
Issue 1497 has been merged into this issue.

Original comment by zde...@gmail.com on 20 Jul 2015 at 8:10