Closed GoogleCodeExporter closed 9 years ago
Please attach the input image, not the result.
Original comment by zde...@gmail.com
on 17 Dec 2011 at 7:45
I have attached an image. As I said, this happens with ALL images I have
tested, which is around 20, all different pages from different documents, and
all in 1-bit B&W high-res with noise removed.
Original comment by 569...@gmail.com
on 17 Dec 2011 at 7:54
Attachments:
I cannot confirm this. See attachments. I created them this way:
tesseract 3.png 3-30 -l eng (tesseract 3.00 on windows XP)
tesseract 3.png 3-301 -l eng -psm 6 (tesseract 3.01 on windows XP)
In 3-30.txt there are 2 mistakes: "em dash" is recognized as "-".
In 3-301.txt there is 1 mistake: "e" is recognized as "c" (in the word
"gentleman").
All other characters are recognized identically (tested with kdiff).
I just did a quick cross-check on Mandrivalinux 2011 (tesseract built from
source) and got the same result for version 3.01 as on Windows. I do not have
revision 3.00 on Linux, but I expect to get the same result as on Windows.
Original comment by zde...@gmail.com
on 17 Dec 2011 at 1:30
Attachments:
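For anyone repeating the comparison above without kdiff, here is a minimal sketch that lists the character-level differences between two OCR outputs. The function name and the sample strings are illustrative assumptions, not part of the attachments; in practice the two arguments would be the contents of files such as 3-30.txt and 3-301.txt.

```python
import difflib


def char_substitutions(expected: str, actual: str):
    """Return (expected_chunk, actual_chunk) pairs where the two texts differ."""
    # autojunk=False keeps difflib from ignoring frequent characters in long texts.
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((expected[i1:i2], actual[j1:j2]))
    return diffs


# Hypothetical sample: one OCR output reads "gcntleman" instead of "gentleman".
for expected_chunk, actual_chunk in char_substitutions("a gentleman", "a gcntleman"):
    print(repr(expected_chunk), "->", repr(actual_chunk))
```

An empty list means the two outputs are identical character for character.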
Hello, thanks for your investigation of this issue.
I am aware that on other OS builds it appears to function normally. The issue
occurs only on CentOS 6. Tesseract 3.01 appears to have a problem on CentOS 6.
Original comment by 569...@gmail.com
on 18 Dec 2011 at 6:15
Sorry, but then this is not a tesseract problem ;-)
I built tesseract 3.00 (r525) on Mandrivalinux and got the same results as
on Windows... see attachment.
Original comment by zde...@gmail.com
on 18 Dec 2011 at 6:02
Attachments:
Hello, you will need to replicate it on CentOS 6. The fact that Tesseract 3.01
runs without displaying errors but gives very poor results means that something
is not working as it is supposed to. The fact that 3.00 does not have this
issue shows that it IS something in the Tesseract 3.01 code which is causing
this to happen.
I very much appreciate you trying to get to the bottom of this, but if you want
to replicate it you can only do so on CentOS 6. I am not an idiot, this is a
real issue and it is a Tesseract 3.01 issue which has been caused by some of
the code changes between 3.00 & 3.01 somehow becoming incompatible with CentOS
6.
Original comment by 569...@gmail.com
on 19 Dec 2011 at 1:36
I do not want to argue, but let's consider an analogy:
if in different environments (Windows, Mandrivalinux and iOS, as explained on
the forum) the code replies "2" to the question "how much is 1 + 1", but in
your environment (CentOS 6) you get "3", then I would not blame the code but
the environment ;-)...
I have access to one CentOS 5.6 (or 5.5) server, so I will try it there later,
but I am not able to upgrade it to 6.0...
Original comment by zde...@gmail.com
on 19 Dec 2011 at 7:41
Don't get me wrong, I appreciate your help. Only I have a fairly good idea
about the conditions that cause this issue and you appear to have been ignoring
this, testing the image (which has no problem) on a different environment to
mine, and then reporting that there is no problem... the issue does not appear
except on CentOS 6 as far as I know (it would be worth testing the equivalent
Red Hat version) and I explained this very clearly.
It is unlikely that this can be replicated on CentOS 5, as there have been
significant changes since that release. I understand your abstraction, and I
agree that something in CentOS 6 is not acting as predicted, but it is a code
change that has caused Tesseract to not function correctly in some
environments. I would call this a bug. GCC is version 4.4.4-13, if that helps.
Original comment by 569...@gmail.com
on 19 Dec 2011 at 8:00
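Since the disagreement above is about which environment produces which result, a small helper that records the environment next to each OCR run may help. This is only a sketch: the function names are made up for illustration, and it assumes `gcc` and `tesseract` are on the PATH and accept `--version` (older tesseract 3.x builds may only accept `-v`). A missing tool is reported as `None` rather than raising an error.

```python
import platform
import shutil
import subprocess


def tool_version(command):
    """Run a version command; return its first output line, or None if unavailable."""
    if shutil.which(command[0]) is None:
        return None
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return None
    # Some tools print their version on stderr, so look at both streams.
    output = (result.stdout + result.stderr).strip()
    return output.splitlines()[0] if output else None


def describe_environment():
    """Collect the details most often asked for in this thread."""
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "gcc": tool_version(["gcc", "--version"]),
        "tesseract": tool_version(["tesseract", "--version"]),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```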
I would really appreciate a little guidance on why the accuracy of my
installation is so poor. I have scanned the example above (3.png) and run both
3.00 and 3.01 on it, both of which give awful results - please see attached.
I installed 3.00 after reading this issue in the hope that it would be better
than 3.01, but to no avail.
I am using CentOS 5.4 and the default English trained data for both versions.
What am I doing wrong please?
Original comment by bamfords...@gmail.com
on 4 Apr 2012 at 4:44
Attachments:
Issue 671 has been merged into this issue.
Original comment by zde...@gmail.com
on 5 Apr 2012 at 5:21
I wonder why these two issues have been merged? My concern is that they
are felt to be the same which they are certainly not.
The only reason I referenced 596 from 671 was to use his output in my
illustration!
Otherwise, they are from opposite ends of the spectrum - the original 596
was getting EXCELLENT results with 3.00 and suffered a degradation in
quality with 3.01. I, on the other hand am getting AWFUL results with both
and want to know what we are doing differently.
I hope someone can please shed some light.
Thanks for your attention on this matter.
- Chris
Original comment by bamfords...@gmail.com
on 6 Apr 2012 at 9:25
Chris,
1. This kind of problem (poor quality of 3.01) is reported ONLY by CentOS
user(s).
2. As you can see, on other systems I got almost the same result for versions
3.00 and 3.01. So the result of user 569234 is strange to me and there could be
something else...
3. If I get good results on Windows XP (32-bit), Mandrivalinux 2011 64-bit and
openSUSE 12.1 64-bit (as far as I remember, the same results), why do CentOS
users get wrong results (for the same image with the same tesseract version)?
For me this is a CentOS issue in both cases (or maybe hardware or something
else). I believe that if we find the reason for issue 596, 671 will be solved
too.
Original comment by zde...@gmail.com
on 6 Apr 2012 at 10:14
Thanks, that makes sense. Sounds like it could be the underlying libraries
perhaps - I will do some more investigations on Ubuntu today and see what comes
up.
Chris
Original comment by bamfords...@gmail.com
on 10 Apr 2012 at 9:17
I have just tried Ubuntu 11.10 32-bit and it gives me identical results, so
it's not just a CentOS issue any more!
I will try 64-bit next.
Original comment by bamfords...@gmail.com
on 10 Apr 2012 at 11:42
Confirmed on CentOS 6.
I have 3.01.
Cuneiform also gives strange results :( but I am using good sample
pictures.
Original comment by pm.essen...@gmail.com
on 21 Jun 2012 at 7:50
Issue 498 has been merged into this issue.
Original comment by zde...@gmail.com
on 24 Jul 2012 at 8:00
Since #14 also mentioned ubuntu, this might help:
I've tested tesseract v3.02 with Leptonica on ubuntu server 12.10 amd64
(packages from ubuntu tesseract-ocr 3.02.01-6 with liblept3 1.69-3.1ubuntu1)
Results with the above provided 3.png imho look fine, text file attached.
(Only the C in gentleman - like in #3)
Original comment by andreas....@gmail.com
on 24 Jan 2013 at 1:26
Attachments:
Any suggestions for CentOS users? I also get bad results on CentOS 6.4 with
Leptonica 1.67 and Tesseract 3.0.0
Original comment by rodrigo...@gmail.com
on 13 May 2013 at 2:36
I've tested this png (3.png from #2) on CentOS 6.4 (on x86_64).
The result is the same as in #17, byte for byte.
Env:
- Linux 2.6.32-358.18.1.el6.x86_64
- leptonica 1.69-1.el6.x86_64
- tesseract 3.02-1.el6.x86_64
P.S. I used RPMs built from my repo
(https://github.com/grossws/tesseract-ocr-specs)
Original comment by gros...@gmail.com
on 26 Nov 2013 at 7:20
Attachments:
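A byte-for-byte comparison like the one reported above can be checked mechanically by hashing both output files. The sketch below is an illustration only; the file names in the usage comment are hypothetical and stand for OCR outputs produced on two different machines.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def same_bytes(path_a, path_b):
    """True only if the two files are identical byte for byte."""
    return file_digest(path_a) == file_digest(path_b)


# Usage (hypothetical file names):
#   same_bytes("3-ubuntu.txt", "3-centos.txt")
```

Posting the digest together with the distribution and package versions makes it easy for others to confirm whether they get the same output.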
@grossws: thanks.
I am closing this issue as "WorkForMe" (can't reproduce the bug), because it
does not look like a tesseract-ocr issue (it is not possible to reproduce it,
and there are reports with good results...).
Also, I would like to point to forum discussions [1][2] that explain that
different compilers and optimizations cause different OCR results.
[1] https://groups.google.com/d/msg/tesseract-ocr/zkk9Dl86aL8/tOa1lSBmYAsJ
[2] https://groups.google.com/d/msg/tesseract-ocr/Y4TGo_Qo8Vw/m76rM_JzFLkJ
Original comment by zde...@gmail.com
on 26 Nov 2013 at 8:13
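One plausible way compilers and optimization levels could change OCR output is floating-point evaluation order. This is an assumption offered for illustration, not something confirmed in this thread, and the linked discussions may describe other causes. Floating-point addition is not associative, so a compiler that reorders a sum can produce a slightly different value, and a value close to a decision threshold can then fall on the other side. The numbers and the threshold below are made up; they are not taken from the Tesseract code.

```python
# The same three numbers, summed in two different orders.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)        # the two sums are not bit-identical
print(abs(left - right))    # the difference is tiny

# A hypothetical threshold test, standing in for any classifier decision.
threshold = 0.6
print(left > threshold, right > threshold)
```

The difference is far below anything visible in the inputs, yet the two comparisons against the threshold disagree, which is how a tiny numeric difference could in principle turn into a different recognized character.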
Issue 1497 has been merged into this issue.
Original comment by zde...@gmail.com
on 20 Jul 2015 at 8:10
Original issue reported on code.google.com by
569...@gmail.com
on 17 Dec 2011 at 6:44