AmitGorvadiya / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

[Feature Request] Dictionaries should provide an easy way to identify them automatically #346

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Problem:
Currently it is hard to reliably detect installed dictionaries without 
iterating through a list of known / expected dictionaries, since the only 
information the dictionaries provide about themselves is the prefix of the 
filenames. 

Suggestion:
I'd suggest adding a file *.langinfo or similar containing at least:
- The full name of the language (possibly in the language itself)
- The ISO 639-1 language code
The language code is especially useful for automatically setting up spellcheck 
without having to rely on a look-up table.

Original issue reported on code.google.com by manisan...@gmail.com on 13 Aug 2010 at 3:40

GoogleCodeExporter commented 9 years ago
There's code to do something like this in the Android tree. Patches welcome.

Original comment by joregan on 18 Aug 2010 at 10:04

GoogleCodeExporter commented 9 years ago
Here is the code ported from Android tree. I've tested it only on Windows. The 
original code was compilable only on Linux so I hope I did not break it. :)

Original comment by max.mar...@gmail.com on 25 Aug 2010 at 8:00

Attachments:

GoogleCodeExporter commented 9 years ago
When I try to apply the patch (patch -p1 
</opt2/svn/tesseract-ocr-read-only/add_lang.diff) I get an error (on 64-bit Linux):

patching file api/langinfo.h
patching file api/langinfo.cpp
patch unexpectedly ends in middle of line
patch: **** malformed patch at line 196: 

Can you please create the patch once again or send me the modified files?

Original comment by zde...@gmail.com on 25 Aug 2010 at 7:18

GoogleCodeExporter commented 9 years ago
Oops... I had to learn how svn diff works with particular files instead of 
directly editing the patch. 

This is quite a big code churn, so I thought I'd provide two diffs. The first 
contains only the language info API ported from the Android sources. The second 
contains the API integration into the tesseract executable (project file and 
makefile). It's been a while since I last worked with makefiles, so no 
guarantees there. 

Original comment by max.mar...@gmail.com on 27 Aug 2010 at 7:08

Attachments:

GoogleCodeExporter commented 9 years ago
There is also a diff utility for Windows: 
http://gnuwin32.sourceforge.net/packages/diffutils.htm

Then you can create a patch with:
diff -Naur tesseract.org/ tesseract.new/ >your_patch.diff

Original comment by zde...@gmail.com on 27 Aug 2010 at 3:54

GoogleCodeExporter commented 9 years ago
The patches do not include langinfo.h (that is, after I applied your patches 
there is no file api/langinfo.h).

Original comment by zde...@gmail.com on 27 Aug 2010 at 4:15

GoogleCodeExporter commented 9 years ago
I extracted api/langinfo.h from your previous patch and had to make the 
following changes to compile it on Linux:

In tesseractmain.cpp: I changed "LangInfo langInfo;" to "tesseract::LangInfo 
langInfo;"
In langinfo.cpp: I needed to declare dp: "struct dirent * dp;"
I also needed to add langinfo to Makefile.in.

The attachment contains a patch with all changes.

Then I tried to run:
api/tesseract phototest.tif phototest -l eng
It ended with "Segmentation fault".
So I tried to run the installed version on another file:
/usr/local/bin/tesseract example.slk.tif example.slk
Then I got another error:
Cannot find language eng

So I played with strace and found out that it tries to open the file/directory 
"tessdata" in the directory where I run the command. 

In the first case tessdata existed and tesseract crashed after it opened this 
directory. I do not know whether this is a problem with my installation or with 
the code. I will try to find out later.

In the second case it did not find a "tessdata" directory in the current 
directory. From my point of view this is the bigger problem, because it should 
also look in the system (installation) directory, check the environment 
variable TESSDATA_PREFIX, check the HOME directory, and then try the current 
directory.
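
The lookup order described above could be sketched roughly as follows. This is only an illustration, not the code in the patch: `TessdataCandidates` and its parameters are hypothetical names, the order simply follows the sentence above, and the sketch assumes that each base directory (including the one named by TESSDATA_PREFIX) contains a `tessdata` subdirectory.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Appends "<dir>/tessdata" to out when dir is non-null and non-empty.
static void AddCandidate(const char* dir, std::vector<std::string>* out) {
  if (dir == nullptr || *dir == '\0') return;
  std::string path(dir);
  if (path.back() != '/') path += '/';
  out->push_back(path + "tessdata");
}

// Builds the list of directories to try, in the assumed order:
// installation directory, TESSDATA_PREFIX, HOME, then the current directory.
// Any argument may be null, in which case that location is skipped.
std::vector<std::string> TessdataCandidates(const char* install_dir,
                                            const char* prefix_env,
                                            const char* home_dir) {
  std::vector<std::string> out;
  AddCandidate(install_dir, &out);
  AddCandidate(prefix_env, &out);
  AddCandidate(home_dir, &out);
  out.push_back("tessdata");  // relative path, i.e. the current directory
  return out;
}

// Convenience wrapper that reads the two environment variables.
std::vector<std::string> TessdataCandidatesFromEnvironment(
    const char* install_dir) {
  return TessdataCandidates(install_dir, std::getenv("TESSDATA_PREFIX"),
                            std::getenv("HOME"));
}
```

The caller would walk the returned list and use the first entry that exists as a directory. Keeping the list construction separate from the file system check makes the order easy to test and to change.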

Original comment by zde...@gmail.com on 27 Aug 2010 at 5:51

Attachments:

GoogleCodeExporter commented 9 years ago
I found the reason why the previous patch segfaults on Linux: "dp" was a typo 
for "ent" ;-)
The corrected patch is attached.

Original comment by zde...@gmail.com on 27 Aug 2010 at 6:39

Attachments:

GoogleCodeExporter commented 9 years ago
Thanks Zdenko! I applied your tesseract-langinfo2.diff and it compiles and 
works as expected:
tesseract.exe ..\phototest.tif out -l aaaa
Cannot find language aaaa
The list of available languages:
eng

The only change I had to make in order to apply the patch is to rename my 
tesseract folder to tessseract-ocr.org. That's probably because I was using 
TortoiseSVN. 
I guess I'll give Ubuntu one more try. :)

>it did not find "tessdata" directory in current directory. And this is the 
greater problem from my point
I agree. I've updated tesseractmain.cpp to test usage of the language api. 

Original comment by max.mar...@gmail.com on 28 Aug 2010 at 12:23

GoogleCodeExporter commented 9 years ago
It would be better if you implemented an argument that prints the list of 
available languages, something like "tesseract.exe -a". 
And please add a check for the system/installation directory (if you need an 
example, search the tesseract source for "datadir").
Then post a new patch. I will test it. This is a very useful feature.
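
A minimal sketch of what such an argument could look like is shown here. The "-a" flag is only the name floated in this comment, and `WantsLanguageList` and `FormatLanguageList` are hypothetical helpers, not functions from the patch. The header line mirrors the output quoted earlier in this thread.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Returns true if the (hypothetical) "-a" flag appears among the arguments.
// argv[0] is the program name and is skipped.
bool WantsLanguageList(int argc, const char* const argv[]) {
  for (int i = 1; i < argc; ++i) {
    if (std::strcmp(argv[i], "-a") == 0) return true;
  }
  return false;
}

// Formats the language list as a header line followed by one code per line.
std::string FormatLanguageList(const std::vector<std::string>& langs) {
  std::string out = "The list of available languages:\n";
  for (const std::string& lang : langs) {
    out += lang;
    out += '\n';
  }
  return out;
}
```

In `main`, the program would check `WantsLanguageList(argc, argv)` before any OCR work, print the formatted list, and exit.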

Original comment by zde...@gmail.com on 28 Aug 2010 at 9:04

GoogleCodeExporter commented 9 years ago
Looks cool. I think the best thing would be to wait for 3.01, and try to do it 
in a way that's compatible with the Android code (to not have duplicate 
functionality).

Original comment by joregan on 30 Sep 2010 at 12:43

GoogleCodeExporter commented 9 years ago
Issue 89 has been merged into this issue.

Original comment by zde...@gmail.com on 6 Aug 2011 at 11:35

GoogleCodeExporter commented 9 years ago
Here are patches to extend the C++ API.
Source: https://groups.google.com/d/topic/tesseract-dev/J2-1budU2Bk/discussion

Original comment by zde...@gmail.com on 20 Aug 2012 at 3:01

Attachments:

GoogleCodeExporter commented 9 years ago
Patch for the C API (see issue 
http://code.google.com/p/tesseract-ocr/issues/detail?id=362), i.e. this patch 
depends on 001-tesseract-capi.patch and 002-tesseract-avail-lang-cppapi.patch

Original comment by zde...@gmail.com on 22 Aug 2012 at 7:41

Attachments:

GoogleCodeExporter commented 9 years ago
I've now tested these and I'm happy. :)

Original comment by JerseyChewi@gmail.com on 30 Aug 2012 at 8:33

GoogleCodeExporter commented 9 years ago
Committed to svn. Please test it.

Original comment by zde...@gmail.com on 24 Sep 2012 at 5:20

GoogleCodeExporter commented 9 years ago
The patch seems to have caused a new issue with tessdata path on Windows 
systems. It is reported here:

http://code.google.com/p/tesseract-ocr/issues/detail?id=764

Original comment by nguyen...@gmail.com on 27 Sep 2012 at 2:14

GoogleCodeExporter commented 9 years ago
The reported issue 764 has since been fixed.

Original comment by nguyen...@gmail.com on 28 Sep 2012 at 9:25

GoogleCodeExporter commented 9 years ago
Patch committed and additional issue fixed in r768.

Original comment by zde...@gmail.com on 28 Sep 2012 at 9:43

GoogleCodeExporter commented 9 years ago
The use of glob makes tesseract-android-tools unhappy, as the Android NDK 
doesn't currently provide glob.

So tesseract-android-tools can't build the latest tesseract for now.

Original comment by ozan...@gmail.com on 26 Nov 2012 at 8:21

GoogleCodeExporter commented 9 years ago
Hmm. Probably need to add an AC_CHECK_FUNC for glob to configure.ac then.

Original comment by JerseyChewi@gmail.com on 26 Nov 2012 at 8:57

GoogleCodeExporter commented 9 years ago
Also, the same functionality can be implemented using opendir() & scandir() 
from dirent.h, which is available on Android. I know that glob() is much 
easier, but simple filtering with scandir() would cost only a few more lines 
and has the advantage of being cross-platform.
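
As a rough illustration of the dirent.h approach, the sketch below lists language codes without glob(). It uses opendir()/readdir() rather than scandir(), assumes language data files are named `<lang>.traineddata`, and uses hypothetical function names (`LanguageFromFilename`, `ListLanguages`); it is not the patch posted for this issue.

```cpp
#include <dirent.h>

#include <algorithm>
#include <string>
#include <vector>

// Assumed file name suffix of a language data file.
static const char kSuffix[] = ".traineddata";

// If filename ends in the suffix and has a non-empty stem, stores the stem
// in *lang and returns true; otherwise returns false and leaves *lang alone.
bool LanguageFromFilename(const std::string& filename, std::string* lang) {
  const std::string suffix(kSuffix);
  if (filename.size() <= suffix.size()) return false;
  const std::string::size_type stem = filename.size() - suffix.size();
  if (filename.compare(stem, suffix.size(), suffix) != 0) return false;
  *lang = filename.substr(0, stem);
  return true;
}

// Returns the sorted language codes found in tessdata_dir.
// An unreadable or missing directory yields an empty list.
std::vector<std::string> ListLanguages(const std::string& tessdata_dir) {
  std::vector<std::string> langs;
  DIR* dir = opendir(tessdata_dir.c_str());
  if (dir == nullptr) return langs;
  struct dirent* ent;
  while ((ent = readdir(dir)) != nullptr) {
    std::string lang;
    if (LanguageFromFilename(ent->d_name, &lang)) langs.push_back(lang);
  }
  closedir(dir);
  std::sort(langs.begin(), langs.end());
  return langs;
}
```

The suffix check replaces the pattern matching that glob() would do, which is the "few more lines" mentioned above. Sorting is added because readdir() returns entries in no particular order.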

Original comment by ozan...@gmail.com on 26 Nov 2012 at 2:16

GoogleCodeExporter commented 9 years ago
@ozancag:
Please create a new issue regarding your Android problem.
It would be great if you could submit a patch that works on Android...

Original comment by zde...@gmail.com on 26 Nov 2012 at 6:01

GoogleCodeExporter commented 9 years ago
Hi,

I opened a new issue 800 and posted a patch there.

Original comment by ozan...@gmail.com on 27 Nov 2012 at 1:24