DayBreakZhang / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Include a C wrapper in TessBaseAPI (baseapi.cpp/.h) #362

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Several applications can only use C function calls in their interfacing with 
Tesseract. But according to http://code.google.com/p/tesseract-ocr/wiki/ReadMe, 
the DLL interface will no longer be supported in Tesseract 3.0.

This request asks to include a C wrapper to a global recognize class object in 
the TessBaseAPI. Basically, it calls for moving the C wrapper from the obsolete 
tessdll.cpp/.h to inside the baseapi.cpp/.h.
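
For readers unfamiliar with the pattern, here is a minimal sketch of a C wrapper around a C++ class. The `Recognizer` class and the `Rec*` function names are hypothetical stand-ins, not Tesseract's classes and not the contents of tessdll.cpp. The sketch also uses an opaque handle per object rather than the single global object the request mentions, which is an assumption made only to keep the example short.

```cpp
#include <cassert>
#include <string>

// Hypothetical C++ class standing in for a recognizer such as TessBaseAPI.
class Recognizer {
 public:
  void SetLanguage(const std::string& lang) { lang_ = lang; }
  int LanguageLength() const { return static_cast<int>(lang_.size()); }

 private:
  std::string lang_;
};

// The C-visible surface: an opaque handle type plus plain functions.
// extern "C" turns off C++ name mangling so the symbols keep their plain names.
extern "C" {

// Declared but never defined; callers only ever hold a pointer to it.
typedef struct RecHandle RecHandle;

RecHandle* RecCreate(void) {
  return reinterpret_cast<RecHandle*>(new Recognizer());
}

void RecDelete(RecHandle* h) {
  delete reinterpret_cast<Recognizer*>(h);  // deleting a null handle is a no-op
}

void RecSetLanguage(RecHandle* h, const char* lang) {
  assert(h != nullptr && lang != nullptr);
  reinterpret_cast<Recognizer*>(h)->SetLanguage(lang);
}

int RecLanguageLength(const RecHandle* h) {
  assert(h != nullptr);
  return reinterpret_cast<const Recognizer*>(h)->LanguageLength();
}

}  // extern "C"
```

A caller in C, or in any language with a C foreign-function interface, would call `RecCreate`, pass the returned handle to the other functions, and finish with `RecDelete`.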

Original issue reported on code.google.com by nguyen...@gmail.com on 26 Sep 2010 at 4:20

GoogleCodeExporter commented 9 years ago
Delphi 7 and XP. :p

Original comment by rfwo...@gmail.com on 12 Jan 2011 at 12:25

GoogleCodeExporter commented 9 years ago
There is always a debate about C versus C++. For DLLs and static objects, C is 
always preferred because it is faster and more portable. The ideal way to 
develop software is to write it in C and then wrap it in C++. GTK has always 
followed this direction, while unfortunately Tesseract 3 is heading in another. 
So, to wrap it to other programming language, say python, one may inevitably be 
required to go through some tedious steps to wrap the c++ class in tesseract 3 
back to c library. Of these, the especially troublesome class is "STRING", 
derived from Apache.

rtwoolf> As I have used neither Delphi nor XP, it may take time for me to 
explore what is going on. I will keep you informed of the progress.

Original comment by FreeT...@gmail.com on 12 Jan 2011 at 12:48

GoogleCodeExporter commented 9 years ago
"So, to wrap it to other programming language, say python, one may inevitably 
be required to go through some tedious steps to wrap the c++ class in tesseract 
3 back to c library."
I'm a little confused. By the looks of it, you got everything working in 
Python, but you didn't recompile the DLL in C. 

Original comment by rfwo...@gmail.com on 12 Jan 2011 at 1:00

GoogleCodeExporter commented 9 years ago
I'm wondering whether it would be best to include SWIG wrappers for certain 
languages (Java, Python, C) with Tesseract itself or whether it would be better 
to maintain these separately. I suspect also having to support Java and Python 
would complicate the build process. Also, the C version requires a forked 
version of SWIG. Any thoughts, Jimmy?

Original comment by JerseyChewi@gmail.com on 12 Jan 2011 at 2:01

GoogleCodeExporter commented 9 years ago
Android already includes JNI bindings for tesseract; using SWIG for it would be 
an unnecessary duplication of effort. 

I really couldn't care less about Python, so I'm not going to register an 
opinion for or against, but I will note that the current trend seems to be to 
prefer using ctypes. 

The main reason why having a C binding is so appealing is that most other 
languages have facilities for using C libraries, which would reduce the effort 
in making language bindings all round, so it seemed like having a C wrapper was 
the best way in general. 

Original comment by joregan on 12 Jan 2011 at 2:25

GoogleCodeExporter commented 9 years ago
True, I'd forgotten about the Android port but I don't think this has been 
updated for Tesseract 3?

Original comment by JerseyChewi@gmail.com on 12 Jan 2011 at 3:04

GoogleCodeExporter commented 9 years ago
"The main reason why having a C binding is so appealing is that most other 
languages have facilities for using C libraries" - exactly!

Original comment by rfwo...@gmail.com on 12 Jan 2011 at 3:10

GoogleCodeExporter commented 9 years ago
I wasn't questioning whether we should have C bindings at all. It's just that 
SWIG supports quite a few languages and we'll get better results for less work 
if we target those directly instead of going via what it generates for C.

Original comment by JerseyChewi@gmail.com on 12 Jan 2011 at 3:17

GoogleCodeExporter commented 9 years ago
The Android port was updated to Tesseract 3 quite some time ago. Before 
Tesseract 3 was released, in fact.

Original comment by joregan on 12 Jan 2011 at 3:17

GoogleCodeExporter commented 9 years ago
Hello, I tried to install the SWIG package and I am not able to build it. There 
were a lot of errors during the build. The last error was:

publictypes.h:96: error: ‘tesseract::PSM_COUNT’ has a previous declaration as 
‘tesseract::PageSegMode tesseract::PSM_COUNT’
error: command 'gcc' failed with exit status 1

I was using the command "python setup.py build". Has any of you had a similar 
issue? Please advise. Thank you for your time.

Original comment by vijay111...@gmail.com on 19 Feb 2011 at 7:43

GoogleCodeExporter commented 9 years ago
svn changed. 

Try http://code.google.com/p/python-tesseract/downloads/list

Will look into it when I am free. 

Original comment by FreeT...@gmail.com on 20 Feb 2011 at 4:55

GoogleCodeExporter commented 9 years ago
I have tried the svn version of tesseract-ocr today vs the swig_svn.7z in 
http://code.google.com/p/python-tesseract/downloads/list. 

"python setup.py build" didn't yield any problems. For your information, I am 
using Ubuntu Maverick.

Original comment by FreeT...@gmail.com on 20 Feb 2011 at 5:18

GoogleCodeExporter commented 9 years ago
Hello everybody,

I have written a small C Wrapper (not complete but covers the most important 
part).
I would like to share it, and ideally it would be included in the project.

It is based on tesseract 3.01, so if there are any major changes in the C++ 
API, probably it would need some changes.

Comments are welcome!

Original comment by trop...@gmail.com on 2 Apr 2012 at 8:11

Attachments:

GoogleCodeExporter commented 9 years ago
BTW, those files just have to be added to the project alongside baseapi.h/.cpp

Original comment by trop...@gmail.com on 2 Apr 2012 at 8:25

GoogleCodeExporter commented 9 years ago
I put the files in api folder and included them in tesseract project (r639). 
However, the compiler generated > 100 errors, most of which are as follows:

Error 1  error C2143: syntax error : missing ';' before 'const'
         c:\projects\tesseract-3.0.1\api\capi.h  96  tesseract
Error 2  error C4430: missing type specifier - int assumed. Note: C++ does not support default-int
         c:\projects\tesseract-3.0.1\api\capi.h  96  tesseract
Error 3  error C2144: syntax error : 'void' should be preceded by ';'
         c:\projects\tesseract-3.0.1\api\capi.h  98  tesseract
Error 5  error C2086: 'int TESSDLL_API' : redefinition
         c:\projects\tesseract-3.0.1\api\capi.h  98  tesseract

Original comment by nguyen...@gmail.com on 3 Apr 2012 at 2:46

GoogleCodeExporter commented 9 years ago
Oh that's unfortunate. I made a last minute change without testing it.
Here's the new one.

Anyway, are you compiling the 3.01 project? In 3.01 (Windows) there is not yet 
a project for a dll, only the executable.
What I have done is just changed the "tesseract" project from "Executable" to 
"DLL" in the preferences and defined TESSDLL_EXPORTS also in the Project 
settings.
Additionally you probably want to change the output name from tesseract.exe to 
tesseract.dll or similar.

This is obviously a hack; in the long term you would want a separate project 
for the DLL. But I believe this is already done in SVN.

Original comment by trop...@gmail.com on 3 Apr 2012 at 7:12

Attachments:

GoogleCodeExporter commented 9 years ago
I've been able to build the DLL with both 3.01 and, with little change, 3.02 
alpha. However, I'm not sure if it was built correctly as my Java program 
cannot look up the exposed C methods. I'll come back to it when I have more 
time.

Meanwhile, can you attach a copy of your C DLL so we can try it out? Thanks.

Original comment by nguyen...@gmail.com on 4 Apr 2012 at 2:22

GoogleCodeExporter commented 9 years ago
You need to define TESSERACT_EXPORTS in the project properties, otherwise the C 
Functions are not exported.
I've attached a copy of my DLL. (One Debug and one Release)
Note that I've built the DLLs with VS 2005, so there is a bit of a dependency 
hell regarding MSVCRxx.dll. You need both MSVCR80.dll (for the VS 2005 
compiled objects) and MSVCR90.dll (for the VS 2008 objects that came with the 
source).
For the release version, you probably already have those; for the debug version 
it's a bit more difficult.
The files attached here are Release.Dynamic.

Original comment by trop...@gmail.com on 5 Apr 2012 at 8:28

Attachments:

GoogleCodeExporter commented 9 years ago
And now Debug:
(I omitted the PDB; it seems to be too large)

Original comment by trop...@gmail.com on 5 Apr 2012 at 8:30

Attachments:

GoogleCodeExporter commented 9 years ago
Thanks, Troplin. I tried all your files and suggestions but still nothing 
worked. After spending some time digging into the old tessdll source code of 
Tess 2.04, I made a single change to capi.h from:

#define TESSDLL_CALL __stdcall

to:

#define TESSDLL_CALL __cdecl

then I began to be able to call the exported C functions from my Java wrapper. 
I don't understand the significance of this change since I'm not a C/C++ 
developer.

In preliminary tests with Tess 3.02, the OCR output text appeared to be 
accurate with the test images.

Original comment by nguyen...@gmail.com on 6 Apr 2012 at 9:15

GoogleCodeExporter commented 9 years ago
It is just a different calling convention.
If you want to call the function from Java, you need to use the same calling 
convention as declared in the C API.

What technique are you using for your Java wrapper? JNI, JNA, or something 
else?
Usually you can declare the calling convention where you declare the function 
prototype.

__stdcall is the standard for all Microsoft Win32 APIs.
__cdecl is the standard for C programs.

Both have their advantages and disadvantages; I think it's a matter of taste 
which to use.
__cdecl causes fewer problems in combination with products from non-Microsoft 
vendors (e.g. MinGW, Java, etc.).
__stdcall is better suited if you are calling from MS products (like .NET, VB6).
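
To make the point about declaring the convention in the prototype concrete, here is a small sketch. The `DEMO_API` and `DEMO_CALL` macros and the `DemoAdd` function are hypothetical; they only imitate the role that TESSDLL_API and TESSDLL_CALL play in capi.h. On 32-bit Windows, `__stdcall` exports are normally decorated (for example `_Name@8`), which is one possible reason a lookup by plain name fails while `__cdecl` works. On non-Windows targets both macros expand to nothing.

```cpp
#include <cassert>

// Export only when building the DLL itself (hypothetical DEMO_EXPORTS switch,
// playing the role of the *_EXPORTS define mentioned in this thread).
#if defined(_WIN32) && defined(DEMO_EXPORTS)
#  define DEMO_API __declspec(dllexport)
#else
#  define DEMO_API
#endif

// One place to pick the calling convention for every exported function.
#if defined(_WIN32)
#  define DEMO_CALL __cdecl
#else
#  define DEMO_CALL
#endif

// The convention is part of the function's type, so it appears in the
// definition and in every prototype or function-pointer type that refers to it.
extern "C" DEMO_API int DEMO_CALL DemoAdd(int a, int b) { return a + b; }

// A function-pointer type carrying the same convention, as a caller that
// resolves the symbol at run time would have to declare it.
typedef int (DEMO_CALL *DemoAddFn)(int, int);

int CallThroughPointer(DemoAddFn fn, int a, int b) {
  assert(fn != nullptr);
  return fn(a, b);
}
```

If the caller's declaration and the library's definition disagree on the convention, the mismatch is not detected when the symbol is resolved at run time, which is why changing the single macro was enough to change the behaviour seen from Java.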

Original comment by trop...@gmail.com on 10 Apr 2012 at 9:34

GoogleCodeExporter commented 9 years ago
Which function in capi.h is called to do the OCR? TessBaseAPIProcessPages? Have 
you defined TESSDLL_INCLUDE_BASEAPI? 

Could you be kind enough to brief me on how to make your Java wrapper?

Original comment by FreeT...@gmail.com on 10 Apr 2012 at 1:25

GoogleCodeExporter commented 9 years ago
The functions in the C-API are the same as those in the C++ API.
Documentation is in baseapi.h.

I usually do the following sequence:
1. TessBaseAPICreate
2. TessBaseAPIInit3
3. TessBaseAPISetPageSegMode
4. TessBaseAPISetImage
5. TessBaseAPIRecognize
6. TessBaseAPIGetIterator
... (extract text from iterators)
X. TessBaseAPIDelete

Original comment by trop...@gmail.com on 10 Apr 2012 at 2:55

GoogleCodeExporter commented 9 years ago
python-tesseract for windows
http://python-tesseract.googlecode.com/files/python-tesseract-0.7.win32-py2.7.exe

Original comment by FreeT...@gmail.com on 11 Apr 2012 at 6:56

GoogleCodeExporter commented 9 years ago
A comment on TESSDLL_INCLUDE_BASEAPI and TESSDLL_INCLUDE_LEPTONICA:

TESSDLL_INCLUDE_BASEAPI:
Only define this if you are using the C API from C++.
If defined, all datatypes from the BaseAPI can be used. C and C++ API can be 
mixed freely.

TESSDLL_INCLUDE_LEPTONICA:
Enables the use of the Leptonica datatypes.
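
One way a switch like TESSDLL_INCLUDE_BASEAPI can work is sketched below. This is an assumption about the mechanism, not a copy of capi.h: `DEMO_INCLUDE_BASEAPI`, `demo::Engine` and the `DemoEngine*` functions are all hypothetical. With the macro defined, the handle type is an alias for the real C++ class, so C and C++ calls can be mixed on the same object; without it, the handle is an opaque struct that plain C can still compile.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical C++ class; in a real header this would come from an #include
// placed inside the #ifdef below so that C compilers never see it.
namespace demo {
class Engine {
 public:
  int version() const { return 3; }
};
}  // namespace demo

// Hypothetical switch: defined here to show the C++ view of the header.
#define DEMO_INCLUDE_BASEAPI 1

#ifdef DEMO_INCLUDE_BASEAPI
typedef demo::Engine DemoEngine;        // C++ callers see the real class
#else
typedef struct DemoEngine DemoEngine;   // C callers see only an opaque type
#endif

static_assert(std::is_same<DemoEngine, demo::Engine>::value,
              "with the switch defined, the handle is the real class");

extern "C" {
DemoEngine* DemoEngineCreate(void) { return new DemoEngine(); }
void DemoEngineDelete(DemoEngine* e) { delete e; }
int DemoEngineVersion(const DemoEngine* e) {
  assert(e != nullptr);
  return e->version();
}
}  // extern "C"
```

With the switch defined, `DemoEngineCreate()` returns a pointer on which both `e->version()` and `DemoEngineVersion(e)` are valid.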

Original comment by trop...@gmail.com on 12 Apr 2012 at 9:14

GoogleCodeExporter commented 9 years ago
Here is new version of the C Wrapper.
Changes:
- Use __cdecl instead of __stdcall, this seems to be more convenient.
- Includes all functions using Leptonica datatypes by default.
- Forward declaration of Leptonica datatypes instead of header file inclusion.
- Added missing "SetVariable" function
- Use array-delete (delete []) instead of scalar delete for strings and int 
arrays.
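
As background for the last item: memory obtained with `new[]` must be released with `delete[]`, and releasing it with scalar `delete` is undefined behaviour. A wrapper that hands such buffers to C callers therefore needs matching release functions. The sketch below uses hypothetical names (`DemoCopyText`, `DemoDeleteText`, `DemoMakeConfidences`, `DemoDeleteIntArray`); it does not reproduce the functions in the attached files.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

extern "C" {

// Returns a heap copy of `text`, allocated with new[].
char* DemoCopyText(const char* text) {
  assert(text != nullptr);
  const std::size_t n = std::strlen(text) + 1;  // include the terminating NUL
  char* out = new char[n];
  std::memcpy(out, text, n);
  return out;
}

// Matching release function: array form, because the buffer came from new[].
void DemoDeleteText(char* text) { delete[] text; }

// Returns `count` made-up values followed by a -1 terminator.
int* DemoMakeConfidences(int count) {
  assert(count >= 0);
  int* out = new int[count + 1];
  for (int i = 0; i < count; ++i) out[i] = 100 - i;
  out[count] = -1;
  return out;
}

// Matching release function for int arrays.
void DemoDeleteIntArray(int* arr) { delete[] arr; }

}  // extern "C"
```

Callers in other languages should hand the pointers back to these functions rather than freeing them with their own runtime's allocator.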

Original comment by trop...@gmail.com on 12 Apr 2012 at 1:13

Attachments:

GoogleCodeExporter commented 9 years ago
I forgot to thank trop for his good work. Better late than never.

Thanks a lot.

Original comment by FreeT...@gmail.com on 12 Apr 2012 at 4:41

GoogleCodeExporter commented 9 years ago
Feedback anyone?
Are there any chances to integrate it into the main repo?
Any objections regarding the names or the coding style?

Original comment by trop...@gmail.com on 16 Apr 2012 at 7:55

GoogleCodeExporter commented 9 years ago
Troplin, will this C wrapper also work on Linux?

I'm developing a JNA wrapper based on this C API (http://tess4j.sf.net). I'm 
close to releasing a beta once I figure out why recognizing a rectangle cuts 
off some words at the right edge of the image.

Other than that, the capi.cpp/.h looks fine. IMHO, for it to be included in the 
current baseline, it needs to be updated to Tesseract 3.02 API, which changed a 
little bit from 3.01. Additionally, a short demo program (similar to dlltest in 
2.04) to test the C API would be nice to have.

Original comment by nguyen...@gmail.com on 16 Apr 2012 at 1:14

GoogleCodeExporter commented 9 years ago
Great!
I can do the update to the current SVN, that's no problem. Also the demo 
program if I find some free time. However it will be difficult to cover all 
functionality, the new C API is much bigger than the old one.

Original comment by trop...@gmail.com on 16 Apr 2012 at 2:56

GoogleCodeExporter commented 9 years ago
Oh and yes, it should also work on Linux.

Original comment by trop...@gmail.com on 16 Apr 2012 at 2:56

GoogleCodeExporter commented 9 years ago
Ok, I updated the files to SVN HEAD.

Just as a side note, I find those changes quite disruptive for such a small 
version increase!

Original comment by trop...@gmail.com on 17 Apr 2012 at 9:29

Attachments:

GoogleCodeExporter commented 9 years ago
Nguyen,
you are using the 3.02 SVN version for your JNA wrapper. Can you recommend 
that over the 3.01 release?
Using 3.02 SVN would be much easier for me, because there is already a 
VS-Project for the DLL.

Original comment by trop...@gmail.com on 24 Apr 2012 at 9:45

GoogleCodeExporter commented 9 years ago
I only recently started working on the Java wrapper after you made the C 
wrapper available. Naturally, I use the current version of Tesseract, which 
happens to be 3.0.2 and which I know has undergone significant file 
restructuring recently to make it more maintainable and easier to build. I also 
understand that 3.0.2 was planned for release together with the upcoming 
release of a popular Linux distro. Let's hope that the C API is also included.

Original comment by nguyen...@gmail.com on 26 Apr 2012 at 12:00

GoogleCodeExporter commented 9 years ago
Well, for me as an "external" person, it is natural to work with the current 
stable version, which is 3.01.

Regarding the inclusion of the C API, I think just hoping is not enough.
Do you have some more concrete information about the release date?
Who is responsible for the decision? Maybe I can convince him/her.

Original comment by trop...@gmail.com on 26 Apr 2012 at 8:26

GoogleCodeExporter commented 9 years ago
Zdenko just asked the same question in the Forum.

http://groups.google.com/group/tesseract-ocr/t/ef8c6819fc5385f

Original comment by nguyen...@gmail.com on 27 Apr 2012 at 4:45

GoogleCodeExporter commented 9 years ago
So Ray Smith is the one?

BTW, I just ported the tesseractmain.cpp to C (using the C API), is that 
sufficient as an example?

Original comment by trop...@gmail.com on 27 Apr 2012 at 8:17

GoogleCodeExporter commented 9 years ago
And again a new version of the C API.
Changes:
- Converted the TessBaseAPIProcessPage and TessBaseAPIProcessPages functions to 
a pure C interface.
- Renamed TessBaseSetInputName to TessBaseAPISetInputName and 
TessBaseSetOutputName to TessBaseAPISetOutputName

Original comment by trop...@gmail.com on 27 Apr 2012 at 8:33

Attachments:

GoogleCodeExporter commented 9 years ago
And here's the C sample. You need the latest capi.h/.cpp for that.

Original comment by trop...@gmail.com on 27 Apr 2012 at 10:17

Attachments:

GoogleCodeExporter commented 9 years ago
I *strongly* suggest you move this entire discussion over to a new thread on 
the tesseract-dev group [1]. Not everyone follows all the issues here (and 
while I read earlier entries in this particular issue the auto-email system 
didn't seem to inform me of updates).

Creating an additional "official" API is a serious business. It would have to 
be vetted by more than just the people who happen to follow this issue. 

[1] http://groups.google.com/group/tesseract-dev

Original comment by tomp2...@gmail.com on 27 Apr 2012 at 8:27

GoogleCodeExporter commented 9 years ago
I improved a bit further by getting rid of references to STRING and FILE native 
types to make the C API more amenable to other languages, such as Java, which 
do not have direct equivalents.

I found that the raw image data type would still occasionally crash the 
Tesseract engine. I want to try the approach of converting the raw image to Pix 
to see if I can get more stability.

TessBaseAPISetImage(TessBaseAPI* handle, const unsigned char* imagedata, int 
width, int height, int bytes_per_pixel, int bytes_per_line)
{
    //handle->SetImage(imagedata, width, height, bytes_per_pixel, bytes_per_line);
    Pix *pix = convert2Pix(imagedata);
    TessBaseAPISetImage2(handle, pix);
}

but I'm not sure how to implement the convert2Pix method or whether the rest 
of the parameters are needed in the Pix conversion.

Original comment by nguyen...@gmail.com on 5 May 2012 at 6:00

Attachments:

GoogleCodeExporter commented 9 years ago
No promises but the code to convert to Pix might be found in 
tesseract-android-tools.
http://code.google.com/p/tesseract-android-tools/source/browse/

Original comment by JerseyChewi@gmail.com on 5 May 2012 at 7:04

GoogleCodeExporter commented 9 years ago
Nguyen,
I can't think why the raw data should crash the Tesseract engine, but the fact 
that you ask if those additional parameters are necessary makes me a bit 
suspicious. Yes, they are necessary and it is not possible to convert the data 
to PIX without them, because without them you cannot possibly know how to 
interpret the data.
Did you really pass those correctly to the method in the cases that crash?

Original comment by trop...@gmail.com on 6 May 2012 at 4:06

GoogleCodeExporter commented 9 years ago
BTW, conversion from raw data to Pix should be straightforward:

TessBaseAPISetImage(TessBaseAPI* handle, const unsigned char* imagedata, int 
width, int height, int bytes_per_pixel, int bytes_per_line)
{
  Pix pix;
  memset(&pix, 0, sizeof(pix)); // Set all members to 0
  pix.data = imagedata;
  pix.w = width;
  pix.h = height;
  pix.d = bytes_per_pixel == 0 ? 1 : bytes_per_pixel * 8;
  if (bytes_per_line % 4 != 0)
    return; // ERROR: Lines must be word-aligned
  pix.wpl = bytes_per_line / 4;
  TessBaseAPISetImage2(handle, pix);
}

(not tested!)

As you can see, all parameters have some equivalent field in the Pix data 
structure. In the end, inside tesseract, the data from Pix is just converted 
back to the "raw" format. There's no magic going on with Pix.
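
The field arithmetic above can be checked in isolation, without Leptonica. The sketch below is only that arithmetic: `RawGeometry` and `ComputeRawGeometry` are hypothetical names, and the check that a row actually fits in `bytes_per_line` is an extra sanity test not present in the snippet above. It does not address byte order or pixel packing inside the buffer.

```cpp
#include <cassert>

// Hypothetical holder for the two derived Pix-style fields.
struct RawGeometry {
  int depth;  // bits per pixel
  int wpl;    // 32-bit words per line
};

// Derives depth and words-per-line from the raw-buffer parameters.
// bytes_per_pixel == 0 denotes a 1 bit-per-pixel image, as in the snippet above.
// Returns false if the parameters cannot describe a word-aligned image.
bool ComputeRawGeometry(int width, int bytes_per_pixel, int bytes_per_line,
                        RawGeometry* out) {
  assert(out != nullptr);
  if (width <= 0 || bytes_per_pixel < 0 || bytes_per_line <= 0) return false;
  if (bytes_per_line % 4 != 0) return false;  // lines must be word-aligned

  // Extra check (an addition of this sketch): one row must fit in the stride.
  const int min_bytes =
      bytes_per_pixel == 0 ? (width + 7) / 8 : width * bytes_per_pixel;
  if (min_bytes > bytes_per_line) return false;

  out->depth = bytes_per_pixel == 0 ? 1 : bytes_per_pixel * 8;
  out->wpl = bytes_per_line / 4;
  return true;
}
```

For example, a 640-pixel-wide 8-bit greyscale image with a 640-byte stride gives a depth of 8 and 160 words per line, while a stride of 1921 bytes is rejected because it is not a multiple of 4.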

Original comment by trop...@gmail.com on 7 May 2012 at 7:23

GoogleCodeExporter commented 9 years ago
of course it should be:

TessBaseAPISetImage2(handle, &pix);

Original comment by trop...@gmail.com on 7 May 2012 at 7:25

GoogleCodeExporter commented 9 years ago
I tried but could not get the code to compile. Finally, the crashing stopped 
when I switched the .traineddata file from 3.0 or 3.01 to one compatible with 
Tess 3.02. Thanks for going to all the trouble. The last uploaded version 
is still good and clean (not polluted with hacks). I hope the project owner 
will soon accept it and incorporate it into the main trunk.

And lastly, how do I get the files compiled to generate libtesseract302.so on 
Linux? Has anyone tried?

Original comment by nguyen...@gmail.com on 19 Jun 2012 at 3:18

GoogleCodeExporter commented 9 years ago
I'm currently trying to compile it on my Mac, but I haven't quite grokked the 
makefile. Could you share yours with us? :)

Original comment by janpaulb...@googlemail.com on 28 Jun 2012 at 10:59