Sicos1977 / TesseractOCR

A .net library to work with Google's Tesseract

Error on engine creation on .NET 8 in Linux environment (docker container) #62

Open francisco-coelho-external opened 1 month ago

francisco-coelho-external commented 1 month ago

Hi,

I have an error on engine creation that didn't happen on .NET 7 but occurs on .NET 8. I'm using the latest version 5.3.5. This happens when the app runs inside a Linux-based container (Ubuntu). On Windows machines, it all goes well.

The structure of the project (clipped): [image]

The error: Failed to find library 'libleptonica-1.83.1.dll.so' for platform x64

The code that raises the error: using var engine = new Engine(_tessdata, language, EngineMode.LstmOnly);

The stack trace:
   at TesseractOCR.InteropDotNet.LibraryLoader.LoadLibrary(String fileName, String platformName)
   at InteropRuntimeImplementer.LeptonicaApiSignaturesInstance.LeptonicaApiSignaturesImplementation..ctor(LibraryLoader loader)
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InvokeDirectByRefWithFewArgs(Object obj, Span`1 copyOfArgs, BindingFlags invokeAttr)

The .so files are being copied on the build and, like I said, it all worked before I upgraded to .NET 8.

I hope this helps you help me.

Best regards, Francisco Coelho

Sicos1977 commented 1 month ago

Try setting the log level to debug and see where it tries to find the needed file.

francisco-coelho-external commented 1 month ago

Yes, it tries to load the file but fails even though the file is there.

2024-10-21 07:35:14.6554 INFO Program [Current OS is Unix ]
2024-10-21 07:35:14.6554 INFO Program [Current platform is x64 ]
2024-10-21 07:35:14.6554 DEBUG Program [Trying to load file from 'app/bin/Debug/net8.0/x64/libleptonica-1.83.1.dll.so' ]
2024-10-21 07:35:14.6554 DEBUG Program [Trying to load file from '/app/bin/Debug/net8.0/x64/libleptonica-1.83.1.dll.so' ]
2024-10-21 07:35:14.6554 INFO Program [Trying to load native library '/app/bin/Debug/net8.0/x64/libleptonica-1.83.1.dll.so' ]
2024-10-21 07:35:14.6785 ERROR Program [Failed to load native library '/app/bin/Debug/net8.0/x64/libleptonica-1.83.1.dll.so', set logging to debug level and check logging ]
2024-10-21 07:35:14.6785 DEBUG Program [Trying to load file from '/app/bin/Debug/net8.0/x64/libleptonica-1.83.1.dll.so' ]
2024-10-21 07:35:14.6785 INFO Program [Trying to load native library '/app/bin/Debug/net8.0/x64/libleptonica-1.83.1.dll.so' ]
2024-10-21 07:35:14.6905 ERROR Program [Failed to load native library '/app/bin/Debug/net8.0/x64/libleptonica-1.83.1.dll.so', set logging to debug level and check logging ]
2024-10-21 07:35:14.6905 DEBUG Program [Trying to load file from '/app/x64/libleptonica-1.83.1.dll.so' ]
2024-10-21 07:35:14.6905 INFO Program [Trying to load native library '/app/x64/libleptonica-1.83.1.dll.so' ]
Exception thrown: 'System.DllNotFoundException' in TesseractOCR.dll
2024-10-21 07:35:14.7039 ERROR Program [Failed to load native library '/app/x64/libleptonica-1.83.1.dll.so', set logging to debug level and check logging ]
2024-10-21 07:35:14.7043 INFO Program [Custom search path is not defined, skipping. ]
2024-10-21 07:35:14.7043 ERROR Program [Failed to find library 'libleptonica-1.83.1.dll.so' for platform x64 ]
Exception thrown: 'System.Reflection.TargetInvocationException' in System.Private.CoreLib.dll
An unhandled exception of type 'System.Reflection.TargetInvocationException' occurred in System.Private.CoreLib.dll: 'Exception has been thrown by the target of an invocation.'
Stack trace:
 >   at System.Reflection.MethodBaseInvoker.InvokeDirectByRefWithFewArgs(Object obj, Span`1 copyOfArgs, BindingFlags invokeAttr)
 >   at System.Reflection.MethodBaseInvoker.InvokeWithOneArg(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
 >   at System.RuntimeType.CreateInstanceImpl(BindingFlags bindingAttr, Binder binder, Object[] args, CultureInfo culture)
 >   at TesseractOCR.InteropDotNet.InteropRuntimeImplementer.CreateInstance[T]()
 >   at TesseractOCR.Interop.LeptonicaApi.Initialize()
 >   at TesseractOCR.Interop.TessApi.Initialize()
 >   at TesseractOCR.Interop.TessApi.get_Native()
 >   at TesseractOCR.Engine.Initialize(String dataPath, String languages, EngineMode engineMode, IEnumerable`1 configFiles, IDictionary`2 initialValues, Boolean setOnlyNonDebugVariables, ILogger logger)
 >   at TesseractOCR.Engine.Initialize(String dataPath, List`1 languages, EngineMode engineMode, IEnumerable`1 configFiles, IDictionary`2 initialValues, Boolean setOnlyNonDebugVariables, ILogger logger)
 >   at TesseractOCR.Engine..ctor(String dataPath, Language language, EngineMode engineMode, IEnumerable`1 configFiles, IDictionary`2 initialOptions, Boolean setOnlyNonDebugVariables, ILogger logger)

Sicos1977 commented 1 month ago

I don't know how I can solve this for you. Are you sure that the files are in one of the folders where TesseractOCR is looking and that they are compiled for the correct architecture? By this I mean x64 when your app is 64-bit and x86 when it is 32-bit.
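
One way to check both points from inside the container is sketched below. It is only a rough diagnostic and not part of TesseractOCR: `elf_class` is a hypothetical helper that reads the ELF header with `od`, and the path in the usage comment is the one from the log above. A Windows DLL that was merely renamed to `.so` would be reported as not being an ELF file at all, and `ldd` lists any dependencies the dynamic loader cannot resolve.

```shell
#!/bin/sh
# elf_class FILE: print "64-bit", "32-bit", or "not ELF" based on the file header.
# (hypothetical helper name; uses only od and tr)
elf_class() {
    # The first four bytes of every ELF file are 0x7f 'E' 'L' 'F'.
    magic=$(od -An -tx1 -N4 "$1" 2>/dev/null | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not ELF"
        return 1
    fi
    # Byte 5 of the ELF header (EI_CLASS) is 1 for 32-bit and 2 for 64-bit objects.
    class=$(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n')
    case "$class" in
        1) echo "32-bit" ;;
        2) echo "64-bit" ;;
        *) echo "unknown ELF class"; return 1 ;;
    esac
}

# Usage inside the container (path taken from the log above):
#   elf_class /app/x64/libleptonica-1.83.1.dll.so
#   ldd /app/x64/libleptonica-1.83.1.dll.so | grep 'not found'
```

Note that the ELF class only distinguishes 32-bit from 64-bit objects; it does not say which CPU architecture or distro the file was built for.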

francisco-coelho-external commented 1 month ago

This issue has been resolved. The error was that the library was not being recognized as valid. Inside the container, which is a Linux environment, the .so files are required, not the .dlls, and I was having trouble getting the .so file right. Fortunately, one of my colleagues, who has a better understanding of how to build from source on a Linux machine, came up with a working .so file. If I may ask for something, it would be that the .so files be delivered alongside the DLLs; this would make integration in Linux environments much easier.

Thanks for your support!

Sicos1977 commented 1 month ago

I'm on a Windows machine, and the problem is that I don't know how to compile the DLLs for Linux. If you could ask your colleagues to supply me with the .so files, then I will include them in the NuGet package.

Right now I'm making a new version that is compiled against this release: https://github.com/tesseract-ocr/tesseract/releases/tag/5.4.1

So if your colleague could supply me with that version compiled for Linux, then I will include it.

Please try to compile versions for x86 and x64. I know that x86 code is not used that much anymore, but I also try to make it work in case someone is stuck on a 32-bit machine for whatever reason.

Leftyx commented 1 month ago

I have tried fetching the .so libraries from Ubuntu 22.04 (Jammy) and copying them into the container, but the issue remains. I don't think those two libs, libleptonica-1.85.0.dll.so and libtesseract54.dll.so, are enough to fix the issue. There are other dependencies that must be installed.

Below you can find the Docker commands I use to download the libraries for v. 5.4.1 from the repo and use Tesseract in a web API.

# Switch to root user to install packages
USER root

# Repo with libraries for version 5.4.1: https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr5

RUN apt-get update && \
    apt-get install -y software-properties-common && \
    add-apt-repository -y ppa:alex-p/tesseract-ocr5 && \
    apt-get update && \
    apt-get install -y libleptonica-dev libtesseract-dev && \
    apt-get purge -y software-properties-common && \
    rm -rf /etc/apt/sources.list.d/alex-p-ubuntu-tesseract-ocr5-*.list && \
    apt-get autoremove -y && \
    rm -rf /var/lib/apt/lists/*

RUN ln -s /usr/lib/x86_64-linux-gnu/libdl.so.2 /usr/lib/x86_64-linux-gnu/libdl.so

WORKDIR /app/x64

RUN ln -s /usr/lib/x86_64-linux-gnu/liblept.so.5 /app/x64/libleptonica-1.85.0.dll.so
RUN ln -s /usr/lib/x86_64-linux-gnu/libtesseract.so.5 /app/x64/libtesseract54.dll.so

# Switch back to the non-root user
USER app

The commands install software-properties-common, which is needed for the following steps and is removed afterwards. We add the repository ppa:alex-p/tesseract-ocr5 and install the packages libleptonica-dev and libtesseract-dev. The rest is cleanup to remove unnecessary packages and keep only what is needed.

At the very bottom we create a few symbolic links to leptonica and tesseract, using the file names that TesseractOCR looks for.

Without this symlink we get another error for libdl.so, as tesseract (or leptonica?) expects it:

RUN ln -s /usr/lib/x86_64-linux-gnu/libdl.so.2 /usr/lib/x86_64-linux-gnu/libdl.so
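
A quick way to confirm that the links resolve is sketched below. `check_link` is a hypothetical helper, not something TesseractOCR provides. It only reports whether a path exists after following symlinks, so a dangling link (for example when the package installs the library under a different file name) is reported as broken.

```shell
#!/bin/sh
# check_link PATH: report whether PATH exists after following symlinks.
# (hypothetical helper name)
check_link() {
    # [ -e ] follows symlinks, so a dangling link counts as missing.
    if [ -e "$1" ]; then
        echo "ok: $1"
    else
        echo "broken or missing: $1"
        return 1
    fi
}

# Usage inside the container, with the link names from the Dockerfile above:
#   check_link /app/x64/libleptonica-1.85.0.dll.so
#   check_link /app/x64/libtesseract54.dll.so
#   check_link /usr/lib/x86_64-linux-gnu/libdl.so
```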

Byihta commented 3 weeks ago

Hello, I am one of @francisco-coelho-external 's colleagues.

We run TesseractOCR in a container with a Debian image built from a Dockerfile/YAML similar to @Leftyx's in the way it installs the dependencies and links them. Since we updated our codebase to .NET 8, however, we had to upgrade TesseractOCR as well. The problem is that Debian's repositories (and Ubuntu's, for that matter) have yet to upgrade their libleptonica packages to 1.83.1/1.84 in their stable sources. We tested library packages from different Linux distro repositories, as well as building the libraries from source on different distros. After some research and testing we came to realize that .so files aren't necessarily portable between Linux distros. For example, the libleptonica 1.83.1 .so built on Ubuntu wouldn't work on Debian, whereas the same build procedure carried out on Debian would. The same goes for libleptonica packages from Debian's repository versus Ubuntu's repository: only the ones from Debian's repository worked on our Debian machine when running TesseractOCR.

I believe this is due to differences in the packages of the different dependencies between different distros.

As such, sharing our .so files with you will not help, as different Linux distros require different .so files.
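
To see which dependencies a given build expects and whether the current distro provides them, something like the following could be run in each image. This is a sketch, assuming `readelf` (binutils) and `ldconfig` are available; `needed_sonames`, `installed_sonames` and `missing_from` are hypothetical helper names, and the library path in the usage comment is only an example.

```shell
#!/bin/sh
# needed_sonames FILE: print the shared libraries FILE declares as DT_NEEDED.
needed_sonames() {
    readelf -d "$1" | sed -n 's/.*(NEEDED).*\[\(.*\)\].*/\1/p'
}

# installed_sonames: print the library names known to the dynamic linker cache.
# The first line of `ldconfig -p` is a summary header, so it is skipped.
installed_sonames() {
    ldconfig -p | awk 'NR > 1 { print $1 }'
}

# missing_from LIST: read names on stdin, print those not listed in file LIST.
missing_from() {
    grep -Fxv -f "$1" || true
}

# Usage (example path, adjust to the library being checked):
#   installed_sonames > /tmp/installed.txt
#   needed_sonames /app/x64/libleptonica-1.83.1.dll.so | missing_from /tmp/installed.txt
```

Any name printed by the last command is a dependency the build expects but the image does not provide under that name.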

On the other hand, I believe it would make more sense for TesseractOCR to include the standard Linux library installation paths in its search path for the relevant .so files (see https://www.baeldung.com/linux/check-shared-library-installed#standard-library-paths for example). If the LoadLibrary function were to include /usr/lib/x86_64-linux-gnu/ in its search path for the .so files, then in the case of libleptonica it could look for /usr/lib/x86_64-linux-gnu/libleptonica.so.6 or even /usr/lib/x86_64-linux-gnu/libleptonica.so (notice that the .so file names differ from the DLL names, and so does the version numbering). A Linux user of TesseractOCR would then simply need to install the correct versions of the dependency packages (which could be listed in a README file, for example), and you would not need to include the .so files in the NuGet package, so it's less work for you too. The user would, on the other hand, be responsible for installing the correct packages, or for building the dependencies themselves if their distro has yet to update its packages and they want to use the most recent TesseractOCR version, but I believe this is the usual way in Linux.
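
The lookup proposed here can be illustrated with a small shell sketch. It is not TesseractOCR's LoadLibrary implementation: `find_lib` is a hypothetical helper that returns the first match for a file name in a list of directories, and the directory list in the usage comment is an assumption based on the paths mentioned above.

```shell
#!/bin/sh
# find_lib NAME DIR...: print the first DIR/NAME that exists, or fail.
# (hypothetical helper name)
find_lib() {
    name=$1
    shift
    for dir in "$@"; do
        if [ -e "$dir/$name" ]; then
            echo "$dir/$name"
            return 0
        fi
    done
    return 1
}

# Usage: try the application folder first, then the distro's library directories.
#   find_lib libleptonica.so.6 /app/x64 /usr/lib/x86_64-linux-gnu /usr/lib
```

Searching the application folder first would keep the current behavior for users who ship their own .so files, while the later directories would pick up libraries installed through the distro's package manager.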