CCExtractor / ccextractor

CCExtractor - Official version maintained by the core team
https://www.ccextractor.org
GNU General Public License v2.0

Sad situation with Windows + OCR #1254

Open cfsmp3 opened 4 years ago

cfsmp3 commented 4 years ago

I hit this while testing a previous ticket regarding hardsubx on Windows, on master, running this exact version, just compiled:

CCExtractor 0.88, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
CCExtractor detailed version info
        Version: 0.88
        Git commit: Unknown
        Compilation date: Unknown
        File SHA256: 0a40241ddd609f5272f063d25e0f2c29c2192187aabd2592da98909463b88541
Libraries used by CCExtractor
        Tesseract Version: 4.00.00dev
        Leptonica Version: leptonica-1.74 (Dec 31 2016, 12:28:35) [MSC v.1900 LIB Debug x86]
        libGPAC Version: 0.7.2-DEV
        zlib: 1.2.11
        utf8proc Version: 2.4.0
        protobuf-c Version: 1.3.1
        libpng Version: 1.6.35
        FreeType
        libhash
        nuklear
        libzvbi

First, the reporting about eng.traineddata, as usual, couldn't suck more.

CCExtractor 0.88, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
HardsubX (Hard Subtitle Extractor) - Burned-in subtitle extraction subsystem
eng.traineddata not found! No Switching Possible

Seriously, would it kill us to tell the user WHERE we expect that file to be present?

OK, so since I didn't remember how this worked at all, I started looking into the code a bit. We do look at TESSDATA_PREFIX and, among other places, /usr/share. Wait, what? This is Windows! Why are we looking there? Also, I see lots of / as path separator, but Windows uses \. Is this portable at all?
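Both the separator question and the trailing-separator pitfall disappear if the path is joined with a path API instead of concatenated as a string. A minimal sketch of the idea, in Python for brevity (CCExtractor itself is C, so this is an illustration only; `traineddata_path` is a made-up name, and the `tessdata` subdirectory layout is taken from the directory listing in this report):

```python
from pathlib import PureWindowsPath


def traineddata_path(prefix: str, lang: str = "eng") -> str:
    """Join prefix, 'tessdata' and '<lang>.traineddata' using Windows path rules."""
    # PureWindowsPath accepts both separators and ignores a trailing one,
    # so the caller does not need to care how the prefix was written.
    return str(PureWindowsPath(prefix) / "tessdata" / f"{lang}.traineddata")


# All three spellings of the prefix give the same result.
print(traineddata_path("C:\\Downloads"))    # C:\Downloads\tessdata\eng.traineddata
print(traineddata_path("C:\\Downloads\\"))  # C:\Downloads\tessdata\eng.traineddata
print(traineddata_path("C:/Downloads/"))    # C:\Downloads\tessdata\eng.traineddata
```

In C the equivalent would be to check the last character of the prefix before appending, and to append a separator only when one is missing.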

OK, so I set the env variable:

set TESSDATA_PREFIX=C:\Downloads

C:\Users\Carlos\source\repos\CCExtractor\ccextractor\windows\Debug-Full>dir c:\Downloads\tessdata
 Volume in drive C has no label.
 Volume Serial Number is 3A55-62AE

 Directory of c:\Downloads\tessdata

12-Apr-20  14:47    <DIR>          .
12-Apr-20  14:47    <DIR>          ..
12-Apr-20  14:46        23,466,654 eng.traineddata
               1 File(s)     23,466,654 bytes
               2 Dir(s)  92,672,598,016 bytes free

C:\Users\Carlos\source\repos\CCExtractor\ccextractor\windows\Debug-Full>ccextractorwinfull.exe -hardsubx c:\Downloads\ITV1.mp4
CCExtractor 0.88, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
HardsubX (Hard Subtitle Extractor) - Burned-in subtitle extraction subsystem
eng.traineddata not found! No Switching Possible

Still not working. The problem now is that I'm missing a \ at the end of the env variable.

OK, so let's set it correctly:

C:\Users\Carlos\source\repos\CCExtractor\ccextractor\windows\Debug-Full>set TESSDATA_PREFIX=C:\Downloads\

C:\Users\Carlos\source\repos\CCExtractor\ccextractor\windows\Debug-Full>ccextractorwinfull.exe -hardsubx c:\Downloads\ITV1.mp4
CCExtractor 0.88, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
HardsubX (Hard Subtitle Extractor) - Burned-in subtitle extraction subsystem
lstm_recognizer_->DeSerialize(tessdata_manager.swap(), &fp):Error:Assert failed:in file C:\Users\HOME\.cppan\storage\src\42\9e\ba91\ccmain\tessedit.cpp, line 202

So now apparently it starts at least, but then it crashes.

We just need to work on OCR + Windows.

In my opinion, at the very least:

1) Proper information to the user, including which paths are being searched. And where do the errors come from? Is it tesseract, or us? Are we bailing out before even giving tesseract a try?
2) Update tesseract 4 to last version OR downgrade to 3. But using 4.00.00 is ridiculous! It's buggy.
3) Check if there's an officially compiled binary we can use. I remember we did our own thing a long time ago. Still needed?

Labelling HARD because we seem to be unable to fix it once and for all.

cc: @ShraxO1

NilsIrl commented 4 years ago

#1170

NilsIrl commented 4 years ago

Jokes aside, try PR #1170 on Windows; it might solve the problem. Also play around with deleting the data directory. You might come across a message from tesseract that says where it looked for the data and didn't find it. If this message is not to your liking, it can be modified or suppressed using the dup2 syscall.

EDIT: tesseract may have an API which wouldn't require the use of dup2.
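A rough sketch of the dup2 approach, in Python for brevity (CCExtractor is C, where the same calls exist as dup/dup2 on POSIX and _dup/_dup2 in the Microsoft CRT). The helper name `capture_stderr_fd` is made up, and the sketch assumes the library writes its diagnostics to file descriptor 2:

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def capture_stderr_fd():
    """Temporarily point fd 2 at a temp file and collect whatever gets written there."""
    captured = {"text": ""}
    saved_fd = os.dup(2)  # keep a handle on the real stderr so it can be restored
    with tempfile.TemporaryFile(mode="w+b") as tmp:
        os.dup2(tmp.fileno(), 2)  # from now on fd 2 writes into the temp file
        try:
            yield captured
        finally:
            os.dup2(saved_fd, 2)  # put the original stderr back
            os.close(saved_fd)
            tmp.seek(0)
            captured["text"] = tmp.read().decode("utf-8", errors="replace")


# Stand-in for a native library writing straight to fd 2.
with capture_stderr_fd() as cap:
    os.write(2, b"Error opening data file\n")

# The message is now available to rewrite, extend or drop.
print("captured:", cap["text"].strip())
```

The captured text can then be reprinted together with the searched paths, or discarded entirely.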

apovalyaev commented 4 years ago

The issue might be the result of an improperly built solution; it should not appear if the solution is built properly. I checked with VS2015 and VS2019 (using the default SDKs) and did not run into this issue.

apovalyaev commented 4 years ago

We just need to work on OCR + Windows.

In my opinion, at the very least:

1. Proper information to the user, including which paths are being searched. And where do the errors come from? Is it tesseract, or us? Are we bailing out before even giving tesseract a try?

2. Update tesseract 4 to last version OR downgrade to 3. But using 4.00.00 is ridiculous! It's buggy.

3. Check if there's an officially compiled binary we can use. I remember we did our own thing a long time ago. Still needed?

Labelling HARD because we seem to be unable to fix it once and for all.

cc: @ShraxO1

Now, after "Update VS project build settings", we can use the following steps (which automatically take the latest version of tesseract; as of now it is tesseract-4.1.1).

Build steps which use the latest version of Tesseract:

1) Clone the repository https://github.com/CCExtractor/ccextractor

2) Set up vcpkg:

2.1) Clone and bootstrap it:

    git clone https://github.com/Microsoft/vcpkg.git
    cd vcpkg
    PS> .\bootstrap-vcpkg.bat

2.2) Modify vcpkg/triplets/x86-windows.cmake:

    set(VCPKG_CRT_LINKAGE static)
    set(VCPKG_LIBRARY_LINKAGE static)

3) Install the last verified version of tesseract (NOTE: now it is tesseract-4.1.1):

    vcpkg install tesseract:x86-windows
    vcpkg integrate install

4) Build the solution.

So, further steps: 1) It makes sense to update the auto-build scripts so that the auto-build also takes the last verified version of tesseract; 2) Something else?

cfsmp3 commented 4 years ago

I'd say there's something missing here. I followed your instructions, no errors (good), but the binary is still using tesseract-4.00dev, which makes sense - why would it pick any other version if that's the one we have inside the project?

apovalyaev commented 4 years ago

Let's check if we are on the same page:

1) As I understand it, only the ffmpeg libraries are precompiled inside the project (directory windows/libs/lib/). So a freshly cloned ccextractor should not build unless there is already some other "copy" of the tesseract library that was not installed through vcpkg. If the project compiled fine before you issued the "vcpkg install tesseract:x86-windows" command, it means you already had some other copy of tesseract installed. It makes sense to remove it.

2) The other possible reason is the particular version of tesseract that vcpkg has installed. You can use the command "vcpkg list" to check which version of tesseract is installed on your PC.

canihavesomecoffee commented 4 years ago

@apovalyaev Tesseract is included in those "cppan" dependencies; refer to https://github.com/CCExtractor/ccextractor/tree/master/windows/libs/lib/release-lib

Refer to https://github.com/CCExtractor/ccextractor/pull/592 for the PR, and maybe @Izaron could explain a bit if needed?

apovalyaev commented 4 years ago

@apovalyaev Tesseract is included in those "cppan" dependencies; refer to https://github.com/CCExtractor/ccextractor/tree/master/windows/libs/lib/release-lib

Refer to #592 for the PR, and maybe @Izaron could explain a bit if needed?

I've taken a look at #592 and can see that tesseract was manually compiled. 1) #592 was closed at the beginning of 2017, so it might be a little outdated; 2) On the other hand, it looks like there is a bug in the cppan dependencies (which was discovered while running with the "-hardsubx" option).

I can see two ways: 1) Remove the "cppan" dependencies to see how it will work with vcpkg; 2) Rebuild the "cppan" libraries with a newer version of tesseract. Are there others? What would be the best fit?

cfsmp3 commented 4 years ago

I can see two ways:

  1. Remove "cppan" dependencies to see how it will work within vcpkg;
  2. Rebuild "cppan" libraries with a newer version of tesseract; Some others? What would be the best fit?

Both solutions are OK. Personally I favor "the least required steps when starting from scratch".

As a developer, I prefer not having to install a lot of things to build something for the first time. That makes me more likely to contribute to a project than if I have to install a whole toolchain to get to a binary.

For end users, we should strive to provide a self-contained .msi that includes any library we use. Possibly including the tesseract DLLs (so the user can replace them with newer versions if they want) would be better than statically linking tesseract.

cppan might have been the most convenient thing when it was added 3 years ago; it might not be the best solution today.

Since you are doing it, I'd say do whatever you prefer that works. If you get CCExtractor to report 4.1.1 (or whatever the current version is) and actually work, that's a better situation than what we have now.

@canihavesomecoffee is doing the GH actions integration (so we can get a full binary from GH, instead of me manually building releases). It would be great to have this working again.

apovalyaev commented 4 years ago

To make things work automatically, the project should provide both the tesseract-ocr libraries and compatible tessdata (this is what this issue is about). Hence, when building the solution/package, we need to: (A) Replace the outdated "cppan" libraries with newly rebuilt versions; (B) Add a tessdata directory to the repository cloned from https://github.com/ccextractor.

As for Step (A)... Below are the steps to make the project use the packages supplied by vcpkg instead of the precompiled "cppan" ones (in other words, all the libraries in the directories windows\libs\lib\release-lib and windows\libs\lib\debug-lib).

This is all about the "Debug-Full" and "Release-Full" build modes:

1) Remove all files from the directories

    windows\libs\lib\release-lib
    windows\libs\lib\debug-lib

and update the project's additional libraries settings to remove the corresponding library dependencies (those "cppan" libraries).

2) Issue the following command to stop VS from automatically linking libraries supplied by vcpkg:

    vcpkg integrate remove

Then:

2.1) Export the packages:

    vcpkg export --zip tesseract:x86-windows

NOTE: of course, it is assumed that the appropriate packages are already installed (see the vcpkg commands mentioned previously). This command automatically creates a .zip archive including all the appropriate .lib files. The name of this archive will be something like "vcpkg-export-....zip" (the name can be taken from the vcpkg command output).

2.2) Extract the archive to some appropriate location, for example:

    vcpkg-export-20200427-142748\installed\x86-windows\lib

2.3) Copy all libraries from the "installed\x86-windows\lib" subdirectory to ccextractor windows\libs\lib\release-lib (for release). The same goes for debug:

    ${vcpkg-export-directory}\installed\x86-windows\debug\lib -> ccextractor\windows\libs\lib\debug-lib

2.4) Update the project "additional libraries" settings accordingly.
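Steps 2.2 and 2.3 amount to copying .lib files between directories, so they could be scripted. A minimal sketch in Python, assuming the archive has already been extracted; the function names `copy_libs` and `sync_vcpkg_export` are made up for illustration, and the directory layout is the one described in the steps above:

```python
import shutil
from pathlib import Path


def copy_libs(src_dir, dst_dir):
    """Copy every .lib file from src_dir into dst_dir and return the copied file names."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for lib in sorted(src.glob("*.lib")):
        shutil.copy2(lib, dst / lib.name)
        copied.append(lib.name)
    return copied


def sync_vcpkg_export(export_root, ccextractor_root, triplet="x86-windows"):
    """Copy release and debug libraries from an extracted vcpkg export into the project tree."""
    installed = Path(export_root) / "installed" / triplet
    libs = Path(ccextractor_root) / "windows" / "libs" / "lib"
    return {
        "release": copy_libs(installed / "lib", libs / "release-lib"),
        "debug": copy_libs(installed / "debug" / "lib", libs / "debug-lib"),
    }
```

It would be called as `sync_vcpkg_export(<extracted export directory>, <ccextractor checkout>)`. Step 2.4 (updating the project settings) still has to be done by hand.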

I will prepare a pull request with: (1) newly rebuilt libraries (replacing the old "cppan" ones); (2) a tessdata subdirectory added to the ccextractor project.

mirh commented 2 years ago

If TESSDATA_PREFIX isn't set, the program will just look into its root folder. And once you throw in the age appropriate models you are good. Not a big deal really.

The problem if any is that you crash badly and without explanations after "FFMpeg Media Information".

prateekmedia commented 1 year ago

@cfsmp3 What do you think of this, now that we already have the Windows build system and CI fixed?

cfsmp3 commented 1 year ago

@cfsmp3 What do you think of this, now that we already have the Windows build system and CI fixed?

I think we're still missing the issue with the trained data file. If it's not found, rather than "Not found!" it should say:

"Not found. I looked in these directories: [ xxxx, xxxx, xxxx ]"
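A minimal sketch of how such a message could be produced, in Python for brevity (the real code is C, so this only illustrates the logic). The names `default_search_dirs` and `find_traineddata` are made up, and both the search order (TESSDATA_PREFIX first, then the program's own folder) and the `tessdata` subdirectory are assumptions based on what was reported earlier in this thread:

```python
import os
from pathlib import Path


def default_search_dirs(exe_dir, environ=os.environ):
    """Candidate base directories: TESSDATA_PREFIX if set, then the program's folder."""
    dirs = []
    prefix = environ.get("TESSDATA_PREFIX")
    if prefix:
        dirs.append(prefix)
    dirs.append(exe_dir)
    return dirs


def find_traineddata(lang, search_dirs):
    """Return the first existing <dir>/tessdata/<lang>.traineddata, or raise listing every directory tried."""
    tried = []
    for base in search_dirs:
        candidate = Path(base) / "tessdata" / f"{lang}.traineddata"
        tried.append(str(candidate.parent))
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(
        f"{lang}.traineddata not found. I looked in these directories: [ "
        + ", ".join(tried)
        + " ]"
    )
```

Recording each directory as it is tried, rather than rebuilding the list when reporting the error, keeps the message in step with what was actually checked.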