Open cfsmp3 opened 4 years ago
Jokes aside, try PR #1170 on Windows; it might solve the problem. Also try deleting the data directory. You may then get a message from tesseract saying where it looked for the data and failed to find it. If that message is not to your liking, it can be modified or suppressed using the dup2 syscall.
EDIT: tesseract may have an API that makes dup2 unnecessary.
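For anyone unfamiliar with the dup2 trick mentioned above, here is a minimal sketch of the idea in Python, not CCExtractor code. It temporarily points file descriptor 2 (stderr) at the null device, so anything a native library prints there is dropped, and then restores it. The helper name suppress_stderr is made up for illustration; in C on Windows the corresponding CRT calls would be _dup and _dup2.

```python
import os
import sys
from contextlib import contextmanager


@contextmanager
def suppress_stderr():
    """Temporarily redirect file descriptor 2 to the null device."""
    sys.stderr.flush()
    saved_fd = os.dup(2)                        # keep a copy of the real stderr
    null_fd = os.open(os.devnull, os.O_WRONLY)  # "nul" on Windows, "/dev/null" elsewhere
    try:
        os.dup2(null_fd, 2)                     # fd 2 now writes to the null device
        yield
    finally:
        sys.stderr.flush()
        os.dup2(saved_fd, 2)                    # put the original stderr back
        os.close(null_fd)
        os.close(saved_fd)


if __name__ == "__main__":
    with suppress_stderr():
        os.write(2, b"this line is swallowed\n")
    os.write(2, b"this line is visible again\n")
```

Redirecting to a regular file instead of the null device would let the caller read the message back and reword it before showing it to the user.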
The issue might be the result of an improperly built solution. It should not appear if the solution is built properly. I checked with VS2015 and VS2019 (using the default SDKs) and did not run into this issue.
We just need to work on OCR + Windows.
In my opinion, at the very least:
1. Proper information to the user, including which paths are being searched. And where do the errors come from? Is it tesseract, or us? Are we bailing out before even giving tesseract a try?
2. Update tesseract 4 to last version OR downgrade to 3. But using 4.00.00 is ridiculous! It's buggy.
3. Check if there's an officially compiled binary we can use. I remember we did our own thing a long time ago. Still needed?
Labelling HARD because we seem to be unable to fix it once and for all.
cc: @ShraxO1
Now, after "Update VS project build settings", we can use the following steps (which automatically pick up the latest version of tesseract; as of now that is tesseract-4.1.1).
Build steps which use the latest version of Tesseract:
1) Clone the repository https://github.com/CCExtractor/ccextractor
2) Set up vcpkg:
2.1) git clone https://github.com/Microsoft/vcpkg.git
     cd vcpkg
     PS> .\bootstrap-vcpkg.bat
2.2) Modify vcpkg/triplets/x86-windows.cmake:
     set(VCPKG_CRT_LINKAGE static)
     set(VCPKG_LIBRARY_LINKAGE static)
3) Install the latest verified version of tesseract (NOTE: currently tesseract-4.1.1):
     vcpkg install tesseract:x86-windows
     vcpkg integrate install
4) Build the solution.
So, further steps: 1) It makes sense to update the auto-build scripts, so that the auto-build also takes the latest verified version of tesseract; 2) Something else?
I'd say there's something missing here. I followed your instructions, no errors (good), but the binary is still using tesseract-4.00dev, which makes sense: why would it pick any other version if that's the one we have inside the project?
Let's check that we are on the same page:
1) As I understand it, the only precompiled libraries inside the project are the ffmpeg ones (directory windows/libs/lib/). So a freshly cloned ccextractor should not build unless some other copy of the tesseract library, not installed through vcpkg, is already present. If the project compiled fine before you issued the "vcpkg install tesseract:x86-windows" command, it means you already had another copy of tesseract installed. It makes sense to remove it.
2) The other possible cause is the particular version of tesseract that vcpkg installed. You can use the command "vcpkg list" to check which version of tesseract is installed on your PC.
@apovalyaev Tesseract is included in those "cppan" dependencies; refer to https://github.com/CCExtractor/ccextractor/tree/master/windows/libs/lib/release-lib
Refer to https://github.com/CCExtractor/ccextractor/pull/592 for the PR, and maybe @Izaron could explain a bit if needed?
I've taken a look at #592 and see that tesseract was compiled manually. 1) #592 was closed at the beginning of 2017, so it might be a little outdated; 2) On the other hand, it looks like there is a bug in the cppan dependencies (discovered while running with the "-hardsubx" option).
I can see two ways:
- Remove the "cppan" dependencies and see how it works with vcpkg;
- Rebuild the "cppan" libraries with a newer version of tesseract;
Any others? What would be the best fit?
Both solutions are OK. Personally I favor "the least required steps when starting from scratch".
As a developer, I prefer not having to install a lot of things to build something for the first time. That makes me more likely to contribute to a project than if I have to install a whole toolchain to get to a binary.
For end users, we should strive to provide a self-contained .msi that includes any library we use. Including the tesseract DLLs (so users can replace them with newer versions if they want) would possibly be better than statically linking tesseract.
cppan might have been the most convenient thing when it was added 3 years ago; it might not be the best solution today.
Since you are doing it, I'd say do whatever you prefer that works. If you get CCExtractor to report 4.11 (or whatever the current version is) and actually work, that's a better situation than what we have now.
@canihavesomecoffee is doing the GH actions integration (so we can get a full binary from GH, instead of me manually building releases). It would be great to have this working again.
To make things work automatically, the build should provide both the tesseract-ocr libraries and compatible tessdata (this is what this issue is about). Hence, when building the solution/package, we need to: (A) Replace the outdated "cppan" libraries with newly rebuilt versions; (B) Add a tessdata directory to the https://github.com/ccextractor repository.
As for step (A): below are the steps to make the project use vcpkg-supplied packages instead of the precompiled "cppan" ones (in other words, all the libraries in the directories windows\libs\lib\release-lib and windows\libs\lib\debug-lib).
This applies to the "Debug-Full" and "Release-Full" build modes:
1) Remove all files from the directories
     windows\libs\lib\release-lib
     windows\libs\lib\debug-lib
   and update the project's additional-libraries settings to remove the corresponding library dependencies (the "cppan" libraries).
2) Issue the following command to stop VS from automatically linking libraries supplied by vcpkg:
     vcpkg integrate remove
   Then:
2.1) vcpkg export --zip tesseract:x86-windows
     NOTE: this assumes the appropriate packages are already installed (see the vcpkg commands mentioned previously). This command automatically creates a .zip archive that includes all the appropriate .lib files. The archive will be named something like "vcpkg-export-....zip" (the exact name is shown in the vcpkg command output).
2.2) Extract the archive to some appropriate location; the libraries will then be under, for example,
     vcpkg-export-20200427-142748\installed\x86-windows\lib
2.3) Copy all libraries from the "installed\x86-windows\lib" subdirectory to ccextractor\windows\libs\lib\release-lib (for release). Do the same for debug:
     ${vcpkg-export-directory}\installed\x86-windows\debug\lib -> ccextractor\windows\libs\lib\debug-lib
2.4) Update the project's "additional libraries" settings accordingly.
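The copy in step 2.3 is easy to get wrong by hand, so here is a rough Python sketch of it, assuming the archive has already been extracted. The function name copy_vcpkg_libs and its two arguments are hypothetical. The directory layout is the one described in the steps above. It only copies .lib files and does not touch the project settings from step 2.4.

```python
import shutil
from pathlib import Path


def copy_vcpkg_libs(export_root, ccextractor_root):
    """Copy release and debug .lib files from an extracted vcpkg export.

    export_root:      the extracted "vcpkg-export-..." directory (assumed layout)
    ccextractor_root: the root of the ccextractor checkout
    """
    export_root = Path(export_root)
    libs_root = Path(ccextractor_root) / "windows" / "libs" / "lib"
    installed = export_root / "installed" / "x86-windows"

    # source directory -> destination directory, as listed in step 2.3
    mapping = {
        installed / "lib": libs_root / "release-lib",
        installed / "debug" / "lib": libs_root / "debug-lib",
    }

    copied = []
    for src, dst in mapping.items():
        dst.mkdir(parents=True, exist_ok=True)
        for lib in sorted(src.glob("*.lib")):   # only static libraries are copied
            target = dst / lib.name
            shutil.copy2(lib, target)
            copied.append(target)
    return copied
```

Calling it with the extracted export directory and the checkout root returns the list of files it wrote, which is handy for checking against the "additional libraries" settings afterwards.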
I will prepare a pull request with: (1) newly rebuilt libraries (replacing the old "cppan" ones); (2) a tessdata subdirectory added to the ccextractor project.
If TESSDATA_PREFIX isn't set, the program will just look into its root folder.
And once you throw in the age appropriate models you are good. Not a big deal really.
The problem if any is that you crash badly and without explanations after "FFMpeg Media Information".
@cfsmp3 What do you think of this, now that we have the Windows build system and CI fixed?
I think we're still missing the issue with the trained data file. If it's not found, rather than "Not found!" it should say:
"Not found. I looked in these directories: [ xxxx, xxxx, xxxx ]"
While testing a previous ticket regarding hardsubx on Windows, on master. Running this exact version, just compiled:
First, the reports, as usual about eng.traineddata couldn't suck more.
Seriously, would it kill us to tell the user WHERE we expect that file to be present?
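To make the request concrete, here is a small Python sketch of a lookup that reports every directory it tried when the file is missing. It is not the CCExtractor implementation; the function name find_traineddata and the idea of passing the candidate directories as a list are assumptions for illustration. Only the file naming (eng.traineddata) comes from this thread.

```python
import os


def find_traineddata(lang, candidate_dirs):
    """Return the path of <lang>.traineddata, or fail listing every directory tried."""
    filename = f"{lang}.traineddata"
    searched = []
    for directory in candidate_dirs:
        searched.append(directory)
        path = os.path.join(directory, filename)
        if os.path.isfile(path):
            return path
    # Nothing matched: tell the user exactly where we looked.
    raise FileNotFoundError(
        f"{filename} not found. I looked in these directories: "
        f"[ {', '.join(searched)} ]"
    )
```

The point of collecting the directories in a list is that the error message and the actual search cannot drift apart: whatever was probed is exactly what gets reported.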
OK, so since I didn't remember how this worked at all, I started looking into the code a bit. We do look in TESSDATA_PREFIX, among other places such as /usr/share. Wait, what? This is Windows! Why are we looking there? Also, I see lots of / as path separator, but Windows uses \. Is this portable at all?
OK, so I set the env variable:
Still not working. Problem now is that I'm missing a \ at the end of the env variable.
OK, so let's set it correctly:
So now apparently it starts at least, but then it crashes.
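The trailing-separator trap above could be avoided by normalizing the value before using it. Here is a minimal Python sketch of that idea, using the standard ntpath module so that Windows path rules apply on any host. The function name normalize_prefix and the C:\tessdata path in the usage lines are hypothetical.

```python
import ntpath


def normalize_prefix(prefix):
    """Return the prefix in Windows form with exactly one trailing backslash."""
    # normpath converts '/' to '\\' and drops any trailing separator
    # (except for a bare drive root such as 'C:\\').
    cleaned = ntpath.normpath(prefix)
    if cleaned.endswith("\\"):
        return cleaned
    return cleaned + "\\"


if __name__ == "__main__":
    # Both spellings end up as the same value.
    print(normalize_prefix("C:\\tessdata"))
    print(normalize_prefix("C:/tessdata/"))
```

With something like this applied to TESSDATA_PREFIX, the user would not need to know whether the trailing \ is required.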