OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

docker/install: build Tesseract from source #197

Closed joschrew closed 7 months ago

joschrew commented 7 months ago

This PR is part of a series offering individual ocrd modules as Docker containers (ocrd slim containers) to be used with ocr-d network.

This Dockerfile currently doesn't work in all cases and still needs updates. I created the PR anyway because I need it for my tests. EDIT: it now works. (This basically migrates all the install-tesseract rules from ocrd_all's Makefile to here, where they actually belong.)

My idea was to possibly build the Tesseract container with ocrd_all:

cd ocrd_all
git submodule update --init tesserocr/ core/ tesseract/ ocrd_tesserocr/
docker build --build-arg="OCRD_MODULES=core ocrd_tesserocr tesseract tesserocr " --no-cache -t my-ocrd-slim-container .

codecov[bot] commented 7 months ago

Welcome to Codecov :tada:

Once merged to your default branch, Codecov will compare your coverage reports and display the results in this comment.

Thanks for integrating Codecov - We've got you covered :open_umbrella:

stweil commented 7 months ago

I wonder whether there are still reasons for building the tesseract binary.

Using the package from a recent Linux distribution is simpler and would save significant build time.

Another possible approach would also work for tesserocr and some more parts of OCR-D: OCR-D could use its own package repositories for all parts with simple dependencies.

bertsky commented 7 months ago

> I wonder whether there are still reasons for building the tesseract binary.
>
> Using the package from a recent Linux distribution is simpler and would save significant build time.

Because most of the time, we cannot use Tesseract from a Linux distribution: our base distro is usually older than the current one, and we have no control over Tesseract features that we actually need. The same goes for PPA.

We had good reasons to pin to a specific Tesseract version via source build in subrepo. No reason to give that up now.

> Another possible approach would also work for tesserocr and some more parts of OCR-D: OCR-D could use its own package repositories for all parts with simple dependencies.

Much simpler: conda

joschrew commented 7 months ago

@kba: Your changes resolved all the errors with my test workspace. I added a resmgr call to the Docker image to add the eng traineddata; without it, I get an error when trying to process.

Edit: Maybe equ.traineddata and osd.traineddata should be added as well; I am not sure.
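
To make the model question easier to test, here is a minimal sketch of a check that could run inside the image after the models have been fetched. The helper name missing_traineddata is hypothetical and not part of ocrd_tesserocr or ocrd_all. It assumes Tesseract 4 or later, where TESSDATA_PREFIX points at the directory that holds the *.traineddata files.

```shell
#!/bin/sh
# Hypothetical helper (not part of ocrd_tesserocr): print the name of every
# model that has no <model>.traineddata file in the given tessdata directory.
# Usage: missing_traineddata DIR MODEL...
missing_traineddata() {
    dir=$1
    shift
    for model in "$@"; do
        # A model counts as present only if its traineddata file exists.
        [ -f "$dir/$model.traineddata" ] || printf '%s\n' "$model"
    done
}
```

For example, calling missing_traineddata "$TESSDATA_PREFIX" eng osd equ prints nothing when all three models mentioned in this thread are installed, and otherwise prints one missing model name per line.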

bertsky commented 7 months ago

Adapting CircleCI config should follow.

In fact, since it already seems broken on master – unfortunately CircleCI does not keep the logs long enough, but I guess it's about the TESSDATA_PREFIX / resmgr location – we should fix this here.

So I suggest (after rewriting deps-ubuntu as proposed above) to update the CircleCI config to do make install-tesseract install-tesserocr before make install.

bertsky commented 7 months ago

> In fact, since it already seems broken on master – unfortunately CircleCI does not keep the logs long enough, but I guess it's about the TESSDATA_PREFIX / resmgr location – we should fix this here.
>
> So I suggest (after rewriting deps-ubuntu as proposed above) to update the CircleCI config to do make install-tesseract install-tesserocr before make install.

Now the CI config definitely needs make install-tesseract install-tesserocr. Also, we must drop the chmod workaround (for which there is no need anymore).

bertsky commented 7 months ago

> Now the CI config definitely needs make install-tesseract install-tesserocr. Also, we must drop the chmod workaround (for which there is no need anymore).

@joschrew do you want me to make that change (on your fork's writable branch)?

bertsky commented 7 months ago

Oh, maybe we should also migrate make install tesseract-training here? (Once we remove these rules from ocrd_all, there would be no more way to compile lstmtraining, combine_tessdata etc.)