OCR-D / ocrd_all

Master repository which includes most other OCR-D repositories as submodules
MIT License

why is Tesseract built with --disable-openmp #127

Closed bertsky closed 4 years ago

bertsky commented 4 years ago

This goes back to the very first commit bringing Tesseract build rules by @stweil:

https://github.com/OCR-D/ocrd_all/blob/8fb9ee3b08848b7c2ff719ef181e85db486a3f18/Makefile#L468

Disabling OpenMP means losing implicit CPU parallelization, which can speed up single-job workflows significantly.

Native system packages are usually built with OpenMP enabled.

Can we please drop --disable-openmp?

kba commented 4 years ago

IIRC the overhead of CPU-parallelization with openmp in tesseract was much higher than the gains in efficiency of parallelization. But @stweil knows best.

bertsky commented 4 years ago

> IIRC the overhead of CPU-parallelization with openmp in tesseract was much higher than the gains in efficiency of parallelization.

Okay, so it might be inefficient usage of excess resources, but usage nevertheless! If you have a single workspace you want to process as fast as possible, and multiple cores kicking their heels, then having any within-module parallelization is better than none. And you can always disable OpenMP at runtime (by setting OMP_THREAD_LIMIT=1 OMP_NUM_THREADS=1, as workflow-configuration does when running make -j).
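For illustration, the runtime switch mentioned above can be applied per process rather than globally. This is a minimal sketch, assuming the processor is started as a subprocess; the helper names `single_threaded_env` and `run_single_threaded` are hypothetical and not part of OCR-D or workflow-configuration. It only limits the number of OpenMP threads and makes no claim about any remaining overhead.

```python
import os
import subprocess


def single_threaded_env(base_env=None):
    """Return a copy of the environment with OpenMP limited to one thread."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    env["OMP_NUM_THREADS"] = "1"
    return env


def run_single_threaded(cmd):
    """Run cmd (a list of arguments) with OpenMP limited to one thread."""
    return subprocess.run(cmd, env=single_threaded_env(), check=True)


# Example call (file names are placeholders):
# run_single_threaded(["tesseract", "page.png", "page"])
```

Copying the environment instead of modifying `os.environ` keeps the limit local to the one child process.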

stweil commented 4 years ago

That setting is there because I think it's the best for most use cases.

Tesseract with OpenMP uses exactly 4 threads for parts of the total OCR process (probably for most of ocrd-tesserocr-recognize). In practice this accelerates the OCR process, but not by a factor of 4. If you are lucky, you will get a factor of 2 or 3. The total CPU usage is much higher because of the overhead mentioned by @kba, with CPUs burning energy while synchronizing threads. There remains a significant overhead even if you set OMP_THREAD_LIMIT=1.
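To put rough numbers on this, here is a small back-of-the-envelope sketch. It assumes, as a simplification, that all threads stay busy (including time spent synchronizing) for the whole wall-clock duration, so the CPU figures are an upper bound; the function names are made up for this illustration.

```python
def parallel_efficiency(speedup, threads):
    """Fraction of the ideal speedup that is actually achieved."""
    return speedup / threads


def relative_cpu_time(speedup, threads):
    """Total CPU time relative to a single-threaded run.

    Assumes every thread is busy for the whole (shortened) wall-clock time.
    """
    return threads / speedup


for speedup in (2, 3):
    eff = parallel_efficiency(speedup, 4)
    cpu = relative_cpu_time(speedup, 4)
    print(f"speedup {speedup} on 4 threads: efficiency {eff:.0%}, CPU time x{cpu:.2f}")
```

Under that assumption, a speedup of 2 on 4 threads means half of the CPU time is overhead, which is why several independent single-threaded runs make better use of the same cores.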

For training, the situation is even worse, because training uses up to 8 threads (really bad on typical PCs which only support 4 or 6 parallel threads) and the overhead is so large that there is only a very small performance gain, if any.

On the other hand, running several Tesseract processes in parallel is really easy and works very efficiently without any overhead, both for recognition (single-threaded) and training (up to two threads). Even with OCR-D, that can already be done with different workspaces.

Instead of changing the compiler options, it would be better to support parallelisation on the page level for selected processors. That would allow optimized usage of the available resources, from 2 or 4 threads on older hardware and 6 threads on current PCs to 32 or even up to 128 threads on recent server hardware.
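A minimal sketch of what page-level parallelisation could look like, assuming each page can be processed independently by a single-threaded Tesseract process. This is not how OCR-D implements or plans to implement it; `process_pages` and `ocr_page` are hypothetical names, and the image paths are placeholders.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def process_pages(pages, worker, max_workers=None):
    """Apply worker to every page concurrently; results keep the page order."""
    if max_workers is None:
        # One worker per available CPU thread, at least one.
        max_workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, pages))


def ocr_page(image_path):
    """Run one single-threaded Tesseract process for one page image."""
    env = dict(os.environ, OMP_THREAD_LIMIT="1")
    output_base = os.path.splitext(image_path)[0]
    subprocess.run(["tesseract", image_path, output_base], env=env, check=True)
    return output_base + ".txt"


# Example call (paths are placeholders):
# process_pages(["0001.png", "0002.png"], ocr_page, max_workers=4)
```

Threads are sufficient here because each worker mostly waits for an external process; `max_workers` is the knob that would be matched to the number of hardware threads.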

stweil commented 4 years ago

@bertsky, a simple solution for you could be implemented by moving Tesseract's configure options into a macro, so you could override it in your personal local.mk. Or use the existing Debian packages which were built with OpenMP enabled.

bertsky commented 4 years ago

Thanks @stweil for that clarification.

So we cannot opt out of the overhead with the environment variable, and having OpenMP compiled in breaks some use cases. Then I agree we should ignore this option, and focus on implementing https://github.com/OCR-D/core/issues/322 as a general solution.

bertsky commented 4 years ago

> a simple solution for you could be implemented by moving Tesseract's configure options into a macro, so you could override it in your personal local.mk.

Ah yes, having that option would be useful I think.

> Or use the existing Debian packages which were built with OpenMP enabled.

I thought about that. Is there anything you know of (besides losing the benefits of a static build) that Alex' PPA builds could cause trouble with?

stweil commented 4 years ago

Nothing that I am aware of. Of course Alex' PPA builds are typically a little behind Git master (but usually not too much). And they must be told where to find the OCR-D model files for Tesseract because they use a different default path.

bertsky commented 4 years ago

> And they must be told where to find the OCR-D model files for Tesseract because they use a different default path.

Right! But we can override TESSDATA and TESSDATA_PREFIX for OCR-D processors.
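As a rough sketch of such an override, the model directory can be passed to a processor through its environment. TESSDATA_PREFIX is the variable Tesseract itself reads; the helper names below are hypothetical, and which directory the variable has to point to depends on the Tesseract version, so treat the path as an assumption.

```python
import os


def env_with_tessdata(tessdata_dir, base_env=None):
    """Return a copy of the environment with TESSDATA_PREFIX overridden."""
    env = dict(os.environ if base_env is None else base_env)
    env["TESSDATA_PREFIX"] = str(tessdata_dir)
    return env


def list_models(tessdata_dir):
    """Names of the models (*.traineddata files) found in tessdata_dir."""
    suffix = ".traineddata"
    return sorted(
        name[: -len(suffix)]
        for name in os.listdir(tessdata_dir)
        if name.endswith(suffix)
    )
```

Calling `list_models` on the chosen directory before starting a processor is a cheap way to check that the override points at the models one expects.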

One complication is that ocrd_tesserocr currently wraps tesseract-ocr (4.1.1) instead of tesseract-ocr-devel (5.0) in its deps-ubuntu rule. Do you recall what the reason behind that choice was?

stweil commented 4 years ago

No, sorry, I don't remember that. Maybe because 4.1.x is the official stable version, while there is currently no released 5.0?

Version 5 is a moving target. The major version was increased from 4 to 5 because it is incompatible on the API level (for example classes and structs were optimized, header files removed, proprietary data types removed), and I would like to keep it open until more of that kind has been done. Otherwise we'd have to go to release 6 for the next incompatible change.

bertsky commented 4 years ago

Sure, there's gotta be a major version for all dev work, I have always found that a good decision.

> No, sorry, I don't remember that. Maybe because 4.1.x is the official stable version, while there is currently no released 5.0?

Either that or because of the dependency on tesserocr, which does not always keep up with 5.0 changes fast enough.

And would you recommend against using 4.1.1 for OCR-D purposes, or is this difference also negligible on average?

stweil commented 4 years ago

tesserocr works with 4.1 and 5 / git master, and I try to avoid breaking that, either by doing changes in a compatible way or by updating tesserocr.

Git master / 5 typically has more fixes which are only backported from time to time. And it has better performance. In most cases I use OCR-D with a very recent Tesseract, but would not recommend against using 4.1.1.

bertsky commented 4 years ago

> tesserocr works with 4.1 and 5 / git master, and I try to avoid breaking that, either by doing changes in a compatible way or by updating tesserocr.

Great, then we could change ocrd_tesserocr's deps-ubuntu to point to the tesseract-ocr-devel PPA. If users happen to update Tesseract too soon, they can always downgrade from the PPA until tesserocr is up to speed.

Or should we keep the deb-based installation variant more conservative on purpose? (Then I do indeed need a way to customize the configure options in the source build here.)