DanBloomberg / leptonica

Leptonica is an open source library containing software that is broadly useful for image processing and image analysis applications. The official github repository for Leptonica is: danbloomberg/leptonica. See leptonica.org for more documentation.

Version 1.83.0 breaks tesseract 4.x builds #618

Status: Closed (cinchent closed this issue 2 years ago)

cinchent commented 2 years ago

tesseract 4.1.3 build fails first here, with many errors following:

devanagari_processing.cpp: In member function 'bool tesseract::ShiroRekhaSplitter::Split(bool, tesseract::DebugPixa*)':
devanagari_processing.cpp:132:19: error: invalid use of incomplete type 'struct Pixa'
     Box* box = ccs->boxa->box[i];
                   ^~

In that version PIXA is never resolved so all derived type definitions fail similarly.

Reverted to version 1.82.0 for tesseract build, and all was well.

(Some of us are still frozen on tess 4.0, as 5.0 is behaviorally very different, and not in a good way.)

DanBloomberg commented 2 years ago

Thank you for reporting this -- I'm sorry it broke.

A simple workaround is to #include pix_internal.h in alltypes.h. @stweil, what do you suggest?

stweil commented 2 years ago

The same file also contains correct code that shows how to get a Box without using internal information:

  for (int i = 0; i < boxaGetCount(regions_to_clear); ++i) {
    Box *box = boxaGetBox(regions_to_clear, i, L_CLONE);
    pixClearInRect(splitted_image_, box);
    boxDestroy(&box);
  }

Tesseract should have used pixaGetBox, boxGetGeometry and boxSetGeometry.

This shows that any change to that internal information, or its removal, can break existing code, so it is a major API change. The good news, at least for this case, is that the broken code can be fixed easily.
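To illustrate why direct field access breaks while accessor calls keep working, here is a minimal, self-contained sketch of the opaque-type pattern. It does not use Leptonica: DemoBox, DemoBoxa, totalArea and the demo* functions are hypothetical stand-ins invented for this example, and they skip the clone/destroy reference counting that the real boxaGetBox and boxDestroy perform.

```cpp
#include <vector>

// "Public header" part: the types are only forward-declared (opaque).
struct DemoBox;   // hypothetical stand-in for a box type
struct DemoBoxa;  // hypothetical stand-in for an array-of-boxes type

int demoBoxaGetCount(const DemoBoxa *boxa);
DemoBox *demoBoxaGetBox(DemoBoxa *boxa, int index);
void demoBoxGetGeometry(const DemoBox *box, int *x, int *y, int *w, int *h);
void demoBoxSetGeometry(DemoBox *box, int x, int y, int w, int h);

// Client code: compiles even though DemoBoxa is still incomplete here.
// Writing boxa->boxes[i] at this point would fail with
// "invalid use of incomplete type", like the error in the report.
int totalArea(DemoBoxa *boxa) {
  int total = 0;
  for (int i = 0; i < demoBoxaGetCount(boxa); ++i) {
    int x, y, w, h;
    demoBoxGetGeometry(demoBoxaGetBox(boxa, i), &x, &y, &w, &h);
    total += w * h;
  }
  return total;
}

// "Library internals": the layout can change without touching the client.
struct DemoBox {
  int x, y, w, h;
};
struct DemoBoxa {
  std::vector<DemoBox> boxes;
};

int demoBoxaGetCount(const DemoBoxa *boxa) {
  return static_cast<int>(boxa->boxes.size());
}
DemoBox *demoBoxaGetBox(DemoBoxa *boxa, int index) {
  return &boxa->boxes[index];
}
void demoBoxGetGeometry(const DemoBox *box, int *x, int *y, int *w, int *h) {
  *x = box->x;
  *y = box->y;
  *w = box->w;
  *h = box->h;
}
void demoBoxSetGeometry(DemoBox *box, int x, int y, int w, int h) {
  box->x = x;
  box->y = y;
  box->w = w;
  box->h = h;
}
```

Because totalArea only goes through the accessors, it is unaffected when the struct definitions move to a different header or change layout, which is the situation the Tesseract code ran into.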

stweil commented 2 years ago

"Some of us are still frozen on tess 4.0, as 5.0 is behaviorally very different, and not in a good way"

That's not Leptonica related, but as I work on Tesseract I am curious to know in which way Tesseract 5 differs not in a good way. I thought it was mainly compatible with some valuable improvements.

stweil commented 2 years ago

I now fixed the code for Tesseract 5 in commit https://github.com/tesseract-ocr/tesseract/commit/f36c0d019be59cae3b96da0d89d870dbe83e9714.

cinchent commented 2 years ago

Thanks for your prompt responses. I'll wait for an official fix, as in our application we have automated deploy scripting that pulls from your repo and builds, so we wouldn't want to apply manual patches.

stweil commented 2 years ago

Note that there won't be a fix for Tesseract 4.

cinchent commented 2 years ago

@stweil Cool that you're on the Tess development team as well...

The downside we encounter using 5.x vs. 4.x is seriously degraded OCR accuracy. Our application interprets closed-captioning from live video broadcasts, and in 4.0, it was quite accurate, with the typical conflations of 1 <--> I and the like for the most part, but with 5.0, the extracted text is nearly complete gibberish. We have it on our back burner to investigate exactly why -- perhaps it's a training model issue, we just use the standard training data set -- but for the time being we're continuing to back-rev to 4.x for code stability.

stweil commented 2 years ago

It would help if you could open an issue for tesseract-ocr/tesseract and add an example there which shows the degradation. Or even better you could try to bisect which commit(s) for the Tesseract code introduced it.

cinchent commented 2 years ago

Sure... will do when we have time. Our backlog is pretty extensive right now, so if 3rd-party tools like Tess introduce instabilities to our processing, we're just freezing on known-good versions until obsolescence forces us to upgrade. But yeah, good suggestions for when that bubbles to the top of the stack.