Closed cinchent closed 2 years ago
Thank you for reporting this -- I'm sorry it broke.
A simple work-around is to #include `pix_internal.h` in `alltypes.h`. @stweil, what do you suggest?
The same file also contains correct code that shows how to get a Box without using internal information:
```cpp
for (int i = 0; i < boxaGetCount(regions_to_clear); ++i) {
  Box *box = boxaGetBox(regions_to_clear, i, L_CLONE);
  pixClearInRect(splitted_image_, box);
  boxDestroy(&box);
}
```
Tesseract should have used `pixaGetBox`, `boxGetGeometry` and `boxSetGeometry`.
This shows that any change to that internal information, or its removal, can break existing code, so it is a major API change. The good news, at least in this case, is that the broken code can be fixed easily.
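To illustrate why accessor-based code survives this kind of change, here is a minimal, self-contained sketch of the opaque-type pattern. It does not use Leptonica: `DemoBox`, `demoBoxCreate`, `demoBoxGetGeometry`, `demoBoxSetGeometry`, `demoBoxDestroy` and `padBox` are all hypothetical stand-ins, loosely modelled on `Box`, `boxGetGeometry` and `boxSetGeometry`. The caller only ever sees a forward declaration, so it cannot depend on the struct layout.

```cpp
#include <cassert>

// "Public header": callers see an incomplete type plus accessor functions.
// All names here are hypothetical stand-ins, not Leptonica's real API.
struct DemoBox;
DemoBox *demoBoxCreate(int x, int y, int w, int h);
void demoBoxGetGeometry(const DemoBox *box, int *px, int *py, int *pw, int *ph);
void demoBoxSetGeometry(DemoBox *box, int x, int y, int w, int h);
void demoBoxDestroy(DemoBox **pbox);

// Caller code: grows a box by `margin` on every side using accessors only.
// Writing box->x here would not compile, because DemoBox is still incomplete
// at this point. That is the same kind of error that direct field access
// produces once a library hides its struct definitions.
void padBox(DemoBox *box, int margin) {
  int x, y, w, h;
  demoBoxGetGeometry(box, &x, &y, &w, &h);
  demoBoxSetGeometry(box, x - margin, y - margin,
                     w + 2 * margin, h + 2 * margin);
}

// "Private implementation": the layout can change without touching padBox.
struct DemoBox {
  int x;
  int y;
  int w;
  int h;
};

DemoBox *demoBoxCreate(int x, int y, int w, int h) {
  return new DemoBox{x, y, w, h};
}

void demoBoxGetGeometry(const DemoBox *box, int *px, int *py, int *pw, int *ph) {
  assert(box != nullptr);
  *px = box->x;
  *py = box->y;
  *pw = box->w;
  *ph = box->h;
}

void demoBoxSetGeometry(DemoBox *box, int x, int y, int w, int h) {
  assert(box != nullptr);
  box->x = x;
  box->y = y;
  box->w = w;
  box->h = h;
}

// Frees the box and nulls the caller's pointer, like the boxDestroy(&box) idiom.
void demoBoxDestroy(DemoBox **pbox) {
  if (pbox == nullptr) return;
  delete *pbox;
  *pbox = nullptr;
}
```

With this split, code like `padBox` keeps compiling whether or not the struct definition is visible to it, which is why switching to the accessors is the more durable fix than re-exposing the internal header.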
> Some of us are still frozen on tess 4.0, as 5.0 is behaviorally very different, and not in a good way

That's not Leptonica related, but as I work on Tesseract I am curious to know in which way Tesseract 5 differs not in a good way. I thought it was mainly compatible with some valuable improvements.
I have now fixed the code for Tesseract 5 in commit https://github.com/tesseract-ocr/tesseract/commit/f36c0d019be59cae3b96da0d89d870dbe83e9714.
Thanks for your prompt responses. I'll wait for an official fix, as in our application we have automated deploy scripting that pulls from your repo and builds, so we wouldn't want to apply manual patches.
Note that there won't be a fix for Tesseract 4.
@stweil Cool that you're on the Tess development team as well...
The downside we encounter using 5.x vs. 4.x is seriously degraded OCR accuracy. Our application interprets closed-captioning from live video broadcasts. In 4.0 it was quite accurate, with the typical conflations of 1 <--> I and the like for the most part, but with 5.0 the extracted text is nearly complete gibberish. We have it on our back burner to investigate exactly why (perhaps it's a training model issue, since we just use the standard training data set), but FTTB we're continuing to back-rev to 4.x for code stability.
It would help if you could open an issue for tesseract-ocr/tesseract and add an example there which shows the degradation. Or, even better, you could try to bisect which commit(s) in the Tesseract code introduced it.
Sure... will do when we have time. Our backlog is pretty extensive right now, so if 3rd-party tools like Tess introduce instabilities to our processing, we're just freezing on known-good versions until forced to by obsolescence. But yeah, good suggestions for when that bubbles to the top of the stack.
tesseract 4.1.3 build fails first here, with many errors following:
In that version `PIXA` is never resolved, so all derived type definitions fail similarly. Reverted to version 1.82.0 for the tesseract build, and all was well.
(Some of us are still frozen on tess 4.0, as 5.0 is behaviorally very different, and not in a good way.)