Closed jbarth-ubhd closed 2 years ago
Tried to reproduce this bug with plain tesseract:
tesseract --tessdata-dir tessdata_dir -l Latin OCR-D-005_00001.IMG-BIN.png output --psm 11 -c textord_tabfind_find_tables=0 -c poly_wide_objects_better=0
but I don't know if those options are equivalent to the above.
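When trying to reproduce the crash outside the OCR-D workflow, it can help to tell a signal-induced abort (such as a segmentation fault) apart from an ordinary error exit. The following is only a minimal sketch, assuming a POSIX system where Python's `subprocess` reports a child killed by signal N as return code -N. The helper names `describe_exit` and `run_and_describe` are hypothetical and not part of tesseract or ocrd_tesserocr.

```python
import signal
import subprocess
from typing import List


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a short description."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative return code -N means "terminated by signal N".
        try:
            return "killed by " + signal.Signals(-returncode).name
        except ValueError:
            # Signal number without a symbolic name on this platform.
            return f"killed by signal {-returncode}"
    return f"exit status {returncode}"


def run_and_describe(cmd: List[str]) -> str:
    """Run a command (hypothetical helper) and describe how it ended."""
    return describe_exit(subprocess.run(cmd).returncode)
```

Passing the plain tesseract command line from above as a list of arguments to `run_and_describe` would then report `killed by SIGSEGV` if the process segfaults.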
I end up with approx. 1500 core dump files ("core.12345") for 62k TIFs = 2.4 % (!). Dear @stweil, could you prioritize this issue?
I must try to reproduce it in my environment. That would be easier if the problem also occurred with plain tesseract.
@jbarth-ubhd poly_allow_detailed_fx and poly_wide_objects_better belong to a completely different, Tesseract-internal mechanism. (It is only used indirectly, as PolygonalCopy, in a few places, like debugging or equation detection, never for extracting outlines. Since it is not exposed to the API, I have no idea what the quality would be.)
The mechanism used for ocrd_tesserocr's shrink_polygons is explained by its documentation:
When detecting any segments, annotate polygon coordinates instead of bounding box rectangles by projecting the convex hull of all symbols.
If shrink_polygons, then during segmentation (on any level), query Tesseract for all symbols/glyphs of each segment and calculate the convex hull for them. Annotate the resulting polygon instead of the coarse bounding box. (This is more precise and helps avoid overlaps between neighbours, especially when not segmenting all levels at once.)
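To illustrate the idea described in that documentation, here is a minimal sketch of taking the convex hull of glyph bounding boxes. It is not the actual ocrd_tesserocr implementation: the `(left, top, right, bottom)` box layout, the function names and the sample boxes are assumptions for illustration, and the hull is computed with Andrew's monotone chain algorithm rather than whatever the processor uses internally.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), assumed layout


def cross(o: Point, a: Point, b: Point) -> int:
    """2D cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Convex hull via Andrew's monotone chain; collinear points are dropped."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def hull_of_boxes(boxes: List[Box]) -> List[Point]:
    """Polygon around all glyph boxes of one segment (hypothetical helper)."""
    corners: List[Point] = []
    for left, top, right, bottom in boxes:
        corners += [(left, top), (right, top), (right, bottom), (left, bottom)]
    return convex_hull(corners)


# Two made-up glyph boxes; the second one sits lower and further right.
glyph_boxes = [(0, 0, 10, 10), (20, 5, 30, 15)]
segment_polygon = hull_of_boxes(glyph_boxes)
```

For these two boxes the hull has six corners, whereas the plain bounding box (0, 0, 30, 15) would also cover the empty areas above the second glyph and below the first, which is where overlaps with neighbouring segments tend to come from.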
@stweil, the underlying cause is a bug in the iterator (state) functions – but I have no time to work on Tesseract, and my fix has become more difficult to work on after the recent upstream changes.
Please try this patch for the Tesseract code:
diff --git a/src/ccmain/pageiterator.cpp b/src/ccmain/pageiterator.cpp
index e8d528b6..829a1cd1 100644
--- a/src/ccmain/pageiterator.cpp
+++ b/src/ccmain/pageiterator.cpp
@@ -566,7 +566,14 @@ void PageIterator::Orientation(tesseract::Orientation *orientation,
                                tesseract::WritingDirection *writing_direction,
                                tesseract::TextlineOrder *textline_order,
                                float *deskew_angle) const {
-  BLOCK *block = it_->block()->block;
+  auto *block_res = it_->block();
+  if (block_res == nullptr) {
+    *orientation = ORIENTATION_PAGE_UP;
+    *writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
+    *textline_order = TEXTLINE_ORDER_TOP_TO_BOTTOM;
+    return;
+  }
+  auto *block = block_res->block;
 
   // Orientation
   FCOORD up_in_image(0.0, 1.0);
@jbarth-ubhd, the latest tesseract git main includes the patch which fixes the segmentation fault. Maybe you want to try it and can report whether it produces usable results for the examples which crashed with the old code. I cannot test it myself without the model historical_french_2020-10-14 which you used for the example.
Thanks. I could run your workflow after an update to latest tesseract and had no problems.
@jbarth-ubhd The fix @stweil mentioned is also part of the newest ocrd_all release, so please update your docker/singularity image.
@jbarth-ubhd, can we close this issue?
yes.
And with this image:
https://digi.ub.uni-heidelberg.de/diglitData/v/justinian1627bd2_-_1281.tif
and ocrd.sif (singularity container) created from docker ocrd_all at Nov 9 10:13 2021 & at Jan 17 15:11 2022 [UPDATE]
and this workflow:
I get a segmentation fault.