OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

ocrd-tesserocr-segment: segmentation fault #182

Closed. jbarth-ubhd closed this issue 2 years ago

jbarth-ubhd commented 2 years ago

With this image:

https://digi.ub.uni-heidelberg.de/diglitData/v/justinian1627bd2_-_1281.tif

and ocrd.sif (a Singularity container) built from the ocrd_all Docker image on Nov 9 10:13 2021 and again on Jan 17 15:11 2022 [UPDATE]

and this workflow:

/usr/bin/time singularity exec $HOME/ocrd.sif ocrd workspace init >>ocrd.log 2>&1 || exit
/usr/bin/time singularity exec $HOME/ocrd.sif ocrd workspace add -g P_00001 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tiff OCR-D-IMG/00001.tif >>ocrd.log 2>&1 || exit

/usr/bin/time singularity exec $HOME/ocrd.sif ocrd-olena-binarize -P k 0.10 -I OCR-D-IMG -O OCR-D-001 >>ocrd.log 2>&1 || exit
/usr/bin/time singularity exec $HOME/ocrd.sif ocrd-anybaseocr-crop -I OCR-D-001 -O OCR-D-002 >>ocrd.log 2>&1 || exit
/usr/bin/time singularity exec $HOME/ocrd.sif ocrd-olena-binarize -I OCR-D-002 -O OCR-D-003 >>ocrd.log 2>&1 || exit
/usr/bin/time singularity exec $HOME/ocrd.sif ocrd-cis-ocropy-deskew -P level-of-operation page -I OCR-D-003 -O OCR-D-004 >>ocrd.log 2>&1 || exit
/usr/bin/time singularity exec $HOME/ocrd.sif ocrd-tesserocr-segment -P find_tables false -P shrink_polygons true -I OCR-D-004 -O OCR-D-005 >>ocrd.log 2>&1 || exit
/usr/bin/time singularity exec $HOME/ocrd.sif ocrd-calamari-recognize -I OCR-D-005 -O OCR-D-OCR -P checkpoint "$HOME/ocrd_models/calamari/calamari_models_experimental/historical_french_2020-10-14/*.ckpt.json" >>ocrd.log 2>&1 || exit

I get a segmentation fault:

Core was generated by `/usr/bin/python3 /usr/bin/ocrd-tesserocr-segment -P find_tables false -P shrink'.
Program terminated with signal 11, Segmentation fault.
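
When a workflow like the one above is driven from a script instead of a shell, the crashing step can be identified from its exit status rather than from core files. The following is a minimal sketch and not part of OCR-D: `describe_exit` and `run_step` are hypothetical helper names. It assumes a POSIX system, where Python's `subprocess` reports death by signal N as return code -N and shells report it as 128 + N. Whether `singularity exec` passes the child's status through in that form is an assumption that should be checked.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a process exit status into a short human-readable description."""
    if returncode == 0:
        return "ok"
    # -N is the subprocess convention, 128 + N the shell convention
    # for "terminated by signal N".
    signum = -returncode if returncode < 0 else returncode - 128
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Not a valid signal number, so treat it as an ordinary error code.
        return f"failed with exit code {returncode}"
    return f"killed by signal {signum} ({name})"


def run_step(argv):
    """Run one workflow step and return (returncode, description)."""
    proc = subprocess.run(argv)
    return proc.returncode, describe_exit(proc.returncode)
```

With this, a step that ends like the one reported here would be described as `killed by signal 11 (SIGSEGV)`, whether the status arrives as -11 or as 139.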

jbarth-ubhd commented 2 years ago

Tried to reproduce this bug with plain tesseract:

tesseract --tessdata-dir tessdata_dir -l Latin OCR-D-005_00001.IMG-BIN.png output --psm 11 -c textord_tabfind_find_tables=0 -c poly_wide_objects_better=0

but I don't know whether those options are equivalent to the ones used in the workflow above.

jbarth-ubhd commented 2 years ago

I now have approx. 1500 "core.12345" files from 62k TIFs = 2.4 % (!). Dear @stweil, could you prioritize this issue?
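
The failure rate quoted above can be recomputed directly from the working directory. This is a minimal sketch under assumptions: it supposes that core dumps are named `core.<pid>` and sit directly inside one directory. `count_core_files` and `failure_rate` are hypothetical helper names.

```python
from pathlib import Path


def count_core_files(directory):
    """Count files named core.<digits> directly inside `directory`."""
    return sum(
        1
        for path in Path(directory).glob("core.*")
        if path.is_file() and path.suffix[1:].isdigit()
    )


def failure_rate(crashes, total):
    """Return the share of crashed runs as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * crashes / total


# Figures from the report: about 1500 core files for about 62,000 TIFF images.
print(f"{failure_rate(1500, 62000):.1f} %")
```

In practice the first argument would come from `count_core_files(".")` run in the directory where the workflow was executed.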

stweil commented 2 years ago

I need to reproduce it in my environment first. That would be easier if the problem also occurred with plain tesseract.

bertsky commented 2 years ago

@jbarth-ubhd poly_allow_detailed_fx and poly_wide_objects_better belong to a completely different, Tesseract-internal mechanism. (It is only used – as PolygonalCopy – indirectly in a few places, like debugging or equation detection, never for extracting outlines. Since it is not exposed to the API, I have no idea what the quality would be.)

The mechanism used for ocrd_tesserocr's shrink_polygons is explained by its documentation:

When detecting any segments, annotate polygon coordinates instead of bounding box rectangles by projecting the convex hull of all symbols.

If shrink_polygons, then during segmentation (on any level), query Tesseract for all symbols/glyphs of each segment and calculate the convex hull for them. Annotate the resulting polygon instead of the coarse bounding box. (This is more precise and helps avoid overlaps between neighbours, especially when not segmenting all levels at once.)
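
The idea of replacing a coarse bounding box by the convex hull of the glyphs can be illustrated without Tesseract. The following is a minimal sketch and not the code used by ocrd_tesserocr: it assumes the glyphs are given as axis-aligned boxes `(x0, y0, x1, y1)`, and it uses Andrew's monotone chain algorithm, which is a choice made here for illustration only. `cross`, `convex_hull` and `hull_of_boxes` are hypothetical names.

```python
def cross(o, a, b):
    """Z component of the cross product of the vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the convex hull of 2D points (Andrew's monotone chain).

    The first vertex is not repeated at the end; collinear points are dropped.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop the last vertex while it does not make a strict left turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def hull_of_boxes(boxes):
    """Convex hull of the corners of all glyph boxes of one segment."""
    corners = []
    for x0, y0, x1, y1 in boxes:
        corners += [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    return convex_hull(corners)


# Three made-up glyph boxes: two on one baseline, one offset to the lower right.
glyphs = [(0, 0, 10, 10), (12, 0, 20, 10), (40, 20, 50, 30)]
polygon = hull_of_boxes(glyphs)
print(polygon)
```

For these three boxes the bounding rectangle would span (0, 0) to (50, 30), whereas the hull is a six-vertex polygon that leaves out the empty corners, which is why neighbouring segments overlap less.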

@stweil, the underlying cause is a bug in the iterator (state) functions – but I have no time to work on Tesseract, and my fix has become more difficult to work on after the recent upstream changes.

stweil commented 2 years ago

Please try this patch for the Tesseract code:

diff --git a/src/ccmain/pageiterator.cpp b/src/ccmain/pageiterator.cpp
index e8d528b6..829a1cd1 100644
--- a/src/ccmain/pageiterator.cpp
+++ b/src/ccmain/pageiterator.cpp
@@ -566,7 +566,14 @@ void PageIterator::Orientation(tesseract::Orientation *orientation,
                                tesseract::WritingDirection *writing_direction,
                                tesseract::TextlineOrder *textline_order,
                                float *deskew_angle) const {
-  BLOCK *block = it_->block()->block;
+  auto *block_res = it_->block();
+  if (block_res == nullptr) {
+    *orientation = ORIENTATION_PAGE_UP;
+    *writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
+    *textline_order = TEXTLINE_ORDER_TOP_TO_BOTTOM;
+    return;
+  }
+  auto *block = block_res->block;

   // Orientation
   FCOORD up_in_image(0.0, 1.0);

stweil commented 2 years ago

@jbarth-ubhd, the latest tesseract git main includes the patch that fixes the segmentation fault. Perhaps you could try it and report whether it produces usable results for the examples that crashed with the old code. I cannot test it myself without the model historical_french_2020-10-14 which you used for the example.

jbarth-ubhd commented 2 years ago

The model is here: https://github.com/Calamari-OCR/calamari_models/tree/16630e34ed77e7d6fa735c2505c82c081dbeb42a/historical_french

stweil commented 2 years ago

Thanks. I was able to run your workflow after updating to the latest tesseract and had no problems.

kba commented 2 years ago

@jbarth-ubhd The fix @stweil mentioned is also part of the newest ocrd_all release, so please update your docker/singularity image.

stweil commented 2 years ago

@jbarth-ubhd, can we close this issue?

jbarth-ubhd commented 2 years ago

yes.