cisocrgroup / ocrd_cis

OCR-D python tools
MIT License

Column segmentation failure #59

Closed jbarth-ubhd closed 4 years ago

jbarth-ubhd commented 4 years ago

workflow:

ocrd process \
"olena-binarize -I OCR-D-IMG -O OCR-D-N1 -P impl wolf" \
"anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2" \
"olena-binarize -I OCR-D-N2 -O OCR-D-N3 -P impl wolf" \
"cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -P level-of-operation page" \
"cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -P level-of-operation page" \
"cis-ocropy-segment -I OCR-D-N5 -O OCR-D-N6 -P level-of-operation page" \
"cis-ocropy-deskew -I OCR-D-N6 -O OCR-D-N7 -P level-of-operation region" \
"cis-ocropy-clip -I OCR-D-N7 -O OCR-D-N8 -P level-of-operation region" \
"cis-ocropy-dewarp -I OCR-D-N8 -O OCR-D-N9" \
"calamari-recognize -I OCR-D-N9 -O OCR-D-OCR -P checkpoint /usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_19th_century/*.ckpt.json"

TIFFs: 0001 0017 0042

bertsky commented 4 years ago

Thanks @jbarth-ubhd for the detailed report!

I'll address these separately:

0001

0001 is not segmented in 3 columns, but vertical separator line is without any breaks after wolf binarization.

Very interesting. Ocropy detects the v-lines flawlessly, but nevertheless does not use them to split the region. (Part of the problem here is deskewing, which does not work very well on multi-column layouts, because Ocropy's deskewing is based on projection profile entropy. But even with Tesseract for deskewing, which uses a Hough line transform, the left separator does not split.)

This can be helped a little by reducing gap_width (as the text is unusually close to the v-line). But the left separator never works, and that's because of an actual bug: see the adjacent words Wenn der / brechen at the very bottom of the separator? Ocropy line segmentation (which needs to run with all foreground separators suppressed) considers these to belong to the same line seed, because they are so close. The v-line separator is still too short to cut through it. This in turn makes them share a line label, which prevents the recursive XY-cut segmentation from splitting at that line (because the split would remove significant parts of the line on either side).
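To make the failure mode above concrete, here is a minimal sketch of why one shared line label is enough to veto a vertical cut. It is not Ocropy's code: the function name `blocking_labels`, the `min_pixels` threshold and the toy label grid are all hypothetical, and the real XY-cut weighs how much of a line would be lost rather than just counting pixels.

```python
def blocking_labels(labels, cut_x, min_pixels=1):
    """Return the line labels that have pixels on both sides of column cut_x.

    labels: 2D list of ints, 0 = background, >0 = text line label.
    A vertical cut at cut_x is treated as inadmissible if this set is non-empty.
    (Hypothetical simplification of the behaviour described in the thread.)
    """
    left, right = {}, {}
    for row in labels:
        for x, label in enumerate(row):
            if label == 0 or x == cut_x:
                continue
            side = left if x < cut_x else right
            side[label] = side.get(label, 0) + 1
    return {
        label
        for label, count in left.items()
        if count >= min_pixels and right.get(label, 0) >= min_pixels
    }


# Toy page: two columns separated by a gap at x == 2.
labels = [
    [1, 1, 0, 2, 2],  # separate line labels left and right
    [3, 3, 0, 4, 4],
    [5, 5, 0, 5, 5],  # words so close that they share one line seed
]
blocked = blocking_labels(labels, cut_x=2)
print(blocked)  # the shared label on the last row blocks the cut
```

In this toy model, giving the right-hand word on the last row its own label is all it takes to make the cut admissible, which is what a separator long enough to reach that row would achieve.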

I don't have too many terribly good ideas how to fix this in general. I did try to dilate the separator lines just a little in their overall direction. When combined with gap_width=0.7 (instead of default 1.5), this does fix the page. But I am not sure whether to keep that change:

0001-bfa1867_-_0316_ocropy_dilate-seps_gap-width_0 7
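For readers unfamiliar with directional dilation, the idea of lengthening a separator "in its overall direction" can be sketched as follows. This is only an illustration under stated assumptions, not the change that was actually tried: it handles the vertical case only, works on a plain nested-list mask, and the name `dilate_vertically` and the amount `n` are made up for the example.

```python
def dilate_vertically(mask, n):
    """Extend every foreground pixel of a binary mask by n pixels up and down.

    mask: 2D list of 0/1 values marking separator pixels.
    Equivalent to a morphological dilation with a (2n+1) x 1 vertical element,
    so a v-line grows in length but not in width.
    """
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                for yy in range(max(0, y - n), min(height, y + n + 1)):
                    out[yy][x] = 1
    return out


# A vertical separator in column 1 that stops short of the last two rows.
separator = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
]
longer = dilate_vertically(separator, 1)
for row in longer:
    print(row)
```

Because the structuring element is one pixel wide, neighbouring text columns are not eaten into; the separator merely reaches one row further at each end, which is the property needed to cut through a line seed sitting just below it.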

0017

0017 does not segment correctly when binarized, but original (color)

You probably mean Tesseract's page segmentation (which fails completely), not Ocropy's (which looks perfect to me), right?

0042

0042 does not segment correctly.

Again, you probably mean Tesseract (failing completely), not this module. Ocropy does make some errors here though: 3 of the 7 column segments get split half-way between the index terms and the page numbers. As you have pointed out correctly, this is not caused by binarization eating up the ellipsis dots. But still, as I have now discovered, these do get ignored by Ocropy, because they are too small compared to the average glyph size. (Ocropy calculates the median of the square root of the bbox area over all connected components, termed scale, and suppresses all components whose square root of bbox area is less than 0.5 or more than 4 times the scale.) Neglecting small components usually helps segmentation be robust against noise from binarization (suppressing that noise more aggressively during binarization itself would harm recognition later). So I don't know how to solve this right now:

0042-meggendorfer_hb24_-_000_c_ocropy_boxmap_scale_filter
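The scale filter described above can be sketched in a few lines. This is a simplified stand-in rather than Ocropy's implementation: it works on a plain list of hypothetical (width, height) bounding boxes instead of a labelled image, and the name `scale_filter` and the sample box sizes are invented for illustration. Only the median rule and the 0.5 / 4 factors come from the description in this thread.

```python
from math import sqrt
from statistics import median


def scale_filter(boxes, min_factor=0.5, max_factor=4.0):
    """Split component bounding boxes into kept and suppressed ones.

    boxes: list of (width, height) tuples, one per connected component.
    scale is the median of sqrt(width * height) over all boxes; a box is
    suppressed if its own sqrt(area) is below min_factor * scale or above
    max_factor * scale.
    """
    sizes = [sqrt(w * h) for w, h in boxes]
    scale = median(sizes)
    kept, suppressed = [], []
    for box, size in zip(boxes, sizes):
        if min_factor * scale <= size <= max_factor * scale:
            kept.append(box)
        else:
            suppressed.append(box)
    return scale, kept, suppressed


# Made-up index page: ordinary glyphs, tiny leader dots, one long separator.
glyphs = [(20, 30)] * 10
dots = [(4, 4)] * 3
separator = [(10, 1000)]
scale, kept, suppressed = scale_filter(glyphs + dots + separator)
print(round(scale, 2), len(kept), len(suppressed))
```

With these numbers the dots fall well below half the scale, so they vanish from the segmenter's view just like binarization noise would, leaving nothing to connect the index terms with the page numbers.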

bertsky commented 4 years ago

I don't have too many terribly good ideas how to fix this in general. I did try to dilate the separator lines just a little in their overall direction. When combined with gap_width=0.7 (instead of default 1.5), this does fix the page. But I am not sure whether to keep that change:

@jbarth-ubhd could you please pull #61 and run it on your benchmark to see if this degrades segmentation elsewhere?

jbarth-ubhd commented 4 years ago

0017 + 0042: yes, tesseract, sorry.

bertsky commented 4 years ago

0017 + 0042: yes, tesseract, sorry.

Strictly speaking, this does not belong here, but briefly: Tesseract has problems with dots, too. But I cannot reproduce what you see (i.e. nearly nothing on the index page). For me, Tesseract also struggles, but not nearly as badly. It also depends a lot on whether you allow table detection and whether you pass the raw or binary image to Tesseract: