cropper screws up royally

bertsky commented 3 years ago

Under certain conditions, ocrd-anybaseocr-crop selects only noise fragments of a page. This seems to be related to its built-in ad-hoc textline detection filter, which rests on a number of assumptions (e.g. the fixed kernel sizes imply that a certain pixel resolution and/or font size is expected).

Here's an example:

raw image:
binarized (sauvola-ms-split) image:
cropped image:

Looking at the debug images being created along the process (after repairing the array conversion dynamic range), it seems that the only criterion for text areas is a morphological closing with a fixed kernel size, which captures the texture of the background around the page: OCR-D-CROP_kalo_anon-IMG debug textarea_closed
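To illustrate why a closing with a fixed kernel can swallow background texture, here is a minimal 1-D sketch in pure Python. It is not the code from ocrd-anybaseocr-crop; the helper names (`dilate`, `erode`, `close`, `runs`) and the toy speckle pattern are made up for illustration. The point is only that speckles spaced more closely than the kernel width fuse into one large component, while a smaller kernel leaves them separate.

```python
def dilate(row, k):
    """Binary dilation of a 1-D row with a flat kernel of odd width k."""
    r, n = k // 2, len(row)
    return [int(any(row[max(0, i - r):min(n, i + r + 1)])) for i in range(n)]


def erode(row, k):
    """Binary erosion; the window is clipped at the row borders."""
    r, n = k // 2, len(row)
    return [int(all(row[max(0, i - r):min(n, i + r + 1)])) for i in range(n)]


def close(row, k):
    """Morphological closing: dilation followed by erosion."""
    return erode(dilate(row, k), k)


def runs(row):
    """Lengths of the connected foreground runs in a 1-D row."""
    lengths, current = [], 0
    for value in row:
        if value:
            current += 1
        elif current:
            lengths.append(current)
            current = 0
    if current:
        lengths.append(current)
    return lengths


# Toy "background texture": one foreground pixel every 4th position.
speckle = [1 if i % 4 == 0 else 0 for i in range(20)]

separate = runs(close(speckle, 3))  # kernel narrower than the gaps
merged = runs(close(speckle, 5))    # kernel wider than the gaps
```

With the narrower kernel the five speckles stay isolated; with the wider one they merge into a single long run, which is the same effect that turns paper texture into a "text area" in the debug image.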

(I wonder why no better method of textline detection was used...)

Then, in turn, the text boxes/columns of course look bad: OCR-D-CROP_kalo_anon-IMG debug textarea_boxes

In the end, the default minArea parameter of 0.05 removes all but the largest "column": OCR-D-CROP_kalo_anon-IMG debug textarea_boxes_areafiltered
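For reference, a relative area filter of this kind can be sketched as follows. This assumes that minArea is a fraction of the total page area and that boxes are `(x0, y0, x1, y1)` tuples; the function name `filter_by_area` and the example boxes are hypothetical and not taken from the processor's code.

```python
def filter_by_area(boxes, page_width, page_height, min_area=0.05):
    """Keep only boxes covering at least `min_area` of the page area.

    Assumption: min_area is relative to the full page (width * height).
    """
    page_area = page_width * page_height
    kept = []
    for x0, y0, x1, y1 in boxes:
        box_area = (x1 - x0) * (y1 - y0)
        if box_area / page_area >= min_area:
            kept.append((x0, y0, x1, y1))
    return kept


# Three made-up candidate boxes on a 2000 x 3000 pixel page.
boxes = [(0, 0, 100, 2000), (150, 0, 400, 3000), (500, 100, 560, 900)]
kept = filter_by_area(boxes, page_width=2000, page_height=3000)
```

If the candidate boxes are already wrong (as in the debug image above), such a filter cannot recover: it just keeps whichever wrong box happens to be largest.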

This sheds a very bad light on that part of the algorithm.

But there's also the fallback mechanism: border rectangle estimation based on pylsd edge detection. Since that relies on subpixel edge detection, it only works at all when running on the raw RGB image (instead of the binarized one). On the raw image, however, it works perfectly well:

First, the morphological closing again, this time of the raw image: OCR-D-CROP-RAW_kalo_anon-IMG debug_textarea_closed

This will find us no large enough text components to work with!

Now the edge detection comes into play: OCR-D-CROP-RAW_kalo_anon-IMG debug_lsd_edges

And from that it's comparatively easy to pick out the largest plausible intersecting horizontal and vertical lines: OCR-D-CROP-RAW_kalo_anon-IMG debug_border_rect

Final cropped raw image:

kba commented 3 years ago

Thanks for investigating. Does this mean that the cropper should only be used on RGB images so the pylsd edge detection is executed?

bertsky commented 3 years ago

> Does this mean that the cropper should only be used on RGB images so the pylsd edge detection is executed?

Yes, but in the current implementation with feature_selector=binarized you'd have to fake the binarization in the input. So really the code (including the OCR-D part) needs to be changed to take the raw image (for the pylsd part). However, for the textarea detector part, currently it uses an ad-hoc Otsu, which is clearly not a good choice. So at that point, one would need to take the binarized image from OCR-D (and probably modify the kernel size based on input image DPI with the usual zoom heuristic).
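A rough sketch of what scaling the kernel size by DPI could look like. The 300 DPI reference, the function name `scaled_kernel_size` and the base size of 15 are assumptions for illustration, not values from the current code; forcing the result to be odd is a convenience so the structuring element has a center pixel.

```python
def scaled_kernel_size(base_size, dpi, reference_dpi=300):
    """Scale a kernel size tuned for `reference_dpi` to the image's DPI.

    Assumptions: `base_size` was chosen for 300 DPI scans, and a missing
    or non-positive DPI means "unknown", in which case the base size is
    returned unchanged.
    """
    if not dpi or dpi <= 0:
        return base_size
    zoom = dpi / reference_dpi
    size = max(1, round(base_size * zoom))
    # Keep the size odd so the kernel has a well-defined center.
    return size if size % 2 == 1 else size + 1


kernel_300 = scaled_kernel_size(15, dpi=300)
kernel_600 = scaled_kernel_size(15, dpi=600)
kernel_unknown = scaled_kernel_size(15, dpi=None)
```

The same zoom factor would then apply to every pixel-based constant in the textarea detector, so that a 600 DPI scan is treated like its 300 DPI equivalent.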

OCR-D / ocrd_anybaseocr #83