OCR-D / ocrd_anybaseocr

DFKI Layout Detection for OCR-D

tiseg results not usable #80

Open bertsky opened 3 years ago

bertsky commented 3 years ago

The way in which the trained pixel classifier for text-image segmentation is integrated here makes these predictions completely unusable:

[Screenshots: FILE_0001_BIN-WOLF-DESKEW-CROP-TISEGDEEPML_img (image part), FILE_0001_BIN-WOLF-DESKEW-CROP-TISEGDEEPML_txt (text part)]

The reason for this is actually quite simple: https://github.com/OCR-D/ocrd_anybaseocr/blob/e63f5555e4387e65fdd44469eadb51b09316aae6/ocrd_anybaseocr/cli/ocrd_anybaseocr_tiseg.py#L130-L137

Here, the predictions for the text (1) and image (2) classes compete with the background (0) class. Wherever the argmax favours background over both, all is lost. This would be somewhat expected and acceptable if this method had been trained as a binarization method (on suitable GT and with suitable augmentation). But apparently, it was not.
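A toy illustration of the failure mode (hypothetical numbers, not the processor's actual code): a pixel whose background score narrowly beats both text and image gets dropped from both masks.

```python
import numpy as np

# per-pixel softmax scores for (0=background, 1=text, 2=image)
scores = np.array([0.36, 0.34, 0.30])
label = np.argmax(scores)   # -> 0: the pixel ends up neither text nor image
```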

@mahmed1995 @mjenckel , am I correct in assuming you've used keras_segmentation for this, with 3 classes – 1 for text regions, 2 for image regions and 0 for background? What was the GT?

The obvious fix would be to just compare text vs image scores, and apply the result as an alpha mask on the original image. The result actually does look somewhat better.
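A minimal sketch of that fix, with assumed shapes and names (not the processor's actual code): ignore the background score entirely, decide text vs image per pixel, and attach the resulting text mask as an alpha channel on the raw page image.

```python
import numpy as np
from PIL import Image

def text_alpha_mask(scores, page_image):
    """scores: (H, W, 3) softmax output for (background, text, image);
    page_image: PIL.Image of the same size as the prediction."""
    # compare only text (1) vs image (2), regardless of background (0)
    text_wins = scores[..., 1] >= scores[..., 2]
    # text pixels stay opaque, image pixels become transparent
    alpha = Image.fromarray((text_wins * 255).astype(np.uint8), mode='L')
    result = page_image.convert('RGBA')
    result.putalpha(alpha)
    return result
```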

But does any consuming processor actually make use of the alpha channel? I highly doubt it.

Since the model was obviously trained on raw images, we have to apply it to raw images. But we can still take the binarized image (from a binarization step in the workflow) and apply our resulting mask to it – by filling the masked areas with white.
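A rough sketch of that interface, with assumed names: predict on the raw image, then clip the binarized image from the earlier workflow step by painting the predicted non-text regions white.

```python
import numpy as np
from PIL import Image

def clip_binarized(bin_image, image_region_mask):
    """bin_image: binarized PIL.Image ('1' or 'L');
    image_region_mask: boolean ndarray, True where non-text was predicted."""
    arr = np.array(bin_image.convert('L'))
    arr[image_region_mask] = 255   # fill non-text areas with white
    return Image.fromarray(arr, mode='L')
```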

That seems like the better OCR-D interface to me. (Of course, contour-finding and annotation via coordinates would still be better than as clipped derived image.) What do you think, @kba?

Also, I think it's not a good idea to just keep the best scoring pixels independently of each other. This leaves the results unnecessarily noisy and flickery, especially where confidence is already low. Smoothing via morphological post-processing (e.g. by closing the argmax results with a suitable kernel) or filtering (e.g. by a Gaussian filter on the scores) etc. should be applied. (Ideally, the model itself would get trained with a fc-CRF top layer, but that's out of scope here.) What's the "right way" to do this?
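Two of the mentioned smoothing strategies, sketched with assumed parameters (kernel size and sigma are placeholders, not tuned values): (a) Gaussian-filter the class scores before the argmax, or (b) morphologically close the hard mask afterwards.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, binary_closing

def smooth_scores(scores, sigma=2):
    # (a) filter each class map spatially (not across channels), then decide per pixel
    smoothed = gaussian_filter(scores, sigma=(sigma, sigma, 0))
    return np.argmax(smoothed, axis=-1)

def close_mask(mask, size=5):
    # (b) close small holes/speckle in a boolean mask with a square kernel
    return binary_closing(mask, structure=np.ones((size, size), dtype=bool))
```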

Considering that the above shown result is still unusable, I think we need to consider post-processing for the neural segmentation.

Lastly, regarding the legacy text-image segmentation that is also integrated here: this one does at least work reliably:

[Screenshots: FILE_0003_BIN-WOLF-DESKEW-CROP-TISEGMORPH_img (image part result), FILE_0003_BIN-WOLF-DESKEW-CROP-TISEGMORPH_txt (text part result)]

However, both of these approaches seem to look only for images, not for line-art separators at all. IMHO the latter task is much more needed (considering the tools already available in OCR-D right now).