OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

cropping vs. cutting vs. segmenting #289

Closed kba closed 2 years ago

kba commented 5 years ago

In the docstrings, cropping currently refers to tasks that could be better described as segmenting (finding regions) or cutting (doing the actual image manipulation).

This came up in #268 but finding the right terminology should not prevent a merge.

We should also extend the glossary.

Here are the pertinent comments on the terms:

@bertsky:

@wrznr is right about insisting that the term cropping (de: Beschneidung) should only apply to the process of finding the Border (and perhaps also removing the margins from the image by cutting), not of other elements down the hierarchy. This we should rather call cutting (de: Freistellen) – it is only due to PIL.Image.crop that I was led astray. If this is correct, then the docstrings must be fixed accordingly throughout.

That would be more consequential, but I tend to say no: crop_image is meant as replacement for Image.crop and should be memorable. I am becoming less enthusiastic about this terminological distinction by the minute... maybe this should be reverted (sorry).

@wrznr:

Wrt. cropping vs. cutting (vs. segmenting?): Using the term cropping for localizing a page's border was a bad choice right from the start because it mixes the intellectual process of finding the borders and the physical process of separating the OCR-relevant from the irrelevant parts of the actual image. Using cutting does not improve things IMHO. The more I think about it, the more meaningful the use of the term (page-level) segmentation seems to me because this is what cropping right now does: It localizes the segment page on an image file. We could then use cropping as it is intended.

bertsky commented 5 years ago

This sounds very convincing to me. Except for one problem: (correct me if I am wrong, but) page segmentation usually refers to finding regions, not the border. It would make more sense to call that region segmentation, just as line segmentation creates lines, (so page segmentation would indeed be free for what we used to call cropping), but I never heard that.

wrznr commented 5 years ago

That's actually what we (@cneud and @kba and me) agreed on: To prefix segmentation with the result and not with the level of operation (i.e. segment image into X). You are absolutely right that page segmentation usually refers to segmentation of the page. But I prefer principle and sound solutions over traditions. 😁

bertsky commented 5 years ago

It is definitely a stumbling point for newcomers and users, but I am skeptical whether researchers can be convinced easily to adopt that change in terminology. (At the least, page segmentation would have to be disambiguated verbosely for a while.)

Another established term is page frame detection. This already distinguishes itself from the physical operation (of cropping / cutting). So it might be a compromise (and smaller deviation from tradition) to use cropping only as an image operation (not a workflow step) in OCR-D, and consistently use page frame detection for the process of finding Border. As an extra, one could also refrain from using page segmentation and (provocatively but unambiguously) use region segmentation instead.
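To illustrate the distinction drawn above, here is a minimal sketch that keeps page frame detection (finding coordinates) apart from cropping (the image operation). It uses plain nested lists instead of real images. The names `detect_page_frame` and `crop_to_box` are hypothetical and not part of OCR-D or PIL. Only the box convention (left, upper, right, lower) is borrowed from `PIL.Image.crop`, and the "enclose all ink" rule is an assumption made for the toy example.

```python
from typing import List, Tuple

Image = List[List[int]]          # rows of pixels: 0 = background, 1 = ink
Box = Tuple[int, int, int, int]  # (left, upper, right, lower), right/lower exclusive


def detect_page_frame(image: Image) -> Box:
    """Page frame detection: compute the enclosing box; the image stays untouched."""
    rows = [y for y, row in enumerate(image) if any(row)]
    cols = [x for x in range(len(image[0])) if any(row[x] for row in image)]
    if not rows or not cols:
        # No ink found: fall back to the full image (an assumption of this sketch).
        return (0, 0, len(image[0]), len(image))
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def crop_to_box(image: Image, box: Box) -> Image:
    """Cropping: the pure image operation, returning a new image limited to the box."""
    left, upper, right, lower = box
    return [row[left:right] for row in image[upper:lower]]


scan = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
frame = detect_page_frame(scan)   # coordinates only, e.g. what a Border would record
page = crop_to_box(scan, frame)   # the actual pixel manipulation
```

In this sketch the detection result can be stored as an annotation without ever cutting the image, which is the separation the proposed wording is meant to express.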

wrznr commented 5 years ago

It is a pity that the PAGE element is called Border. Maybe we should go with border_detection on the operation levels page, region and line.

bertsky commented 5 years ago

> It is a pity that the PAGE element is called Border. Maybe we should go with border_detection on the operation levels page, region and line.

You mean instead of segmentation?

> To prefix segmentation with the result and not with the level of operation (i.e. segment image into X).

But that (new) principle could still not be applied for page segmentation (in the new sense): Border detection does not actually segment the source image. So even with region segmentation established, I do not see a place for page segmentation, except in a broader sense covering all levels of segmentation.

wrznr commented 5 years ago

Yeah! That's why I propose a completely new wording:

ocrd_tesserocr_detect_border -I ORIGINAL -O CROPPED -m mets.xml -p <(echo '{"operation_level": "page"}')
ocrd_tesserocr_detect_border -I CROPPED -O SEGMENT_REGION -m mets.xml -p <(echo '{"operation_level": "region"}')
ocrd_tesserocr_detect_border -I SEGMENT_REGION -O SEGMENT_LINE -m mets.xml -p <(echo '{"operation_level": "line"}')

I.e. foregoing the new principle.

bertsky commented 5 years ago

I see. But the last 2 steps (region and line segmentation) do not actually detect any borders (i.e. outer limits) of regions and lines, they rather define those very regions and lines. IMHO we have no good reason to drop the term segmentation itself at this point.
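A toy sketch may help show why segmentation differs from border detection: instead of returning one outer limit, it defines several new units. The function `segment_lines` below is hypothetical. It splits a region into lines wherever a fully blank pixel row occurs, which is a deliberately naive rule assumed for illustration and not how ocrd_tesserocr works.

```python
from typing import List, Tuple

Region = List[List[int]]  # rows of pixels: 0 = background, 1 = ink


def segment_lines(region: Region) -> List[Tuple[int, int]]:
    """Return one (top, bottom) row span per line, with bottom exclusive.

    A line is assumed to be a maximal run of consecutive rows containing ink.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for y, row in enumerate(region):
        inked = any(row)
        if inked and start is None:
            start = y                      # a new line begins
        elif not inked and start is not None:
            spans.append((start, y))       # a blank row ends the current line
            start = None
    if start is not None:
        spans.append((start, len(region)))  # line runs to the region's last row
    return spans


region = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
    [0, 1, 1],
]
line_spans = segment_lines(region)
```

Here the result is a list of newly defined lines inside the region, not the border of something that already existed.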

Also, we should probably not concern ourselves much with the names of components or processors here – as these need to accommodate other considerations (like using imperative verb forms instead of abstract nouns, e.g. recognize for OCR, correct for OCR post-correction, rate for LM rescoring, or being true to the implementation rather than the general operation they offer) – as much as with the terms we use to describe the workflow steps in our documentation.

That being said, I don't find the existing naming scheme of ocrd_tesserocr all that bad – although I wouldn't mind a slight change like so:

ocrd-tesserocr-crop-page -I OCR-D-IMG -O OCR-D-SEG-PAGE
ocrd-tesserocr-segment-regions -I OCR-D-SEG-PAGE -O OCR-D-SEG-BLOCK
ocrd-tesserocr-segment-lines -I OCR-D-SEG-BLOCK -O OCR-D-SEG-LINE

kba commented 3 years ago

@bertsky Is there still something to do from this discussion?

bertsky commented 3 years ago

> Is there still something to do from this discussion?

Hard to summarise, even harder to reach an agreement at this point.

We have:

We need to accommodate:

I'm afraid we cannot re-invent the wheel here, or just ignore existing terminology in the academic literature or in the field.

I suggest sticking to page frame detection when necessary to disambiguate over cropping, trying to avoid cropping as a general image operation, keeping the idiomatic page segmentation as a segmentation of pages into regions and line segmentation as a segmentation of regions into lines, but disambiguating further when necessary, and documenting all this in the glossary and specs.