OCR-D / ocrd_segment

OCR-D-compliant page segmentation
MIT License

plausibilize and sanitize are too broad terms #18

Open mikegerber opened 4 years ago

mikegerber commented 4 years ago

ocrd-segment-repair has the optional operations "plausibilize" and "sanitize" – I have no idea what this exactly does :) I would prefer something like this:

* shrink-regions-to-hull-of-lines
* whatever-plausibilize-does

There seems to also be another thing ocrd-segment-repair does.

In other words: Make operations explicit.

bertsky commented 4 years ago

> ocrd-segment-repair has the optional operations "plausibilize" and "sanitize" – I have no idea what this exactly does :)

I agree, these are not expressive enough, or even memorable (which is what...)

> I would prefer something like this:
>
> * shrink-regions-to-hull-of-lines

...or just shrink-regions?

> * whatever-plausibilize-does

ATM all it does is remove regions fully contained by others or nearly equal to them (and fix the ReadingOrder afterwards).
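
To make the current behaviour concrete, here is a minimal sketch of that idea, not the actual ocrd_segment code. It assumes regions are simplified to axis-aligned boxes (the real processor works on PAGE-XML polygons), and the names `remove_redundant` and `min_covered` as well as the 0.95 threshold are hypothetical choices for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical simplification: a region is an axis-aligned box (x0, y0, x1, y1).
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection(a: Box, b: Box) -> float:
    """Area shared by two boxes."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))


def remove_redundant(regions: Dict[str, Box],
                     reading_order: List[str],
                     min_covered: float = 0.95):
    """Drop regions (almost) fully covered by a kept region of at least equal size,
    then remove the dropped IDs from the reading order."""
    kept: List[str] = []
    # Visit larger regions first so that, of two nearly equal regions,
    # exactly one survives (the sort is stable, so the earlier one wins ties).
    for rid in sorted(regions, key=lambda r: area(regions[r]), reverse=True):
        box = regions[rid]
        covered = any(
            area(box) > 0
            and intersection(box, regions[k]) / area(box) >= min_covered
            for k in kept
        )
        if not covered:
            kept.append(rid)
    kept_ids = set(kept)
    cleaned = {rid: box for rid, box in regions.items() if rid in kept_ids}
    new_order = [rid for rid in reading_order if rid in kept_ids]
    return cleaned, new_order


regions = {
    "r1": (0, 0, 100, 100),
    "r2": (10, 10, 50, 50),    # fully contained in r1
    "r3": (0, 0, 100, 99),     # nearly equal to r1
    "r4": (200, 0, 300, 100),  # disjoint from the others
}
cleaned, order = remove_redundant(regions, ["r2", "r1", "r3", "r4"])
print(sorted(cleaned), order)
```

In this toy input, r2 and r3 are dropped and the reading order shrinks to the two surviving regions.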

It's intended to become much more though, like merging or shrinking overlapping neighbouring regions, or fixing reading order via basic heuristics (e.g. no arbitrary jumps back and forth).

Since this processor started out under the name repair but received a default behaviour of just warning about likely errors, we needed some verb for the actual action.

Maybe separate-neighbours?

@wrznr?

wrznr commented 4 years ago

Right, they have very general names since they are intended to do various things. Right now, they do not do very much and are not ready for production use or even testing. I would rather keep the current names and see what the processors become. Let us discuss a proper name when implementation and documentation are finished. (ocrd_segment will be my main focus in December.)

mikegerber commented 4 years ago

Related: qurator-spk/ocrd_repair_inconsistencies#2

mikegerber commented 3 years ago

Documentation from https://ocr-d.de/en/workflows:

bertsky commented 3 years ago

> Documentation from https://ocr-d.de/en/workflows:
>
>   • plausibilize = Remove redundant (almost equal or almost contained) regions, and merge overlapping regions
>   • sanitize = Shrink and/or expand a region in such a way that its coordinates include those of all its lines
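
As a rough illustration of what "sanitize" means in that description, the sketch below replaces a region outline with the bounding rectangle of all its line coordinates. This is an assumption-laden simplification: the function name `sanitize_region` is hypothetical, and a bounding rectangle stands in for whatever hull the processor actually computes.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]


def sanitize_region(region: Sequence[Point],
                    lines: Sequence[Sequence[Point]]) -> List[Point]:
    """Return a new region outline that encloses all points of all lines.

    Simplification (assumption): the result is the bounding rectangle of the
    line points, which may shrink the region on some sides and expand it on
    others. A region without lines is returned unchanged.
    """
    points = [pt for line in lines for pt in line]
    if not points:
        return list(region)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    # Rectangle corners in clockwise order, starting at the top-left.
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]


region = [(0, 0), (100, 0), (100, 100), (0, 100)]
lines = [
    [(10, 10), (90, 10), (90, 20), (10, 20)],
    [(10, 30), (110, 30), (110, 40), (10, 40)],  # sticks out on the right
]
print(sanitize_region(region, lines))
```

Here the region shrinks on the left, top and bottom, and grows on the right to take in the second line.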

This is actually from the ocrd-tool.json description of these parameters; see ocrd-segment-repair -h.