Open jbarth-ubhd opened 2 years ago
BTW only ocrd-tesserocr-segment and ocrd-tesserocr-segment-region are recommended within step 7 ... really? I do remember that ocrd-pc-segmentation's performance was the worst and ocrd-eynollah-segment the best (but slow)
— what about giving grades & est. processing times & memory requirements to processors?
PS: ocrd-tesserocr-segment* (recommended) are not in the »Best results for selected pages« workflow. (see below)
I would remove the sentence »Alternatively, consider using the all-in-one capabilities of ocrd-tesserocr-segment and ocrd-tesserocr-recognize, which can do region segmentation and line segmentation (and optionally also text recognition) in one step by querying Tesseract’s internal iterator (accessing the more precise polygon outlines instead of just coarse bounding boxes with lots of hard-to-recover overlap).«
— ocrd-tesserocr-segment and ocrd-tesserocr-recognize are mentioned in the note above.
I'm missing the word `region` in the parameters (regions→lines)
Why? You'd only need that for region segmentation (page→regions). The two paragraphs above the one you quoted clearly explain that.
BTW only ocrd-tesserocr-segment and ocrd-tesserocr-segment-region are recommended within step 7 ... really? I do remember that ocrd-pc-segmentation's performance was the worst and ocrd-eynollah-segment the best (but slow)
I agree – this information does not reflect the new or changed processors from the last 2 years. (I believe `ocrd-tesserocr-segment-region` started out as the only recommendation, then `ocrd-tesserocr-segment` was added when it became available.) But I would not recommend the former anymore, and rather recommend `ocrd-eynollah-segment` and `ocrd-cis-ocropy-segment` now.
See also #172
— what about giving grades & est. processing times & memory requirements to processors?
Grades are too simplistic for the diversity of materials (from simple single-column books to multi-column ornamented/illustrated pages and title pages) and problems (region types, region shape complexity, region recursion, reading order, line segmentation in warped/straight imaging, in dense/floating typesetting, in tables).
Processing times and memory requirements, too, may depend on the image resolution and content. But indeed, we should try to provide some guesstimate or experience.
See also https://github.com/OCR-D/ocrd_all/issues/112 and https://github.com/OCR-D/assets/issues/75 (and https://github.com/OCR-D/core/issues/607)
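One way to collect such experience values is to time a single processor call and read its peak memory from the operating system. The following is only a minimal sketch, assuming a Unix system; the helper name `measure` is hypothetical, and the command shown is a harmless placeholder rather than a real OCR-D invocation.

```python
import resource
import subprocess
import sys
import time


def measure(command):
    """Run a command and report wall time and peak child memory (Unix only)."""
    start = time.perf_counter()
    completed = subprocess.run(command, check=False)
    wall_seconds = time.perf_counter() - start
    # ru_maxrss is the largest peak RSS among all child processes waited for
    # so far (kilobytes on Linux, bytes on macOS), so profile one command per run.
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return {
        "returncode": completed.returncode,
        "wall_seconds": wall_seconds,
        "peak_rss": peak_rss,
    }


if __name__ == "__main__":
    # Placeholder command; substitute the processor call you want to profile.
    print(measure([sys.executable, "-c", "pass"]))
```

As noted above, the numbers depend on image resolution and content, so they are only meaningful together with a description of the pages they were measured on.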
I would remove the sentence »Alternatively, consider using the all-in-one capabilities of ocrd-tesserocr-segment and ocrd-tesserocr-recognize, which can do region segmentation and line segmentation (and optionally also text recognition) in one step by querying Tesseract’s internal iterator (accessing the more precise polygon outlines instead of just coarse bounding boxes with lots of hard-to-recover overlap).«
— ocrd-tesserocr-segment and ocrd-tesserocr-recognize are mentioned in the note above.
That sentence is part of the paragraph which explains the need for postprocessing when not using all-in-one segmentation or shrink_polygons with Tesseract – so it is necessary there. (No one without minute knowledge of Tesseract internals would understand that dependency.)
Best results for selected pages — workflow
* cis-ocropy-binarize is _not_ recommended(?)
* skimage-binarize is _not_ recommended(?)
* tesserocr-deskew is _not_ recommended(?)
* cis-ocropy-segment is _not_ recommended(?)
* `segmentation_level` determines the *highest level* to segment.
Use `"none"` to disable segmentation altogether, i.e. only recognize existing segments.
* `textequiv_level` determines the *lowest level* to segment.
Use `"none"` to segment until the lowest level (`"glyph"`) and disable recognition altogether, only analyse layout.
»highest level« = something like region, and »lowest level« = something like glyph?
And »to segment« = to be segmented, or to be the result of segmentation?
And »none to segment ... disable recognition altogether«: recognition of layout or recognition of text? And why »only analyse layout«? This step is about Region segmentation.
Sorry, I'm confused.
»highest level = something like region and lowest level = something like glyph?«
Yes.
»and to segment = to be segmented or to be the result of segmentation?«
The latter.
»and none to segment ... disable recognition altogether: recognition of layout or recognition of text?«
In this paragraph (as in all of our documentation), recognition contrasts with segmentation (and preprocessing and postprocessing), so the latter.
»And why only analyse layout? This step is about Region segmentation«
Because this paragraph describes a multi-step processor that can include (text) recognition.
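To make the reading of the two parameters concrete, here is a minimal sketch of the semantics as described in this thread. It is not the processor's implementation: the function name `plan` is hypothetical, and the level list (in particular `"word"` as an intermediate level) is an assumption for illustration.

```python
# Hierarchy levels from highest (coarsest) to lowest (finest).
# Assumption for illustration; the real processor may know further levels.
LEVELS = ["region", "line", "word", "glyph"]


def plan(segmentation_level, textequiv_level):
    """Return (levels that get segmented, whether text is recognized)."""
    # textequiv_level "none": no text recognition, only layout analysis.
    recognize = textequiv_level != "none"
    # segmentation_level "none": keep the existing segments untouched.
    if segmentation_level == "none":
        return [], recognize
    highest = LEVELS.index(segmentation_level)
    # textequiv_level "none" means segmenting down to the lowest level.
    lowest = LEVELS.index("glyph" if textequiv_level == "none" else textequiv_level)
    # An inverted range (highest below lowest) yields an empty list here.
    return LEVELS[highest:lowest + 1], recognize


if __name__ == "__main__":
    print(plan("region", "line"))  # segment regions and lines, then recognize
    print(plan("none", "line"))    # only recognize existing segments
    print(plan("region", "none"))  # only analyse layout, down to glyphs
```

Read this way, `segmentation_level` and `textequiv_level` name the upper and lower end of the hierarchy that the processor produces, and recognition is a separate switch that happens to share the `textequiv_level` parameter.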