OCR-D / ocrd-website

workflows.md, Step 7 #268

Open jbarth-ubhd opened 2 years ago

jbarth-ubhd commented 2 years ago
Examples:
* To segment existing regions into lines (and only lines) only: 
    `segmentation_level="line"`, `textequiv_level="line"`, `model=""`
* To segment existing regions into lines (and only lines) and recognize text:
    `segmentation_level="line"`, `textequiv_level="line"`, `model="Fraktur"`

I'm missing the word region in the parameters (regions→lines)

jbarth-ubhd commented 2 years ago

BTW only ocrd-tesserocr-segment and ocrd-tesserocr-segment-region are recommended within step 7 ... really? I do remember that ocrd-pc-segmentation's performance was the worst and ocrd-eynollah-segment the best (but slow)

— what about giving grades & est. processing times & memory requirements to processors?

jbarth-ubhd commented 2 years ago

PS: ocrd-tesserocr-segment* (recommended) are not in the »Best results for selected pages« workflow. (see below)

jbarth-ubhd commented 2 years ago

I would remove the sentence »Alternatively, consider using the all-in-one capabilities of ocrd-tesserocr-segment and ocrd-tesserocr-recognize, which can do region segmentation and line segmentation (and optionally also text recognition) in one step by querying Tesseract’s internal iterator (accessing the more precise polygon outlines instead of just coarse bounding boxes with lots of hard-to-recover overlap).«

ocrd-tesserocr-segment and ocrd-tesserocr-recognize are mentioned in the note above.

jbarth-ubhd commented 2 years ago

Best results for selected pages — workflow

bertsky commented 2 years ago

> I'm missing the word region in the parameters (regions→lines)

Why? You'd only need that for region segmentation (page→regions). The two paragraphs above the one you quoted clearly explain that.

> BTW only ocrd-tesserocr-segment and ocrd-tesserocr-segment-region are recommended within step 7 ... really? I do remember that ocrd-pc-segmentation's performance was the worst and ocrd-eynollah-segment the best (but slow)

I agree: this information does not reflect the new or changed processors from the last 2 years. (I believe ocrd-tesserocr-segment-region started out as the only recommendation, then ocrd-tesserocr-segment was added when it became available.) But I would not recommend the former anymore, and would rather recommend ocrd-eynollah-segment and ocrd-cis-ocropy-segment now.

See also #172

> what about giving grades & est. processing times & memory requirements to processors?

Grades are too simplistic for the diversity of materials (from simple single-column books to multi-column ornamented/illustrated pages and title pages) and problems (region types, region shape complexity, region recursion, reading order, line segmentation in warped/straight imaging, in dense/floating typesetting, in tables).

Processing times and memory requirements, too, may depend on the image resolution and content. But indeed, we should try to provide some guesstimate or experience.

See also https://github.com/OCR-D/ocrd_all/issues/112 and https://github.com/OCR-D/assets/issues/75 (and https://github.com/OCR-D/core/issues/607)

> I would remove the sentence »Alternatively, consider using the all-in-one capabilities of ocrd-tesserocr-segment and ocrd-tesserocr-recognize, which can do region segmentation and line segmentation (and optionally also text recognition) in one step by querying Tesseract’s internal iterator (accessing the more precise polygon outlines instead of just coarse bounding boxes with lots of hard-to-recover overlap).«
>
> ocrd-tesserocr-segment and ocrd-tesserocr-recognize are mentioned in the note above.

That sentence is part of the paragraph which explains the need for postprocessing when not using all-in-one segmentation or shrink_polygons with Tesseract – so it is necessary there. (No one without minute knowledge of Tesseract internals would understand that dependency.)

Best results for selected pages — workflow

* cis-ocropy-binarize is _not_ recommended(?)

* skimage-binarize is _not_ recommended(?)

* tesserocr-deskew is _not_ recommended(?)

* cis-ocropy-segment is _not_ recommended(?)


jbarth-ubhd commented 2 years ago

> Why? You'd only need that for region segmentation (page→regions). The two paragraphs above the one you quoted clearly explain that.

* `segmentation_level` determines the *highest level* to segment. 
   Use `"none"` to disable segmentation altogether, i.e. only recognize existing segments.
* `textequiv_level` determines the *lowest level* to segment. 
   Use `"none"` to segment until the lowest level (`"glyph"`) and disable recognition altogether, only analyse layout.

highest level = something like region and lowest level = something like glyph?

and to segment = to be segmented or to be the result of segmentation?

and `"none"` to segment ... *disable recognition altogether*: recognition of layout or recognition of text? And why *only analyse layout*? This step is about region segmentation.

Sorry, I'm confused.

bertsky commented 2 years ago

> highest level = something like region and lowest level = something like glyph?

yes

> and to segment = to be segmented or to be the result of segmentation?

the latter

> and `"none"` to segment ... *disable recognition altogether*: recognition of layout or recognition of text?

in this paragraph (as in all of our documentation), recognition contrasts with segmentation (and preprocessing and postprocessing), so the latter

> And why only analyse layout? This step is about region segmentation.

because this paragraph describes a multi-step processor that can include (text) recognition
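
To make the reading agreed on above concrete, here is a minimal Python sketch of how the two parameters could be interpreted. It is not code from ocrd_tesserocr: the function `plan`, the simplified four-level hierarchy (which leaves out any intermediate levels such as table cells), and the rule that an empty `model` switches off text recognition are assumptions derived only from the quoted documentation and its two examples.

```python
# Simplified hierarchy from highest to lowest level.
# Assumption: only these four levels; the real processor may know more.
LEVELS = ["region", "line", "word", "glyph"]


def plan(segmentation_level, textequiv_level, model=""):
    """Hypothetical helper: which levels get produced, and is text recognized?

    segmentation_level: highest level that segmentation should produce,
                        or "none" to keep the existing segments.
    textequiv_level:    lowest level that segmentation should produce,
                        or "none" to go down to glyphs without recognition.
    model:              recognition model name; empty means no recognition
                        (assumption based on the two documented examples).
    """
    # "none" as the lowest level means: segment all the way down.
    lowest = "glyph" if textequiv_level == "none" else textequiv_level

    if segmentation_level == "none":
        # Segmentation disabled: only existing segments are used.
        to_segment = []
    else:
        start = LEVELS.index(segmentation_level)
        stop = LEVELS.index(lowest)
        # Every level between highest and lowest (inclusive) is a result.
        to_segment = LEVELS[start:stop + 1]

    # Text recognition needs a model and a textequiv_level other than "none".
    recognize = bool(model) and textequiv_level != "none"
    return to_segment, recognize


# The two examples quoted from workflows.md:
print(plan("line", "line", ""))          # lines only, no text
print(plan("line", "line", "Fraktur"))   # lines only, with text
```

Under this reading, the word "region" only shows up in the parameters when regions are to be *produced* (page→regions), e.g. `plan("region", "line", "Fraktur")`; for regions→lines the existing regions are the input, so both parameters name `"line"`.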