Closed jbarth-ubhd closed 4 years ago
@jbarth-ubhd first, please allow me to comment on your choice of workflow:
"anybaseocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW"
I recommend avoiding deskewing from ocrd_anybaseocr. It's just a rebrand of ocropus/ocrolib facilities, but it does not respect our coordinate consistency principle (by rotating the image without also enlarging it, thereby throwing away information at the corners and making follow-up steps in the workflow unpredictable – cf https://github.com/kba/ocrd_anybaseocr/issues/47).
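To make the point about lost corners concrete, here is a minimal sketch of the arithmetic, not code from either processor. The function name `rotated_canvas_size` is made up for illustration. It computes how large the canvas must become so that a rotated page still fits entirely; a deskewer that keeps the original size necessarily cuts off whatever falls outside it.

```python
import math

def rotated_canvas_size(width, height, angle_degrees):
    """Smallest axis-aligned canvas that fully contains a width x height
    image after rotating it by angle_degrees around its center."""
    a = math.radians(angle_degrees)
    c, s = abs(math.cos(a)), abs(math.sin(a))
    # Round before ceil so float noise (e.g. cos(90°) ≈ 6e-17) does not add a pixel.
    new_w = math.ceil(round(width * c + height * s, 6))
    new_h = math.ceil(round(width * s + height * c, 6))
    return new_w, new_h

if __name__ == "__main__":
    w, h = 2000, 3000
    new_w, new_h = rotated_canvas_size(w, h, 2.0)
    print(f"original {w}x{h}, needed after 2 degree rotation: {new_w}x{new_h}")
```

Even for a skew of a couple of degrees, the required canvas is noticeably larger than the original, so keeping the old size discards content at the corners and shifts the coordinate frame for later steps.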
Instead, if you use ocrd-cis-ocropy-deskew, you get a rebrand of ocropus/ocrolib that is not only more correct, but also more accurate: it adds a confidence threshold on top of the old implementation.
"cis-ocropy-segment -I OCR-D-CROP -O OCR-D-SEG-REG -p '{\"level-of-operation\":\"page\"}'"
I strongly recommend against this on the page level. This is a very crude attempt at building regions from lines. There are much better implementations for that. (However, line segmentation within regions is quite competitive with this processor.)
Use ocrd-tesserocr-segment-region, possibly in combination with ocrd-segment-repair plausibilize=true (for post-processing) and ocrd-cis-ocropy-clip on the region level (to suppress non-text within text regions).
"tesserocr-segment-line -I OCR-D-SEG-REG -O OCR-D-SEG-LINE" "cis-ocropy-clip -I OCR-D-SEG-LINE -O OCR-D-SEG-LINE-CLIP -p '{\"level-of-operation\":\"line\"}'"
If you use a bbox-only line segmentation, I highly recommend doing polygonalization right after that, not just clipping. Clipping is a dumb last resort operation. In contrast, polygonalization can be very accurate.
I recommend ocrd-cis-ocropy-resegment instead. (But also try using ocrd-cis-ocropy-segment on the region level, which is already polygonal, instead of Tesseract line segmentation plus postprocessing.)
Cf. https://hackmd.io/@FKFH0M1sR2SdJZwK5U8Cfg/S1YQ4NeNr#/3/4 ff.
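Putting the suggestions above together, the segmentation part of the workflow could look roughly like the following sketch. It only assembles an `ocrd process` call in the same step notation as the original workflow. The intermediate fileGrp names (`OCR-D-SEG-REPAIR`, `OCR-D-SEG-REG-CLIP`) and the helper `step` are assumptions made up for illustration, and the exact parameter spellings should be checked against each processor's own description.

```python
import json
import shlex

def step(processor, in_grp, out_grp, params=None):
    """Format one workflow step in the notation that `ocrd process` takes."""
    text = f"{processor} -I {in_grp} -O {out_grp}"
    if params:
        # Parameters are passed as a JSON object after -p.
        text += " -p " + shlex.quote(json.dumps(params))
    return text

# fileGrp names after OCR-D-SEG-REG are placeholders, not prescribed names.
steps = [
    step("tesserocr-segment-region", "OCR-D-CROP", "OCR-D-SEG-REG"),
    step("segment-repair", "OCR-D-SEG-REG", "OCR-D-SEG-REPAIR",
         {"plausibilize": True}),
    step("cis-ocropy-clip", "OCR-D-SEG-REPAIR", "OCR-D-SEG-REG-CLIP",
         {"level-of-operation": "region"}),
    # Polygonal line segmentation inside the regions found above.
    step("cis-ocropy-segment", "OCR-D-SEG-REG-CLIP", "OCR-D-SEG-LINE",
         {"level-of-operation": "region"}),
]

argv = ["ocrd", "process", *steps]

if __name__ == "__main__":
    # Print a shell-safe command line; run it with subprocess.run(argv) if desired.
    print(shlex.join(argv))
```

The alternative mentioned above would replace the last step with tesserocr-segment-line followed by cis-ocropy-resegment.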
Now as to the bug you found: thanks for reporting – I can reproduce. Even #120, which is related, does not help.
It appears that the results from ocrd-cis-ocropy-segment on the page level are riddled with coordinate invalidities and inconsistencies. (I should probably take this variant down completely.)
I could catch those cases here in the line segmentation of course. But this – again – raises the larger question of whether or not processors should be robust to invalid input. Instead of adding this kind of extra robustness to nearly all existing processors (and thereby bloating their code-base), IMHO we should enforce correct output with our validators.
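To illustrate what "enforce correct output with our validators" could mean for coordinates, here is a minimal sketch of such a check. It is not the OCR-D validator API; the function name `polygon_problems` and the particular set of checks are assumptions for illustration. It only covers a few basic conditions and does not detect self-intersections.

```python
def polygon_problems(points, page_width, page_height):
    """Return a list of human-readable problems for one polygon,
    given as a list of (x, y) integer tuples. Empty list means no
    problem was found by these (deliberately basic) checks."""
    problems = []
    if len(points) < 3:
        problems.append("fewer than 3 points")
    if any(x < 0 or y < 0 for x, y in points):
        problems.append("negative coordinate")
    if any(x > page_width or y > page_height for x, y in points):
        problems.append("point outside page")
    if len(points) >= 3:
        # Shoelace formula: twice the signed area of the polygon.
        closed = points[1:] + points[:1]
        area2 = sum(x1 * y2 - x2 * y1
                    for (x1, y1), (x2, y2) in zip(points, closed))
        if area2 == 0:
            problems.append("zero area")
    return problems

if __name__ == "__main__":
    print(polygon_problems([(0, 0), (10, 0), (10, 10), (0, 10)], 100, 100))
    print(polygon_problems([(0, 0), (-5, 3)], 100, 100))
```

Running a check like this once on each processor's output would locate the fault at its source, rather than requiring every downstream processor to guard against it.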
@kba @wrznr please comment!
Just before I forget to mention this later: this workflow is the »Good results for all pages« recommended on https://ocr-d.de/en/workflows .
My goodness! I wish I knew how they came up with that methodologically...
@bertsky Not much left to comment on: I always tell people not to use the segmentation facilities implemented in OCRopus. We should focus on Tesseract (morphology-based) and asap. kraken (data-driven).
processors should be robust to invalid input
No they do not have to. At least not for input which has been generated automatically by non-recommended workflows. We should think about extending kwalitee in the direction of recommended workflows however.
I always tell people not to use the segmentation facilities implemented in OCRopus. We should focus on Tesseract (morphology-based) and asap. kraken (data-driven).
I am confused:
processors should be robust to invalid input
No they do not have to
Fine, then this is agreed. Input can be assumed to be valid and consistent; otherwise there are no guarantees of accurate or even complete output. Maybe this should be stated more prominently in the specs!
At least not for input which has been generated automatically by non-recommended workflows
Note: this is a currently recommended workflow (unfortunately).
I am confused
No need to! I did not say anything about Ocropy. However, I think we should focus on one morphology-based segmentation method rather than being busy with keeping multiple ones up to date (i.e. with changes in core and spec). Tesseract seems to me the most likely candidate.
kraken
https://github.com/mittagessen/kraken/tree/blla Please note the small reservation "asap." in my proposal.
Note: this is a currently recommended workflow (unfortunately).
A discussion about those recommendations is imminent. It is also a question of who recommends what. Obviously the two of us (who could be called the most active users of the OCR-D tools for productive text digitization) do not recommend Ocropy for region segmentation.
I did not say anything about Ocropy
I misread your statement. You were only concerned with page/region segmentation. (I still do recommend Ocropy's line segmentation, because it's polygonal.)
However, I think we should focus on one morphology-based segmentation method
Agreed, but if we had a good X-Y cut or other rule-based approach available (and wrapped) without too much effort, I would be happy to keep it alive as an alternative implementation. Each method has its weaknesses. (Ideally we can later combine different segmentations anyway.)
This is now addressed (you could say solved) by cisocrgroup/ocrd_cis#47.
It also avoids the root cause of this issue in ocrd-tesserocr-segment-line (bad input from ocrd_cis), namely invalid polygons.
Original image: https://digi.ub.uni-heidelberg.de/diglitData/jb/02_-_arndt1710_-_000_096.tif (58 MB)