Closed (by bertsky, 1 year ago)
That page was supposed to provide a "running start" for @Doreenruirui when she started working on what would become okralact. It is true though that we should provide an actual guide on training and your suggestions are welcome.
Understood.
Another thing that this page or guide should mention is converters for page segmentation training data. With `ocrd-segment-from-masks` and `ocrd-segment-from-coco` we have 2 importers, and with the debug images and COCO output of `ocrd-segment-extract-pages` we have 2 exporters for commonly used non-PAGE formats.
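To give a rough idea of what such a conversion involves, here is a minimal sketch of mapping one PAGE polygon to a COCO-style annotation record. It is not the code of any of the `ocrd_segment` tools; the function names and the ID arguments are hypothetical, and only the general COCO annotation fields (`segmentation`, `bbox`, `area`, `iscrowd`) are assumed.

```python
def parse_points(points):
    """Parse a PAGE Coords/@points string ("x1,y1 x2,y2 ...") into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def polygon_area(polygon):
    """Absolute polygon area via the shoelace formula."""
    total = 0
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def to_coco_annotation(ann_id, image_id, category_id, points):
    """Build one COCO-style annotation dict from a PAGE points string.

    Hypothetical helper for illustration only: the IDs are whatever the
    caller uses to index images and region categories.
    """
    polygon = parse_points(points)
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        # COCO stores a polygon as one flat [x1, y1, x2, y2, ...] list
        "segmentation": [[c for xy in polygon for c in xy]],
        # COCO bounding boxes are [x, y, width, height]
        "bbox": [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)],
        "area": polygon_area(polygon),
        "iscrowd": 0,
    }
```

A complete COCO file would additionally need the `images` and `categories` lists that these IDs refer to.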
Can perhaps be closed – there's a section on the ocrd_segment converters in https://ocr-d.de/en/workflows#step-19-format-conversion now. (And page2img is independent of OCR-D and most OCR tools: tesstrain will probably include its own PAGE converter and Calamari already does. If you do mention it somewhere, then please don't forget https://github.com/uniwue-zpd/PAGETools, too.)
I think these are now addressed and the originally referenced page removed, so closing.
I am not sure I have a good grasp of what is ultimately intended by `docs/ocrd-training.md`, but as it stands, I think the page should at least link to (or better, describe) the 2 options we currently have to extract line images and respective metadata from PAGE-XML annotations:

- […] `ocrd`, only `lxml`), but also minimal functionality
- […]
  - `AlternativeImage` anywhere along the hierarchy (e.g. binarization or dewarping)
  - `@orientation` on page or region level (i.e. cropping the minimal bounding box after deskewing)
  - `Coords/@points` as polygon, not just bounding box (masking the pixels outside; optionally with alpha channel)
  - […] (`.gt.txt`) and metadata (IDs among PAGE hierarchy and METS, script/language features etc., region `@type`, page `@type`, image preprocessing features, image DPI value)
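To illustrate the polygon-versus-bounding-box point, here is a minimal standard-library sketch of what any line extractor has to do with `Coords/@points`. It is not the implementation of either tool mentioned in this thread. All function names are hypothetical, the namespace wildcard `{*}` assumes Python 3.8 or later, and deskewing, `AlternativeImage` selection and actual image I/O are left out.

```python
import xml.etree.ElementTree as ET


def parse_points(points):
    """Parse a PAGE Coords/@points string ("x1,y1 x2,y2 ...") into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def bounding_box(polygon):
    """Minimal bounding box as (left, top, right, bottom)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def point_in_polygon(x, y, polygon):
    """Even-odd ray casting test for a single point."""
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def alpha_mask(polygon):
    """Alpha channel for the bounding-box crop: 255 inside the polygon, 0 outside.

    Pixels are tested at their centres; rows run from top to bottom.
    """
    left, top, right, bottom = bounding_box(polygon)
    return [
        [255 if point_in_polygon(x + 0.5, y + 0.5, polygon) else 0
         for x in range(left, right)]
        for y in range(top, bottom)
    ]


def extract_lines(page_xml):
    """Collect ID, polygon, bounding box and text of every TextLine in a PAGE-XML string."""
    root = ET.fromstring(page_xml)
    lines = []
    for line in root.findall(".//{*}TextLine"):  # any PAGE namespace version
        coords = line.find("{*}Coords")
        if coords is None:
            continue
        polygon = parse_points(coords.get("points"))
        unicode_el = line.find("{*}TextEquiv/{*}Unicode")
        text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
        lines.append({
            "id": line.get("id"),
            "polygon": polygon,
            "bbox": bounding_box(polygon),
            "text": text,
        })
    return lines
```

In a real extractor, the bounding box would be used to crop the page image, the mask would be applied to that crop (or attached as an alpha channel), and the text would go into the `.gt.txt` file next to the line image.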