docs/ocrd-training: export from OCR-D toolchain

bertsky commented 4 years ago

I am not sure I have a good grasp of what docs/ocrd-training.md is ultimately intended for, but as it stands, I think the page should at least link to (or better, describe) the 2 very different options we currently have for extracting line images and their metadata from PAGE-XML annotations:

- page2img.py: minimal dependencies (no ocrd, only lxml), but also minimal functionality
- ocrd-segment-extract-lines: normal OCR-D processor, capable of utilising/respecting all information the workflow provides:
  - AlternativeImage anywhere along the hierarchy (e.g. binarization or dewarping)
  - @orientation on page or region level (i.e. cropping the minimal bounding box after deskewing)
  - Coords/@points as polygon, not just bounding box (masking the pixels outside; optionally with alpha channel)
  - line text (.gt.txt) and metadata (IDs among PAGE hierarchy and METS, script/language features etc., region @type, page @type, image preprocessing features, image DPI value)
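
To make the "minimal" end of that spectrum concrete, here is a rough sketch of what a dependency-light extraction has to do: read each TextLine, turn Coords/@points into a bounding box, and pick up the line-level text. This is not the code of page2img.py or ocrd-segment-extract-lines; it uses Python's standard xml.etree instead of lxml so that it runs anywhere, and the helper names (parse_points, bounding_box, iter_lines) and the sample document are made up for illustration. It deliberately ignores AlternativeImage, @orientation and polygon masking, which is exactly what the OCR-D processor adds on top.

```python
import xml.etree.ElementTree as ET


def parse_points(points):
    """Turn a PAGE points string like '10,5 90,8' into [(10, 5), (90, 8)]."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def bounding_box(polygon):
    """Return (left, top, right, bottom) of a list of (x, y) points."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def iter_lines(page_xml):
    """Yield (line id, bounding box, text) for every TextLine in a PAGE-XML string."""
    root = ET.fromstring(page_xml)
    # PAGE-XML exists in several schema versions, so take the namespace
    # from the root element instead of hard-coding one.
    ns = root.tag[1:root.tag.index("}")] if root.tag.startswith("{") else ""

    def q(name):
        return f"{{{ns}}}{name}" if ns else name

    for line in root.iter(q("TextLine")):
        coords = line.find(q("Coords"))
        if coords is None:
            continue
        polygon = parse_points(coords.get("points"))
        # Only the TextEquiv that is a direct child of the line,
        # not the ones attached to words or glyphs below it.
        unicode_el = line.find(f"{q('TextEquiv')}/{q('Unicode')}")
        text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
        yield line.get("id"), bounding_box(polygon), text


# Hypothetical single-line document, for demonstration only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.png" imageWidth="100" imageHeight="50">
    <TextRegion id="r1">
      <Coords points="0,0 100,0 100,50 0,50"/>
      <TextLine id="r1_l1">
        <Coords points="10,5 90,8 88,20 12,18"/>
        <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

for line_id, box, text in iter_lines(SAMPLE):
    print(line_id, box, text)
```

The bounding box would then be used to crop the page image, and the text written next to the crop as a .gt.txt file; both steps are left out here because they depend on the imaging library in use.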

kba commented 4 years ago

That page was supposed to provide a "running start" for @Doreenruirui when she started working on what would become okralact. It is true though that we should provide an actual guide on training and your suggestions are welcome.

bertsky commented 4 years ago

Understood.

Another thing that this page or guide should mention are the converters for page segmentation training data. With ocrd-segment-from-masks and ocrd-segment-from-coco we have 2 importers, and with the debug images and COCO output of ocrd-segment-extract-pages we have 2 exporters for commonly used non-PAGE formats.
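
For readers unfamiliar with what such a conversion involves, the following is a small sketch of mapping one PAGE polygon onto a COCO-style annotation. The field names (segmentation, bbox, area, iscrowd) follow the general COCO object-detection annotation layout; whether ocrd-segment-extract-pages emits exactly these fields is not something this sketch claims. The function names and the ID values in the usage line are hypothetical.

```python
def polygon_area(polygon):
    """Area of a simple polygon given as [(x, y), ...], via the shoelace formula."""
    total = 0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def to_coco_annotation(polygon, annotation_id, image_id, category_id):
    """Build a COCO-style annotation dict from a list of (x, y) points."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    left, top = min(xs), min(ys)
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_id,
        # COCO stores a polygon as one flat [x1, y1, x2, y2, ...] list.
        "segmentation": [[coord for point in polygon for coord in point]],
        # COCO bounding boxes are [x, y, width, height], not corner pairs.
        "bbox": [left, top, max(xs) - left, max(ys) - top],
        "area": polygon_area(polygon),
        "iscrowd": 0,
    }


# Hypothetical region polygon and IDs, for demonstration only.
region = [(0, 0), (100, 0), (100, 50), (0, 50)]
annotation = to_coco_annotation(region, annotation_id=1, image_id=1, category_id=1)
print(annotation)
```

An importer has to do the reverse: regroup the flat segmentation list into coordinate pairs and write them back as Coords/@points.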

bertsky commented 3 years ago

Can perhaps be closed – there's a section on the ocrd_segment converters in https://ocr-d.de/en/workflows#step-19-format-conversion now. (And page2img is independent of OCR-D and most OCR tools: tesstrain will probably include its own PAGE converter and Calamari already does. If you do mention it somewhere, then please don't forget https://github.com/uniwue-zpd/PAGETools, too.)

kba commented 1 year ago

I think these are now addressed and the originally referenced page has been removed, so closing.

OCR-D / ocrd-website

docs/ocrd-training: export from OCR-D toolchain #101