Open Jo-CCS opened 6 years ago
Document with list of features here: Doc
I made a start here: prima-core-libs (Java), in XmlPageWriter_Alto.java. It can already convert the main elements such as blocks, text lines, strings, and glyphs with shapes, but many ToDos remain.
Some issues that need discussing:
The idea is to extend the JPageConverter to accept ALTO as a target format. This has already been added but not yet tested: https://github.com/PRImA-Research-Lab/prima-page-converter
@chris1010010 This is a great head start, many thanks! I will also circulate this within the @OCR-D community for comments and contributions.
@cneud Happy to discuss priorities and sharing of work to keep the momentum. Thorough testing is a big chunk of work that can be easily distributed.
I made some progress in the Java converter. Open issues: ALTO's SP (space) and HYP (hyphenation) elements, and margins.
FYI there is also ongoing work in the German OCR SIG to complete what Christian started, cf. https://github.com/maxnth/page-alto-ressources and https://github.com/maxnth/prima-core-libs/branches
As per the 2021-04-29 Board Meeting, I am linking the ocrd-page-to-alto TODO list here, which gives a nice summary of missing equivalencies. Kudos to everyone who has worked on this.
At the face-to-face conference in Vienna, the idea came up to define a conversion between PAGE and ALTO as a best-practice mapping between the objects of the two standards. If feasible, the transformation could be provided as XSLT.
The idea is to create a mapping between the latest ALTO version 4 and the upcoming PAGE version in June, and from there to work backwards through earlier versions as far as this makes sense.
The goal is a common mapping solution, especially for objects where no exact match is possible and workarounds or compromises need to be defined.
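To make the kind of mapping under discussion concrete, here is a minimal Python sketch. It is not the code from prima-core-libs or ocrd-page-to-alto. It assumes simplified, namespace-free input, maps only the text hierarchy (PAGE TextRegion, TextLine, Word, Glyph to ALTO TextBlock, TextLine, String, Glyph), and illustrates two possible compromises: reducing a PAGE polygon to an ALTO bounding box, and synthesizing ALTO SP elements between words. The names `PAGE_TO_ALTO`, `bounding_box`, and `convert` are hypothetical.

```python
import xml.etree.ElementTree as ET

# PAGE element name -> ALTO element name (text hierarchy only; an assumption
# for illustration, not an agreed best-practice mapping).
PAGE_TO_ALTO = {
    "TextRegion": "TextBlock",
    "TextLine": "TextLine",
    "Word": "String",
    "Glyph": "Glyph",
}


def bounding_box(points):
    """Reduce a PAGE 'x1,y1 x2,y2 ...' polygon to ALTO-style box attributes."""
    xy = [tuple(int(v) for v in p.split(",")) for p in points.split()]
    xs = [x for x, _ in xy]
    ys = [y for _, y in xy]
    return {
        "HPOS": str(min(xs)),
        "VPOS": str(min(ys)),
        "WIDTH": str(max(xs) - min(xs)),
        "HEIGHT": str(max(ys) - min(ys)),
    }


def convert(page_elem):
    """Recursively convert a simplified PAGE element into an ALTO-like element."""
    alto = ET.Element(PAGE_TO_ALTO[page_elem.tag])
    coords = page_elem.find("Coords")
    if coords is not None:
        alto.attrib.update(bounding_box(coords.get("points")))
    if page_elem.tag in ("Word", "Glyph"):
        # PAGE keeps text in TextEquiv/Unicode; ALTO uses a CONTENT attribute.
        alto.set("CONTENT", page_elem.findtext("TextEquiv/Unicode", default=""))
    children = [c for c in page_elem if c.tag in PAGE_TO_ALTO]
    for i, child in enumerate(children):
        if i > 0 and child.tag == "Word":
            # Workaround: insert an explicit SP element between consecutive words.
            ET.SubElement(alto, "SP")
        alto.append(convert(child))
    return alto


page = ET.fromstring(
    '<TextRegion><Coords points="10,10 110,10 110,40 10,40"/>'
    "<TextLine>"
    '<Word><Coords points="10,10 50,10 50,40 10,40"/>'
    "<TextEquiv><Unicode>Hello</Unicode></TextEquiv></Word>"
    "<Word><TextEquiv><Unicode>world</Unicode></TextEquiv></Word>"
    "</TextLine></TextRegion>"
)
alto = convert(page)
print(ET.tostring(alto, encoding="unicode"))
```

A real converter would additionally have to handle namespaces, reading order, ALTO Shape/Polygon output, HYP, and margins, which are exactly the cases where an agreed mapping is needed.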