altoxml / schema

ALTO XML schema - latest and all former versions

ALTO - PAGE xml: Object mapping and possible transformation generation #48

Open Jo-CCS opened 6 years ago

Jo-CCS commented 6 years ago

At the face-to-face conference in Vienna, the idea came up to define a conversion between PAGE and ALTO as a best-practice mapping between the objects of the two standards. If feasible, the transformation could be provided as XSLT.

The idea is to create a mapping between the latest ALTO version (4) and the upcoming PAGE version due in June, and from there to work backwards through earlier versions as far as this makes sense.

The goal is a common mapping solution, especially for objects where no exact match is possible and workarounds or compromises need to be defined.

chris1010010 commented 6 years ago

Document with list of features here: Doc

chris1010010 commented 4 years ago

I made a start here: prima-core-libs (Java), in XmlPageWriter_Alto.java. It can already convert the main things such as blocks, text lines, strings and glyphs with shapes, but there are still many ToDos.

Some issues that need discussing:

The idea is to extend the JPageConverter to accept ALTO as a target format. This has already been added but not tested: https://github.com/PRImA-Research-Lab/prima-page-converter
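
For readers unfamiliar with the two formats, here is a minimal Python sketch of the kind of element-level mapping described above (blocks, text lines, strings and glyphs with shapes). It is not the code in XmlPageWriter_Alto.java. The `PAGE_TO_ALTO` table and the `parse_points` and `to_alto` helpers are hypothetical names introduced for illustration. The sketch assumes un-namespaced input, integer coordinates, and that the PAGE `points` string can be copied unchanged into the ALTO `POINTS` attribute. It does not handle whitespace (SP), hyphenation (HYP) or margins.

```python
import xml.etree.ElementTree as ET

# Assumed PAGE -> ALTO element correspondence (illustrative only).
PAGE_TO_ALTO = {
    "TextRegion": "TextBlock",
    "TextLine": "TextLine",
    "Word": "String",
    "Glyph": "Glyph",
}

# ALTO String and Glyph carry their text in a CONTENT attribute.
TEXT_CARRIERS = {"Word", "Glyph"}


def parse_points(points):
    """Parse a PAGE Coords 'points' string such as '10,20 110,20 110,60'."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def to_alto(page_elem):
    """Recursively convert one PAGE layout element into its ALTO counterpart."""
    alto = ET.Element(PAGE_TO_ALTO[page_elem.tag])

    coords = page_elem.find("Coords")
    if coords is not None:
        pts = parse_points(coords.get("points"))
        xs = [x for x, _ in pts]
        ys = [y for _, y in pts]
        # ALTO positions elements by bounding box; PAGE only stores a polygon.
        alto.set("HPOS", str(min(xs)))
        alto.set("VPOS", str(min(ys)))
        alto.set("WIDTH", str(max(xs) - min(xs)))
        alto.set("HEIGHT", str(max(ys) - min(ys)))
        # Keep the exact outline as an ALTO Shape/Polygon.
        shape = ET.SubElement(alto, "Shape")
        ET.SubElement(shape, "Polygon", POINTS=coords.get("points"))

    if page_elem.tag in TEXT_CARRIERS:
        unicode_elem = page_elem.find("TextEquiv/Unicode")
        text = unicode_elem.text if unicode_elem is not None else None
        alto.set("CONTENT", text or "")

    # Only children with a known counterpart are converted; others are skipped.
    for child in page_elem:
        if child.tag in PAGE_TO_ALTO:
            alto.append(to_alto(child))
    return alto
```

Calling `to_alto` on a parsed PAGE `TextRegion` element returns an ALTO `TextBlock` subtree that can be serialised with `ET.tostring`. Anything without an entry in the table is silently dropped, which is exactly where an agreed best-practice mapping would be needed.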

cneud commented 4 years ago

@chris1010010 This is great for a head start, many thanks! I will also circulate this within the @OCR-D community for comments and contributions.

chris1010010 commented 4 years ago

@cneud Happy to discuss priorities and sharing of work to keep the momentum. Thorough testing is a big chunk of work that can be easily distributed.

chris1010010 commented 4 years ago

I made some progress in the Java converter. Open issues: SP, HYP, margins

cneud commented 4 years ago

FYI there is also ongoing work in the German OCR SIG to complete what Christian started, cf. https://github.com/maxnth/page-alto-ressources and https://github.com/maxnth/prima-core-libs/branches

artunit commented 3 years ago

As per the 2021-04-29 Board Meeting, I am linking the ocrd-page-to-alto TODO list here, which gives a nice summary of missing equivalencies. Kudos to everyone who has worked on this.