PRImA-Research-Lab / prima-core-libs

Core libraries by the PRImA Research Lab
Apache License 2.0

reader ignores index in ordered groups #13

Open bertsky opened 3 years ago

bertsky commented 3 years ago

AFAICS, the existing implementations for all versions of PAGE-XML ignore (OrderedGroup|OrderedGroupIndexed)/@index when parsing the XML.

This is how it looks:

https://github.com/PRImA-Research-Lab/prima-core-libs/blob/1f087a4378f58a34c83176ab0ffb620dd8b78f2d/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_2019_07_15.java#L335-L342

References to ATTR_index are nowhere to be found.

The group's model class, for its part, does nothing to check incoming indices; it simply appends the members as they arrive:

https://github.com/PRImA-Research-Lab/prima-core-libs/blob/1f087a4378f58a34c83176ab0ffb620dd8b78f2d/java/PrimaDla/src/org/primaresearch/dla/page/layout/logical/Group.java#L193-L199

This means that applications like PageViewer or PageConverter will use the XML order instead of the actual order laid out by the schema semantics. That in turn creates a problem for applications like OCR-D: which is the correct representation, the one shown by PageViewer or my strict implementation?
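To make the two readings concrete, here is a minimal sketch of what a strict, index-based interpretation could look like. It is not taken from prima-core-libs: the class `IndexedOrderSketch`, the `Member` type and the `readingOrder` method are hypothetical names, and `Member` only stands in for a parsed group child carrying `@regionRef` and `@index`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch; none of these names exist in prima-core-libs. */
final class IndexedOrderSketch {

    /** Stand-in for one parsed child of an ordered group. */
    static final class Member {
        final String regionRef; // value of @regionRef
        final int index;        // value of @index

        Member(String regionRef, int index) {
            this.regionRef = regionRef;
            this.index = index;
        }
    }

    /**
     * Returns the region references ordered by @index rather than by
     * their position in the XML document.
     */
    static List<String> readingOrder(List<Member> membersInXmlOrder) {
        // Copy first so the caller's list (the XML order) stays untouched.
        List<Member> sorted = new ArrayList<>(membersInXmlOrder);
        // List.sort is stable: members with equal indices keep their XML order.
        sorted.sort(Comparator.comparingInt((Member m) -> m.index));

        List<String> refs = new ArrayList<>();
        for (Member m : sorted) {
            refs.add(m.regionRef);
        }
        return refs;
    }
}
```

For members that appear in the XML as r1 (index 2), r2 (index 0) and r3 (index 1), `readingOrder` yields r2, r3, r1, whereas a reader that only appends in document order yields r1, r2, r3. How duplicate or missing indices should be treated is left open here; the sketch just falls back to XML order for ties.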

Here's an example of the difference this can make:

Contrary to what one might suspect at first glance, here it is PageViewer that gets the order wrong, along with the producing tool eynollah (which follows the same model of just looking at the XML order), so the two errors compensate for each other.

If my interpretation is wrong, please let me know soon. (I don't care about the fix so much as about clarity on the correct meaning of the standard, both for implementation in software and for adoption in derived specifications like OCR-D.)

If the better place is the PAGE-XML repo, please transfer.

mikegerber commented 3 years ago

I would also be very happy to know what PRImA-Research-Lab's view on the index value here is. 😀 I would interpret the schema description in the same way as @bertsky and I, too, think that the implementation in PAGE Viewer is therefore wrong as shown in the example. (In the example, XML order = correct reading order but the index values are essentially random values. These essentially random values should be interpreted as the order if our interpretation of the schema is correct.)