What tools are available for handling Page-XML?

proycon commented 3 years ago

What tools do we already have available in and around CLARIAH for dealing with Page-XML? This question arose mostly out of the Golden Agents project but I think we may just as well discuss it in the CLARIAH text group here.

We developed one tool in the scope of foliautils:

FoLiA-page - Converts Page-XML to FoLiA, incorporates references to the original document and can translate existing word items. (designed as a preprocessing step to enable ticcl to work on Page-XML input)

I have one main question:

Do we have a tool that can interpret the coordinate information present in Page-XML and extract the proper 'reading order' for elements? Especially in the context of multi-column layout? (FoLiA-page doesn't do this). I think I heard @rvankoert mentioned in the Golden Agents meeting as possibly having a solution for this?

Tagging @marijnkoolen, @gijsjan and @LvanWissen, @menzowindhouwer for respectively the Republic and Golden Agents projects.

@gijsjan: If I'm not mistaken, Docere visualizes Page-XML alongside the source image, right?

marijnkoolen commented 3 years ago

For Republic, I have halfway decent generic Python code to read PageXML and parse into some default elements from the physical structure: scan, page, column, textregion, textline, word. The page and column elements are generated by my code, as they're not part of the PageXML spec.
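
To make that element hierarchy concrete, here is a minimal sketch of reading textregions and textlines with their coordinates using only the standard library. It is not the Republic code. The sample document, the dictionary layout and the function names are assumptions for illustration, and it assumes a namespaced PageXML file in which `Coords` carries a `points` attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal PageXML document, for illustration only.
SAMPLE = """<?xml version="1.0"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="scan_0001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,400 100,400"/>
      <TextLine id="r1_l1">
        <Coords points="100,100 900,100 900,200 100,200"/>
        <TextEquiv><Unicode>Example line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def parse_points(points):
    """Turn a 'points' string such as '1,2 3,4' into [(1, 2), (3, 4)]."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def bounding_box(points):
    """Return (left, top, right, bottom) for a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def parse_scan(xml_text):
    """Parse one PageXML document into nested dicts: scan > textregion > textline."""
    root = ET.fromstring(xml_text)
    # The namespace differs between PageXML versions, so take it from the root tag.
    ns = {"pc": root.tag[1:].split("}")[0]}
    page = root.find("pc:Page", ns)
    scan = {"image": page.get("imageFilename"), "text_regions": []}
    for region in page.findall("pc:TextRegion", ns):
        region_points = parse_points(region.find("pc:Coords", ns).get("points"))
        region_doc = {"id": region.get("id"), "box": bounding_box(region_points), "lines": []}
        for line in region.findall("pc:TextLine", ns):
            line_points = parse_points(line.find("pc:Coords", ns).get("points"))
            text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=ns)
            region_doc["lines"].append(
                {"id": line.get("id"), "box": bounding_box(line_points), "text": text}
            )
        scan["text_regions"].append(region_doc)
    return scan


scan = parse_scan(SAMPLE)
print(scan["text_regions"][0]["lines"][0]["text"])
```

Generated levels such as page and column would be added on top of this by grouping regions, which is where the project-specific choices come in.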

Strictly speaking, Page is part of PageXML, but there it refers to the whole scan, which can be more or less than a single physical page. So I distinguish between scan (the whole image) and page (some region of the image that should correspond to a single page). Of course, which region corresponds to a page is image-dependent and project-specific and cannot be determined generically.

I think PageXML has some way to express reading order, but the order itself is pretty much image-dependent, differs strongly per scan and can be difficult to determine. So whether the recorded reading order makes sense depends on whether it has been made an explicit part of the document model, or on whether the tool that produced the PageXML was trained to detect it.
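
For reference, the PAGE schema has a `ReadingOrder` element whose `RegionRefIndexed` children point at region ids. The following is a rough sketch of reading it back, assuming a single flat `OrderedGroup`; the sample document and the function name are made up for illustration, and nested or unordered groups are not handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-column scan in which the regions are stored out of order.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="scan_0001.jpg" imageWidth="2000" imageHeight="3000">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="1" regionRef="r_right"/>
        <RegionRefIndexed index="0" regionRef="r_left"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r_right">
      <Coords points="1100,100 1900,100 1900,400 1100,400"/>
    </TextRegion>
    <TextRegion id="r_left">
      <Coords points="100,100 900,100 900,400 100,400"/>
    </TextRegion>
  </Page>
</PcGts>"""


def explicit_reading_order(xml_text):
    """Return region ids in the order given by ReadingOrder, or None if absent."""
    root = ET.fromstring(xml_text)
    uri = root.tag[1:].split("}")[0]  # namespace taken from the root tag
    order = root.find(".//{%s}ReadingOrder" % uri)
    if order is None:
        return None
    refs = list(order.iter("{%s}RegionRefIndexed" % uri))
    refs.sort(key=lambda ref: int(ref.get("index")))
    return [ref.get("regionRef") for ref in refs]


print(explicit_reading_order(SAMPLE))
```

This only reports what the file claims. Whether that order is trustworthy still depends on how it was produced.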

I also have terrible Python code that I'm currently updating for modelling elements from the logical structure (e.g. chapters, sections, paragraphs, tables, or in the case of Republic: resolutions, attendance lists, index entries, etc.). Logical elements can also be hierarchical, and at each level, they can have elements from the physical structure, so there is a correspondence between logical elements and the PageXML and the image coordinates.

But the logical structure is very project-specific so I doubt there is much that can be made generic.

proycon commented 3 years ago

@marijnkoolen Thanks! I suppose most of the code you mention is in https://github.com/HuygensING/republic-project ? I'll have a browse around there. Even though some things may be project-specific, it might still be useful or worth expanding upon for Golden Agents.

rvankoert commented 3 years ago

Determining reading order in general is not an easy task and depends highly on the specific documents. In Republic the layout is fairly straightforward most of the time. For other projects I also deal with more complex layouts, ranging from circular layouts to newspapers to annotations on annotations on annotations on annotations. There is some tooling available, but most of it is in an experimental phase or can only deal with simple layouts. At the ICDAR conference it is still a topic of active research.

Reading order can be set in PageXML. For wp2 we will create some stuff that does basic reading order detection.
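
To give an idea of what "basic" could mean here, below is a naive heuristic for simple multi-column pages: group regions into columns by horizontal overlap, then read each column top to bottom, left to right. This is only a sketch under stated assumptions, not the wp2 implementation. The function name and the `(id, (left, top, right, bottom))` input format are hypothetical, and a full-width header or any of the complex layouts mentioned above would break it.

```python
def naive_reading_order(regions):
    """Order regions column by column (left to right), top to bottom within a column.

    `regions` is a list of (region_id, (left, top, right, bottom)) tuples.
    """
    columns = []
    # Visit regions from left to right by their left edge.
    for region_id, (left, top, right, bottom) in sorted(regions, key=lambda r: r[1][0]):
        if columns and left < columns[-1]["right"]:
            # Starts before the current column ends: treat it as the same column.
            columns[-1]["regions"].append((top, region_id))
            columns[-1]["right"] = max(columns[-1]["right"], right)
        else:
            # Starts to the right of everything seen so far: open a new column.
            columns.append({"right": right, "regions": [(top, region_id)]})
    # Within each column, sort by the top edge.
    return [rid for column in columns for _, rid in sorted(column["regions"])]


regions = [
    ("c2_top", (1100, 100, 1900, 400)),
    ("c1_bottom", (100, 500, 900, 900)),
    ("c2_bottom", (1100, 500, 1900, 900)),
    ("c1_top", (120, 100, 900, 400)),
]
print(naive_reading_order(regions))
```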

Best, Rutger

marijnkoolen commented 3 years ago

> @marijnkoolen Thanks! I suppose most of the code you mention is in https://github.com/HuygensING/republic-project ? I'll have a browse around there. Even though some things may be project-specific, it might still be useful or worth expanding upon for Golden Agents.

@proycon Please wait a few days before having a look. I'm currently rewriting the PageXML parsing bit and have created new classes for those generic elements, but still have to push my code. Once done, I'll let you know where you can find that.

proycon commented 3 years ago

@marijnkoolen Ok! Thanks!

marijnkoolen commented 3 years ago

Okay @proycon, I've pushed some updates. A relatively generic PageXML parser (though it only assumes the elements and properties that are used in the Republic PageXML output, not the full PageXML spec) can be found here: https://github.com/HuygensING/republic-project/blob/master/republic/parser/pagexml/generic_pagexml_parser.py

The parser returns a PageXMLScan object that consists of smaller objects (PageXMLTextRegion, PageXMLTextLine and PageXMLWord), which are part of the physical document structure (and inherit from the PhysicalStructureDoc class) and therefore have coordinates in the scan. There's also a generic class for the logical structure, which can contain elements from the physical structure but itself has no direct connection to the scan.
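
As a simplified illustration of that split (not the actual classes in the repository; `PhysicalDoc`, `LogicalDoc` and their fields are hypothetical names), a logical element can be modelled as a tree that holds references to physical elements and reaches image coordinates only through them:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in scan pixels


@dataclass
class PhysicalDoc:
    """An element with coordinates in the scan, e.g. a textline."""
    doc_id: str
    box: Box
    text: str = ""


@dataclass
class LogicalDoc:
    """An element of the logical structure, e.g. a resolution or a paragraph."""
    doc_id: str
    doc_type: str
    children: List["LogicalDoc"] = field(default_factory=list)
    physical: List[PhysicalDoc] = field(default_factory=list)

    def all_physical(self) -> List[PhysicalDoc]:
        """Collect physical elements of this element and all its descendants."""
        found = list(self.physical)
        for child in self.children:
            found.extend(child.all_physical())
        return found

    def boxes(self) -> List[Box]:
        """Image coordinates are only reachable via the physical elements."""
        return [doc.box for doc in self.all_physical()]


line_1 = PhysicalDoc("l1", (100, 100, 900, 150), "First line")
line_2 = PhysicalDoc("l2", (100, 160, 900, 210), "Second line")
line_3 = PhysicalDoc("l3", (1100, 100, 1900, 150), "Continues in the next column")

resolution = LogicalDoc(
    "res1",
    "resolution",
    children=[
        LogicalDoc("p1", "paragraph", physical=[line_1, line_2]),
        LogicalDoc("p2", "paragraph", physical=[line_3]),
    ],
)
print(resolution.boxes())
```

Keeping coordinates out of the logical classes means a logical element can span columns, pages or scans without needing a single bounding box of its own.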

The document models are here: https://github.com/HuygensING/republic-project/blob/master/republic/model/physical_document_model.py

No doubt there's still a lot of Republic-specific stuff going on, but I think the most important part is the distinction between physical and logical structures and how they map onto each other. Doing that right saves a lot of headache later on.

Anyway, I hope it can be of some use. If you think it's worth reusing, I should probably turn that part of the code into its own repo.

proycon commented 3 years ago

Thanks for the sources! I'll have to take a deeper look still, but it will hopefully prevent doing unnecessary duplicate work!

CLARIAH / IG-Text

What tools are available for handling Page-XML? #10