PRImA-Research-Lab / PAGE-XML

PAGE XML format collection for document image page content and more
Apache License 2.0

Add unrestricted custom elements #19

Open bertsky opened 4 years ago

bertsky commented 4 years ago

I don't know if this is related to #18, but generally people have been trying to add annotations to (earlier versions of) PAGE's PcGts, and – due to the lack of support for a free sub-namespace – have spawned their own namespace.

PAGE-XML does offer @custom and @comments attributes everywhere and the predefined UserDefinedType, but this is not nearly as powerful/expressive as an arbitrary XML subtree.

In comparison, ALTO-XML has XmlData for that purpose, and it uses:

<xsd:any namespace="##any" processContents="lax" maxOccurs="unbounded"/>

One example of where this could be useful is for holding an OCR hypotheses lattice without changing the namespace.
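To make the mechanism concrete, here is a minimal sketch of how such a wildcard is typically wrapped in a container element. This is not copied from the ALTO schema: the standalone global element and `minOccurs="0"` are assumptions made to keep the fragment self-contained.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Sketch only: a container whose children may come from any namespace -->
  <xsd:element name="XmlData">
    <xsd:complexType>
      <xsd:sequence>
        <!-- processContents="lax": a child is validated only if a
             declaration for it can be found; otherwise it is accepted as-is -->
        <xsd:any namespace="##any" processContents="lax"
                 minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

With `lax` processing, existing validators would keep accepting documents that carry foreign subtrees, even when no schema for those subtrees is available.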

bertsky commented 4 years ago

@chris1010010 would you care for a PR?

chris1010010 commented 4 years ago

I think the majority vote was not to use 'any' for the time being, but I'll raise this again with the others. How about 'anyAttribute' for selected elements? Would that be useful?

bertsky commented 4 years ago

> How about 'anyAttribute' for selected elements? Would that be useful?

What do you mean?

What ALTO does is allow (any number of) XmlData elements under TagType (i.e. /alto/Tags/LayoutTag|StructureTag|RoleTag|NamedEntityTag|OtherTag/XmlData), which can then have arbitrary child elements of an arbitrary namespace.

For PAGE we would still have to decide under which path such free content elements make most sense. Maybe Labels (which is under Metadata and Page and all regions)?
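As a purely hypothetical sketch of what that could look like in an instance document: the `XmlData` element under `Labels` does not exist in PAGE today, and the `lat:` namespace, its element names, the coordinates, and the sample values are all invented for illustration.

```xml
<TextRegion id="r1"
    xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Coords points="10,10 200,10 200,50 10,50"/>
  <Labels>
    <!-- HYPOTHETICAL: free-content container, not part of the current schema -->
    <XmlData>
      <!-- Foreign-namespace subtree, e.g. an OCR hypotheses lattice -->
      <lat:Lattice xmlns:lat="http://example.org/ocr-lattice">
        <lat:Arc from="0" to="1" symbol="f" weight="0.7"/>
        <lat:Arc from="0" to="1" symbol="s" weight="0.3"/>
      </lat:Lattice>
    </XmlData>
  </Labels>
</TextRegion>
```

The document itself stays in the PAGE namespace; only the subtree under the container uses a foreign one, so tools that do not know the lattice format could simply skip it.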