PAGE XML: explicit namespace prefixes are missing when writing new elements

Calamari-OCR / calamari

Line based ATR Engine based on OCRopy

Apache License 2.0


PAGE XML: explicit namespace prefixes are missing when writing new elements #310

andbue closed this 1 year ago

andbue commented 2 years ago

At the moment, calamari can read something like <pc:TextLine> from files created with ocrd tools, but when saving predictions it writes new elements such as <Word> without a namespace prefix.

Maybe something like etree.SubElement(line, f'{{{line.nsmap[line.prefix]}}}TextEquiv', nsmap=line.nsmap)?
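A minimal sketch of the difference, assuming lxml and a made-up one-line document (the snippet and variable names are for illustration only and are not calamari's writer code):

```python
from lxml import etree

# Illustrative PAGE-like input that uses an explicit "pc" prefix.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
xml = f'<pc:PcGts xmlns:pc="{PAGE_NS}"><pc:TextLine id="l1"/></pc:PcGts>'
root = etree.fromstring(xml)
line = root[0]

# Unqualified tag: the new element is created outside the PAGE namespace.
plain = etree.SubElement(line, "Word")

# Qualified tag (the suggestion above): the element gets the parent's
# namespace and is serialized with the same prefix.
qualified = etree.SubElement(
    line, f"{{{line.nsmap[line.prefix]}}}TextEquiv", nsmap=line.nsmap
)

print(etree.tostring(root, pretty_print=True).decode())
```

Note that indexing with line.nsmap[line.prefix] assumes the parent actually has a namespace declaration in scope; otherwise it raises a KeyError.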

bertsky commented 1 year ago

@andbue I am also hitting this problem. You closed the issue, but I cannot see an actual fix in the code, neither on master https://github.com/Calamari-OCR/calamari/blob/e6b57c8e72c29ddaeeb302a519c7aa5a42fede55/calamari_ocr/ocr/dataset/datareader/pagexml/reader.py#L557 nor on the tempscale branch.

The above recipe sounds plausible – have you tried it yet? Should I make an attempt and PR it?

andbue commented 1 year ago

I have no idea why I closed this... My fix seems to work with dummy XML (both with an explicit prefix and with the "None" prefix), but I don't know if it behaves well with real data. This helper function

from lxml import etree

def makeSubElementNS(parent, tag, attrib=None):
    # Qualify the tag with the parent's namespace (empty if none is declared).
    tag = '{' + parent.nsmap.get(parent.prefix, '') + '}' + tag
    return etree.SubElement(parent, tag, attrib=attrib, nsmap=parent.nsmap)

should work even if the namespace declaration is missing from the file for some reason.
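A rough way to check the helper against dummy XML, covering the explicit-prefix and default-namespace cases. The namespace URI, the documents, and the "w1" id are hypothetical and chosen only for this sketch:

```python
from lxml import etree

def makeSubElementNS(parent, tag, attrib=None):
    # Helper from this comment: qualify the tag with the parent's namespace.
    tag = '{' + parent.nsmap.get(parent.prefix, '') + '}' + tag
    return etree.SubElement(parent, tag, attrib=attrib, nsmap=parent.nsmap)

# Hypothetical namespace URI and dummy documents, for illustration only.
NS = "http://example.org/page"
prefixed = etree.fromstring(f'<pc:Page xmlns:pc="{NS}"><pc:TextLine/></pc:Page>')
default = etree.fromstring(f'<Page xmlns="{NS}"><TextLine/></Page>')

for root in (prefixed, default):
    line = root[0]
    word = makeSubElementNS(line, "Word", attrib={"id": "w1"})
    # In both cases the child shares the parent's namespace and prefix.
    assert etree.QName(word).namespace == NS
    assert word.prefix == line.prefix
    print(etree.tostring(root).decode())
```

This only exercises tiny hand-written inputs, so it says nothing about how the helper behaves on real PAGE files.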

If you could try that with some ocrd data and prepare a PR, this would be really helpful!

bertsky commented 1 year ago

That recipe works charmingly! (I have tested on input both with and without an NS prefix; the output now stays consistent with the input.) See #342.