Closed mikegerber closed 1 month ago
The XSL which is used for that conversion is too simple to handle more complex PAGE XML. I don't know whether a better XSL is available from other projects.
Did you try whether `JPageConverter`, which is already used for the PAGE to ALTO conversion, does a better job? As far as I know it can also produce text.
I would consider this a serious bug, not an enhancement.
It's both. PAGE XML is complex, so I would never expect a perfect tool which supports all of its features.
It's not an imperfection from not supporting some features. It's producing a wrong result for a lot of real-world PAGE XML files, because it's not honoring the reading order.
The texts in the XML file also look strange when I look at them with `less` or `vi` ("Mglikeit"). Do you use some special encoding? It's not UTF-8!
The file in https://github.com/UB-Mannheim/ocr-fileformat/issues/138#issue-895785528 was created (by an SBB contractor) using Aletheia and uses their encoding scheme, which relies on a lot of PUA (Private Use Area) characters and is partly based on MUFI (see https://www.primaresearch.org/www/assets/tools/Special%20Characters%20in%20Aletheia.pdf). So it is UTF-8, but with private-use characters. But encoding is an entirely different beast :-) (`dinglehopper-extract` gives different characters due to normalization, but that's not the issue here.)
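To check whether a string that "looks strange" in a terminal is valid UTF-8 that merely contains private-use characters, one option is to look at the Unicode general category of each character. The following is a minimal sketch, not part of any tool mentioned in this thread; the function names are made up for illustration, and U+E000 is an arbitrary private-use code point, not one taken from the Aletheia or MUFI tables.

```python
import unicodedata


def private_use_chars(text):
    """Return the distinct private-use characters in text, sorted by code point."""
    # Private-use code points have the Unicode general category "Co".
    return sorted({ch for ch in text if unicodedata.category(ch) == "Co"})


def describe(text):
    """Format the private-use characters of text as U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in private_use_chars(text)]


# Hypothetical sample: U+E000 stands in for whatever PUA character a file uses.
sample = "ab\ue000c"
print(describe(sample))        # ['U+E000']
print(sample.encode("utf-8"))  # encodes without error, so it is valid UTF-8
```

A non-empty result means the text is well-formed but needs a font or mapping table that knows those code points, which would be consistent with what `less` or `vi` show.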
Sorry, did not see this earlier. But I had the exact same use case. It's not so difficult to properly handle PAGE reading order in XSLT 1.0. This was solved along with https://github.com/UB-Mannheim/ocr-fileformat/pull/151.
(You can even pass XSLT parameters for which hierarchy level you want to extract from (default is highest) or which separators to use for concatenation: https://github.com/UB-Mannheim/ocr-fileformat/blob/3e32ef632ff439710d123ba700364703d07b47a9/xslt/page__text.xsl#L14-L21. See `ocr-transform page text --help-args`.)
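For readers who want to see the idea outside of XSLT, here is a rough Python sketch of reading-order-aware extraction. It is not the stylesheet from the pull request. It assumes a single flat `OrderedGroup` of `RegionRefIndexed` entries (nested or unordered groups are not handled), it takes only region-level `TextEquiv/Unicode`, and it silently skips regions that the reading order does not reference. The sample document, its namespace version, and the function names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def local(el):
    """Return the tag name without its XML namespace."""
    return el.tag.rsplit("}", 1)[-1]


def region_text(region):
    """Return the TextEquiv/Unicode text that is a direct child of the region."""
    for child in region:
        if local(child) == "TextEquiv":
            for sub in child:
                if local(sub) == "Unicode":
                    return sub.text or ""
    return ""


def page_text(xml_string, separator="\n"):
    """Concatenate region texts in reading order (flat OrderedGroup only)."""
    root = ET.fromstring(xml_string)
    regions = {el.get("id"): el for el in root.iter() if local(el) == "TextRegion"}
    # Sort the indexed references numerically, not by their position in the file.
    refs = [el for el in root.iter() if local(el) == "RegionRefIndexed"]
    refs.sort(key=lambda el: int(el.get("index")))
    ordered_ids = [el.get("regionRef") for el in refs]
    if not ordered_ids:
        ordered_ids = list(regions)  # no reading order: fall back to document order
    return separator.join(region_text(regions[i]) for i in ordered_ids if i in regions)


# Made-up sample in which document order and reading order disagree.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r2"/>
        <RegionRefIndexed index="1" regionRef="r1"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1"><TextEquiv><Unicode>second</Unicode></TextEquiv></TextRegion>
    <TextRegion id="r2"><TextEquiv><Unicode>first</Unicode></TextEquiv></TextRegion>
  </Page>
</PcGts>"""

print(page_text(SAMPLE))  # prints "first" then "second"
```

The `separator` argument plays a role similar to the separator parameters mentioned above, but its name and default are choices made for this sketch only.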
Probably fixed in #151
`page__text.xsl` is not honoring the reading order in the PAGE-XML (`pc:ReadingOrder`), which gives completely false results. For this page, I get this text (shortened):
For comparison, `dinglehopper-extract` gives the correct text:
Image from the ZIP (converted to JPEG), for easier understanding: