OCR-D / page-to-alto

Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)
Apache License 2.0

fix schemaLocation syntax for ALTO ns #15

Closed bertsky closed 3 years ago

bertsky commented 3 years ago

Fixes #14.

bertsky commented 3 years ago

But I'll probably add a few more commits to allow selecting a different (older) ALTO version – if you don't mind @kba?

kba commented 3 years ago

But I'll probably add a few more commits to allow selecting a different (older) ALTO version – if you don't mind @kba?

I do not :) Just give me the heads-up when it's ready.

bertsky commented 3 years ago

There's a problem though: Some breaking schema changes of the past have been reflected neither in a new namespace name nor in a new namespace version. Example:

Version 2.1 includes the following changes:

  • Page and BlockType element HEIGHT, WIDTH, HPOS, VPOS attribute types changed to xsd:float from xsd:int.

bertsky commented 3 years ago

Background BTW: https://github.com/kitodo/kitodo-presentation/issues/488

kba commented 3 years ago

There's a problem though: Some breaking schema changes of the past have been reflected neither in a new namespace name nor in a new namespace version. Example:

Version 2.1 includes the following changes:

  • Page and BlockType element HEIGHT, WIDTH, HPOS, VPOS attribute types changed to xsd:float from xsd:int.

I see no other way than to do a conditional int(value) (or whatever the case) depending on the target version for these cases :/

bertsky commented 3 years ago

I see no other way than to do a conditional int(value) (or whatever the case) depending on the target version for these cases :/

Yes. Luckily we decided to do all that in Python – and not impoverished XSLT

bertsky commented 3 years ago

I see no other way than to do a conditional int(value) (or whatever the case) depending on the target version for these cases :/

Yes. Luckily we decided to do all that in Python – and not impoverished XSLT

Thinking about it: in this direction, it's not a problem at all: we always get strings that can only be interpreted as int from PAGE, which will parse as both int and float when written as ALTO.
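
A minimal sketch of the conditional conversion discussed above. The helper name `format_coordinate` and the `(major, minor)` version tuple are assumptions for illustration, not taken from the page-to-alto code. It shows that integer strings from PAGE serialize identically for both attribute types, and that only non-integer values would need rounding for targets older than 2.1.

```python
def format_coordinate(value, alto_version):
    """Serialize a coordinate for a target ALTO version (hypothetical helper).

    alto_version is a (major, minor) tuple such as (2, 0) or (4, 1).
    """
    number = float(value)
    if alto_version < (2, 1):
        # Before ALTO 2.1 the HEIGHT/WIDTH/HPOS/VPOS attributes were xsd:int,
        # so the literal must not contain a decimal point.
        return str(int(round(number)))
    # From 2.1 on the type is xsd:float; keep whole numbers free of ".0"
    # so the output does not change between versions unnecessarily.
    return str(int(number)) if number.is_integer() else repr(number)
```

With integer input, as PAGE provides, the result is the same string for every target version, which is why the type change is harmless in this direction.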

bertsky commented 3 years ago

I'll probably add a few more commits to allow selecting a different (older) ALTO version

Done.

diff -u <(page-to-alto --alto-version 2.0 --no-check-border --log-level OFF 00000012.page.xml) <(page-to-alto --alto-version 4.1 --no-check-border --log-level OFF 00000012.page.xml)
--- /dev/fd/63  2021-06-17 19:25:15.685117779 +0200
+++ /dev/fd/62  2021-06-17 19:25:15.685117779 +0200
@@ -1,5 +1,5 @@
 <?xml version='1.0' encoding='UTF-8' standalone='yes'?>
-<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.loc.gov/standards/alto/ns-v4#" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v4# http://www.loc.gov/standards/alto/v2/alto-2-0.xsd">
+<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.loc.gov/standards/alto/ns-v4#" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v4# http://www.loc.gov/standards/alto/v4/alto-4-1.xsd" SCHEMAVERSION="4.1">
   <Description>
     <MeasurementUnit>pixel</MeasurementUnit>
     <sourceImageInformation>
@@ -7,6 +7,9 @@
     </sourceImageInformation>
   </Description>
   <Styles/>
+  <Tags>
+    <LayoutTag ID="layouttag-paragraph" LABEL="paragraph"/>
+  </Tags>
   <Layout>
     <Page ID="None" PHYSICAL_IMG_NR="0" WIDTH="2621" HEIGHT="2621">
       <TopMargin VPOS="0" HPOS="0" HEIGHT="0" WIDTH="0"/>
@@ -15,13 +18,33 @@
       <BottomMargin VPOS="0" HPOS="0" HEIGHT="0" WIDTH="0"/>
       <PrintSpace>
         <TextBlock ID="r0" HEIGHT="2130" WIDTH="1546" HPOS="190" VPOS="366" TAGREFS="layouttag-paragraph" IDNEXT="r1">
+          <Shape>
+            <Polygon POINTS="190,366 190,2496 1736,2496 1736,366"/>
+          </Shape>
           <TextLine ID="r0-dummy-TextLine" HEIGHT="2130" WIDTH="1546" HPOS="190" VPOS="366">
-            <String ID="r0-dummy-TextLine-dummy-Word" HEIGHT="2130" WIDTH="1546" HPOS="190" VPOS="366" CONTENT="übrigens denken, dass Sie aus diesem Vorfall Sich nur das Gute ziehen würden, wie es denn auch geschehen. Es freut mich, dass Sie Sich so dem trockenen Studium hingegeben, es muss eben auch sein, und trägt für später die schönsten Früchte. Was helfen die schönen, poetischen Gedanken, wenn man sie nicht zu behandeln weiss, die Instrumente Alle, wenn man nicht versteht sie mit Maass anzuwenden – damit erdrückt man seine schönsten Gedanken, macht sie ungeniessbar. Dies empfand ich namentlich auch bei Ihren Gesangssachen, die, so innig gedacht sie waren, gesungen unmöglich einen erquicklichen Eindruck machen konnten. Ich denke, das haben Sie jetzt auch eingesehen. "/>
+            <Shape>
+              <Polygon POINTS="190,366 190,2496 1736,2496 1736,366"/>
+            </Shape>
+            <String ID="r0-dummy-TextLine-dummy-Word" HEIGHT="2130" WIDTH="1546" HPOS="190" VPOS="366" CONTENT="übrigens denken, dass Sie aus diesem Vorfall Sich nur das Gute ziehen würden, wie es denn auch geschehen. Es freut mich, dass Sie Sich so dem trockenen Studium hingegeben, es muss eben auch sein, und trägt für später die schönsten Früchte. Was helfen die schönen, poetischen Gedanken, wenn man sie nicht zu behandeln weiss, die Instrumente Alle, wenn man nicht versteht sie mit Maass anzuwenden – damit erdrückt man seine schönsten Gedanken, macht sie ungeniessbar. Dies empfand ich namentlich auch bei Ihren Gesangssachen, die, so innig gedacht sie waren, gesungen unmöglich einen erquicklichen Eindruck machen konnten. Ich denke, das haben Sie jetzt auch eingesehen. ">
+              <Shape>
+                <Polygon POINTS="190,366 190,2496 1736,2496 1736,366"/>
+              </Shape>
+            </String>
           </TextLine>
         </TextBlock>
         <TextBlock ID="r1" HEIGHT="2403" WIDTH="1636" HPOS="120" VPOS="83" TAGREFS="layouttag-paragraph">
+          <Shape>
+            <Polygon POINTS="1753,83 1756,340 180,336 180,2486 120,2483 123,120 583,120 1116,93"/>
+          </Shape>
           <TextLine ID="r1-dummy-TextLine" HEIGHT="2403" WIDTH="1636" HPOS="120" VPOS="83">
-            <String ID="r1-dummy-TextLine-dummy-Word" HEIGHT="2403" WIDTH="1636" HPOS="120" VPOS="83" CONTENT="Wie weit sind Sie mit Ihrem Sextett? Waren Sie schon in Paris? Was haben Sie dort Musikalisches erlebt? – Meinen Bruder haben Sie wohl im Harz gesehen? "/>
+            <Shape>
+              <Polygon POINTS="1753,83 1756,340 180,336 180,2486 120,2483 123,120 583,120 1116,93"/>
+            </Shape>
+            <String ID="r1-dummy-TextLine-dummy-Word" HEIGHT="2403" WIDTH="1636" HPOS="120" VPOS="83" CONTENT="Wie weit sind Sie mit Ihrem Sextett? Waren Sie schon in Paris? Was haben Sie dort Musikalisches erlebt? – Meinen Bruder haben Sie wohl im Harz gesehen? ">
+              <Shape>
+                <Polygon POINTS="1753,83 1756,340 180,336 180,2486 120,2483 123,120 583,120 1116,93"/>
+              </Shape>
+            </String>
           </TextLine>
         </TextBlock>
       </PrintSpace>

bertsky commented 3 years ago

@kba this is ready to merge AFAICT. (Would be nice to have automatic schema validation against the different versions in your testset, but I'll leave that to your diligent hands :-)

Erikmitk commented 3 years ago

@bertsky For --alto-version 2.0 you still have to change the namespace. See your diff:

xmlns="http://www.loc.gov/standards/alto/ns-v4#" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v4# http://www.loc.gov/standards/alto/v2/alto-2-0.xsd"

The namespace is still ns-v4# but has to be ns-v2#.
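
One way to keep the two from diverging is to derive both the namespace name and the schema URL from the same version string. The sketch below assumes a hypothetical helper `alto_root_attributes`; the URL pattern is generalized from the two schema URLs in the diff above and is an assumption for any other version.

```python
def alto_root_attributes(version):
    """Return xmlns and xsi:schemaLocation values for a version like "2.0".

    Hypothetical helper: the URL pattern is inferred from the schema URLs
    shown in the diff (alto-2-0.xsd, alto-4-1.xsd), not from a specification.
    """
    major, minor = version.split(".")
    namespace = f"http://www.loc.gov/standards/alto/ns-v{major}#"
    xsd = f"http://www.loc.gov/standards/alto/v{major}/alto-{major}-{minor}.xsd"
    # xsi:schemaLocation holds whitespace-separated pairs:
    # the namespace name first, then the location of its schema.
    return {"xmlns": namespace, "xsi:schemaLocation": f"{namespace} {xsd}"}
```

Because the major version feeds both values, a request for 2.0 can no longer end up with an ns-v4# namespace next to a v2 schema URL.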

Erikmitk commented 3 years ago

Additionally, I tried to validate v2 files you generated and xmllint gives me more errors. I guess you missed these because your namespace was still v4.

PrintSpace is missing required attributes:

00000003.xml:16: element PrintSpace: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v2#}PrintSpace': The attribute 'HEIGHT' is required but missing.
00000003.xml:16: element PrintSpace: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v2#}PrintSpace': The attribute 'WIDTH' is required but missing.
00000003.xml:16: element PrintSpace: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v2#}PrintSpace': The attribute 'HPOS' is required but missing.
00000003.xml:16: element PrintSpace: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v2#}PrintSpace': The attribute 'VPOS' is required but missing.

TAGREFS are not allowed:

00000003.xml:17: element TextBlock: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v2#}TextBlock', attribute 'TAGREFS': The attribute 'TAGREFS' is not allowed.
00000003.xml:22: element TextBlock: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v2#}TextBlock', attribute 'TAGREFS': The attribute 'TAGREFS' is not allowed.

bertsky commented 3 years ago

The namespace is still ns-v4# but has to be ns-v2#.

Indeed, I forgot to adapt the namespace name as well. Thanks for spotting!

TAGREFS are not allowed:

00000003.xml:17: element TextBlock: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v2#}TextBlock', attribute 'TAGREFS': The attribute 'TAGREFS' is not allowed.

That one slipped through, sorry. Late at work!

Additionally, I tried to validate v2 files you generated and xmllint gives me more errors. I guess you missed these because your namespace was still v4.

PrintSpace is missing required attributes:

00000003.xml:16: element PrintSpace: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v2#}PrintSpace': The attribute 'HEIGHT' is required but missing.

That's harder. This was already broken before. @kba your dummy_printspace case did not set the bbox attributes. But I wonder why you equate Border with PrintSpace by default. Looking at the definitions/descriptions of PAGE and ALTO, their respective PrintSpace definitions are identical. So it should be Border, not PrintSpace, acting as the last resort.
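
For the missing bbox attributes, one option is to derive them from a polygon. The sketch below assumes a hypothetical helper `bbox_from_points` and is not the project's actual code. The sample input is the first TextBlock polygon from the diff above, and the result agrees with the attributes shown there.

```python
def bbox_from_points(points):
    """Derive ALTO bounding-box attributes from a POINTS string "x,y x,y ...".

    Hypothetical helper for illustration; assumes integer coordinates.
    """
    coords = [tuple(int(n) for n in pair.split(",")) for pair in points.split()]
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return {
        "HPOS": min(xs),              # left edge
        "VPOS": min(ys),              # top edge
        "WIDTH": max(xs) - min(xs),
        "HEIGHT": max(ys) - min(ys),
    }


# First TextBlock (r0) from the diff:
print(bbox_from_points("190,366 190,2496 1736,2496 1736,366"))
# {'HPOS': 190, 'VPOS': 366, 'WIDTH': 1546, 'HEIGHT': 2130}
```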

bertsky commented 3 years ago

@Erikmitk I fixed the three issues you found.

kba commented 3 years ago

which will parse as both int and float when written as ALTO

True, since xs:int and xs:float both derive from xs:decimal which allows decimal points, this should not be a problem.

bertsky commented 3 years ago

which will parse as both int and float when written as ALTO

True, since xs:int and xs:float both derive from xs:decimal which allows decimal points, this should not be a problem.

No, it works because we simply never generate decimal points. xs:int does not allow decimal points lexically (in literals), despite being derived semantically from xs:decimal, which does.
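
The lexical point can be illustrated with two simplified patterns. These regular expressions are an approximation written for this example: they ignore the value range of xs:int and assume surrounding whitespace is already stripped.

```python
import re

# Simplified lexical forms (assumption: approximations of the XML Schema
# datatypes, without range checks or whitespace handling).
XS_INT = re.compile(r"[+-]?[0-9]+")
XS_FLOAT = re.compile(
    r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?|-?INF|NaN"
)


def is_valid_literal(pattern, text):
    """Return True if the whole string matches the given lexical pattern."""
    return pattern.fullmatch(text) is not None


for literal in ("2621", "2621.0"):
    print(literal, is_valid_literal(XS_INT, literal), is_valid_literal(XS_FLOAT, literal))
# 2621 True True
# 2621.0 False True
```

A literal without a decimal point is accepted by both patterns, so output that never contains one is valid whichever of the two types the target schema uses.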

bertsky commented 3 years ago

Happy to announce that page-to-alto --alto-version 2.0 --no-check-border --dummy-textline --dummy-word does work well with DFG Viewer! (It even tolerates HTML-escaped newlines in the @CONTENT. This looks even better than rendering of discrete TextLines – but hopefully Kitodo.Presentation will become better at the latter.)