Closed bertsky closed 3 years ago
But I'll probably add a few more commits to allow selecting a different (older) ALTO version – if you don't mind @kba?
I do not :) Just give me the heads-up when it's ready.
There's a problem though: some breaking schema changes of the past have been reflected neither in a new namespace name nor in a new namespace version. Example:
Version 2.1 includes the following changes:
- Page and BlockType element HEIGHT, WIDTH, HPOS, VPOS attribute types changed to xsd:float from xsd:int.
Background BTW: https://github.com/kitodo/kitodo-presentation/issues/488
I see no other way than to do a conditional int(value) (or whatever the case may be), depending on the target version, for these cases :/
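For illustration, a minimal sketch of what such a version-dependent conversion could look like. The helper names `parse_version` and `coord_literal` are hypothetical and not taken from page-to-alto; the only fact assumed from the thread is that HEIGHT, WIDTH, HPOS and VPOS changed from xsd:int to xsd:float in ALTO 2.1.

```python
from decimal import Decimal


def parse_version(version):
    """Turn a version string such as '2.1' into a comparable tuple (2, 1)."""
    return tuple(int(part) for part in version.split("."))


def coord_literal(value, alto_version):
    """Format a coordinate for the given ALTO target version (hypothetical helper).

    Assumption: before ALTO 2.1 the bounding-box attributes are typed
    xsd:int, so the value is rounded to an integer literal there.
    From 2.1 on, the value is written out unchanged.
    """
    number = Decimal(str(value))
    if parse_version(alto_version) < (2, 1):
        return str(int(number.to_integral_value()))
    return str(number)
```

Values that are already integers pass through unchanged for every target version, so the conditional only matters when the source carries fractional coordinates.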
Yes. Luckily we decided to do all that in Python – and not impoverished XSLT
Thinking about it: in this direction, it's not a problem at all: we always get strings that can only be interpreted as int from PAGE, which will parse as both int and float when written as ALTO.
I'll probably add a few more commits to allow selecting a different (older) ALTO version
Done.
diff -u <(page-to-alto --alto-version 2.0 --no-check-border --log-level OFF 00000012.page.xml) <(page-to-alto --alto-version 4.1 --no-check-border --log-level OFF 00000012.page.xml)
--- /dev/fd/63 2021-06-17 19:25:15.685117779 +0200
+++ /dev/fd/62 2021-06-17 19:25:15.685117779 +0200
@@ -1,5 +1,5 @@
<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
-<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.loc.gov/standards/alto/ns-v4#" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v4# http://www.loc.gov/standards/alto/v2/alto-2-0.xsd">
+<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.loc.gov/standards/alto/ns-v4#" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v4# http://www.loc.gov/standards/alto/v4/alto-4-1.xsd" SCHEMAVERSION="4.1">
<Description>
<MeasurementUnit>pixel</MeasurementUnit>
<sourceImageInformation>
@@ -7,6 +7,9 @@
</sourceImageInformation>
</Description>
<Styles/>
+ <Tags>
+ <LayoutTag ID="layouttag-paragraph" LABEL="paragraph"/>
+ </Tags>
<Layout>
<Page ID="None" PHYSICAL_IMG_NR="0" WIDTH="2621" HEIGHT="2621">
<TopMargin VPOS="0" HPOS="0" HEIGHT="0" WIDTH="0"/>
@@ -15,13 +18,33 @@
<BottomMargin VPOS="0" HPOS="0" HEIGHT="0" WIDTH="0"/>
<PrintSpace>
<TextBlock ID="r0" HEIGHT="2130" WIDTH="1546" HPOS="190" VPOS="366" TAGREFS="layouttag-paragraph" IDNEXT="r1">
+ <Shape>
+ <Polygon POINTS="190,366 190,2496 1736,2496 1736,366"/>
+ </Shape>
<TextLine ID="r0-dummy-TextLine" HEIGHT="2130" WIDTH="1546" HPOS="190" VPOS="366">
- <String ID="r0-dummy-TextLine-dummy-Word" HEIGHT="2130" WIDTH="1546" HPOS="190" VPOS="366" CONTENT="übrigens denken, dass Sie aus diesem Vorfall Sich nur das Gute ziehen würden, wie es denn auch geschehen. Es freut mich, dass Sie Sich so dem trockenen Studium hingegeben, es muss eben auch sein, und trägt für später die schönsten Früchte. Was helfen die schönen, poetischen Gedanken, wenn man sie nicht zu behandeln weiss, die Instrumente Alle, wenn man nicht versteht sie mit Maass anzuwenden – damit erdrückt man seine schönsten Gedanken, macht sie ungeniessbar. Dies empfand ich namentlich auch bei Ihren Gesangssachen, die, so innig gedacht sie waren, gesungen unmöglich einen erquicklichen Eindruck machen konnten. Ich denke, das haben Sie jetzt auch eingesehen. "/>
+ <Shape>
+ <Polygon POINTS="190,366 190,2496 1736,2496 1736,366"/>
+ </Shape>
+ <String ID="r0-dummy-TextLine-dummy-Word" HEIGHT="2130" WIDTH="1546" HPOS="190" VPOS="366" CONTENT="übrigens denken, dass Sie aus diesem Vorfall Sich nur das Gute ziehen würden, wie es denn auch geschehen. Es freut mich, dass Sie Sich so dem trockenen Studium hingegeben, es muss eben auch sein, und trägt für später die schönsten Früchte. Was helfen die schönen, poetischen Gedanken, wenn man sie nicht zu behandeln weiss, die Instrumente Alle, wenn man nicht versteht sie mit Maass anzuwenden – damit erdrückt man seine schönsten Gedanken, macht sie ungeniessbar. Dies empfand ich namentlich auch bei Ihren Gesangssachen, die, so innig gedacht sie waren, gesungen unmöglich einen erquicklichen Eindruck machen konnten. Ich denke, das haben Sie jetzt auch eingesehen. ">
+ <Shape>
+ <Polygon POINTS="190,366 190,2496 1736,2496 1736,366"/>
+ </Shape>
+ </String>
</TextLine>
</TextBlock>
<TextBlock ID="r1" HEIGHT="2403" WIDTH="1636" HPOS="120" VPOS="83" TAGREFS="layouttag-paragraph">
+ <Shape>
+ <Polygon POINTS="1753,83 1756,340 180,336 180,2486 120,2483 123,120 583,120 1116,93"/>
+ </Shape>
<TextLine ID="r1-dummy-TextLine" HEIGHT="2403" WIDTH="1636" HPOS="120" VPOS="83">
- <String ID="r1-dummy-TextLine-dummy-Word" HEIGHT="2403" WIDTH="1636" HPOS="120" VPOS="83" CONTENT="Wie weit sind Sie mit Ihrem Sextett? Waren Sie schon in Paris? Was haben Sie dort Musikalisches erlebt? – Meinen Bruder haben Sie wohl im Harz gesehen? "/>
+ <Shape>
+ <Polygon POINTS="1753,83 1756,340 180,336 180,2486 120,2483 123,120 583,120 1116,93"/>
+ </Shape>
+ <String ID="r1-dummy-TextLine-dummy-Word" HEIGHT="2403" WIDTH="1636" HPOS="120" VPOS="83" CONTENT="Wie weit sind Sie mit Ihrem Sextett? Waren Sie schon in Paris? Was haben Sie dort Musikalisches erlebt? – Meinen Bruder haben Sie wohl im Harz gesehen? ">
+ <Shape>
+ <Polygon POINTS="1753,83 1756,340 180,336 180,2486 120,2483 123,120 583,120 1116,93"/>
+ </Shape>
+ </String>
</TextLine>
</TextBlock>
</PrintSpace>
@kba this is ready to merge AFAICT. (Would be nice to have automatic schema validation against the different versions in your testset, but I'll leave that to your diligent hands :-)
@bertsky For --alto-version 2.0 you still have to change the namespace. See your diff:
xmlns="http://www.loc.gov/standards/alto/ns-v4#" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v4# http://www.loc.gov/standards/alto/v2/alto-2-0.xsd"
The namespace is still ns-v4# but has to be ns-v2#.
Additionally, I tried to validate the v2 files you generated and xmllint gives me more errors. I guess you missed these because your namespace was still v4.
PrintSpace is missing required attributes:
00000003.xml:16: element PrintSpace: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v2#}PrintSpace': The attribute 'HEIGHT' is required but missing.
00000003.xml:16: element PrintSpace: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v2#}PrintSpace': The attribute 'WIDTH' is required but missing.
00000003.xml:16: element PrintSpace: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v2#}PrintSpace': The attribute 'HPOS' is required but missing.
00000003.xml:16: element PrintSpace: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v2#}PrintSpace': The attribute 'VPOS' is required but missing.
TAGREFS are not allowed:
00000003.xml:17: element TextBlock: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v2#}TextBlock', attribute 'TAGREFS': The attribute 'TAGREFS' is not allowed.
00000003.xml:22: element TextBlock: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v2#}TextBlock', attribute 'TAGREFS': The attribute 'TAGREFS' is not allowed.
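The two kinds of errors reported here can also be caught with a quick pre-check before running a full schema validation. The following is only a sketch using the Python standard library; `precheck_v2` is a hypothetical name, it matches elements by local name only, and it is no substitute for validating against the real XSD with xmllint.

```python
import xml.etree.ElementTree as ET

# Bounding-box attributes that the xmllint output reports as required on PrintSpace.
REQUIRED_BBOX = ("HEIGHT", "WIDTH", "HPOS", "VPOS")


def precheck_v2(xml_text):
    """Return a list of problems of the two kinds reported by xmllint (hypothetical helper)."""
    problems = []
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # ElementTree tags look like '{namespace}LocalName'; keep the local part.
        local = elem.tag.rsplit("}", 1)[-1]
        if local == "PrintSpace":
            for attr in REQUIRED_BBOX:
                if attr not in elem.attrib:
                    problems.append(f"PrintSpace: missing {attr}")
        if "TAGREFS" in elem.attrib:
            problems.append(f"{local}: TAGREFS not allowed")
    return problems
```

An empty list means that neither of the two problems was found; it does not mean the file is schema-valid.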
The namespace is still ns-v4# but has to be ns-v2#.
Indeed, I forgot to adapt the namespace name as well. Thanks for spotting!
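As a sketch of keeping the namespace and the schema location in step with the selected version: the URL patterns below are extrapolated from the two schema locations visible in the diff (alto-2-0.xsd and alto-4-1.xsd) and from the ns-v2# / ns-v4# namespaces named in this thread. The function names are hypothetical, and versions other than those two have not been checked against the published schemas.

```python
# Patterns inferred from the URLs in the diff; treat them as assumptions.
ALTO_NS_TEMPLATE = "http://www.loc.gov/standards/alto/ns-v{major}#"
ALTO_XSD_TEMPLATE = "http://www.loc.gov/standards/alto/v{major}/alto-{major}-{minor}.xsd"


def alto_namespace(version):
    """Namespace name for an ALTO version string such as '2.0'."""
    major = version.split(".")[0]
    return ALTO_NS_TEMPLATE.format(major=major)


def alto_schema_location(version):
    """Value for xsi:schemaLocation: namespace name, a space, then the XSD URL."""
    major, minor = version.split(".")[:2]
    xsd = ALTO_XSD_TEMPLATE.format(major=major, minor=minor)
    return f"{alto_namespace(version)} {xsd}"
```

Deriving both values from the same version string avoids the mismatch seen in the diff, where the v4 namespace was paired with the 2.0 schema file.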
TAGREFS are not allowed:
00000003.xml:17: element TextBlock: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v2#}TextBlock', attribute 'TAGREFS': The attribute 'TAGREFS' is not allowed.
That one slipped through, sry. Late at work!
Additionally, I tried to validate v2 files you generated and xmllint gives me more errors. I guess you missed these because your namespace was still v4.
PrintSpace is missing required attributes:
00000003.xml:16: element PrintSpace: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v2#}PrintSpace': The attribute 'HEIGHT' is required but missing.
That's harder. This was already broken before. @kba your dummy_printspace case did not set the bbox attributes. But I wonder why you equate Border with PrintSpace by default. Looking at the definitions/descriptions of PAGE and ALTO, their respective PrintSpace is identical. So it should be Border, not PrintSpace, acting as last-resort.
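For reference, the missing bbox attributes can be derived from a polygon's points, whichever PAGE element ends up supplying them. This is a sketch with a hypothetical helper name, not the converter's actual code; it assumes integer coordinates in the "x,y x,y ..." form seen in the POINTS attributes of the diff above.

```python
def bbox_from_points(points):
    """Compute ALTO bounding-box attributes from a POINTS string (hypothetical helper).

    Assumes space-separated 'x,y' pairs with integer coordinates.
    """
    coords = [tuple(int(n) for n in pair.split(",")) for pair in points.split()]
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return {
        "HPOS": min(xs),
        "VPOS": min(ys),
        "WIDTH": max(xs) - min(xs),
        "HEIGHT": max(ys) - min(ys),
    }
```

Applied to the polygon of TextBlock r0 in the diff, this gives HPOS 190, VPOS 366, WIDTH 1546 and HEIGHT 2130, which matches the attributes written there.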
@Erikmitk I fixed the three issues you found.
which will parse as both int and float when written as ALTO
True, since xs:int and xs:float both derive from xs:decimal which allows decimal points, this should not be a problem.
No, it is not a problem, but only because we simply never generate decimal points. (xs:int does not allow decimal points lexically, i.e. in literals, despite being derived semantically from xs:decimal, which does.)
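The lexical point can be made concrete with two simplified patterns. This is a sketch only: the regular expressions approximate the lexical forms of xs:int and of plain decimal notation, they ignore the value range of xs:int and the exponent and special values that xs:float additionally accepts, and the function names are hypothetical.

```python
import re

# Simplified lexical forms: optional sign and digits only for xs:int,
# optional sign and digits with an optional decimal point for decimal notation.
XS_INT = re.compile(r"[+-]?[0-9]+")
XS_DECIMAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")


def is_int_literal(text):
    """True if text has the lexical shape of an xs:int literal (range not checked)."""
    return XS_INT.fullmatch(text) is not None


def is_decimal_literal(text):
    """True if text is plain decimal notation, with or without a decimal point."""
    return XS_DECIMAL.fullmatch(text) is not None
```

A literal such as 190 passes both checks, whereas 190.0 passes only the decimal one, which is why output without decimal points is safe for both the older and the newer attribute types.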
Happy to announce that page-to-alto --alto-version 2.0 --no-check-border --dummy-textline --dummy-word does work well with DFG Viewer! (It even tolerates HTML-escaped newlines in the @CONTENT. This looks even better than rendering of discrete TextLines – but hopefully Kitodo.Presentation will become better at the latter.)
Fixes #14.