page_validator.py produces wrong concatenated text

mikegerber commented 4 years ago

In get_text(), the TextEquiv with index=1 is used if it exists. The way I read the documentation of the index attribute in the PAGE schema, it should use the one with the lowest index:

Used for sort order in case multiple TextEquivs are defined. The text content with the lowest index should be interpreted as the main text content.

The lowest possible value for index is 0, according to the schema.

mikegerber commented 4 years ago

Example workspace: actevedef_718448162.zip

ocrd workspace validate --page-coordinate-consistency off mets.xml
[...]
  <error>INCONSISTENCY in Word ID 'l1130_word0020' of file 'OUTPUT_00000024': text results 'Notarus' != concatenated 'Notaris'</error>

The problematic glyph in OUTPUT_00000024:

<pc:Glyph id="l1130_word0020_glyph0005">
    <pc:Coords points="1864,3587 1886,3587 1886,3646 1864,3646"/>
    <pc:TextEquiv index="0" conf="0.6683819890022278">
        <pc:Unicode>u</pc:Unicode>
    </pc:TextEquiv>
    <pc:TextEquiv index="1" conf="0.29328230023384094">
        <pc:Unicode>i</pc:Unicode>
    </pc:TextEquiv>

    <!-- more indexes omitted -->

</pc:Glyph>
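
To make the difference concrete, here is a minimal sketch that parses a trimmed copy of the glyph above and compares the two selection rules. It is not the code in page_validator.py: the function names are hypothetical, and the namespace URI (PAGE 2019-07-15) is an assumption, since only the `pc:` prefix is visible here.

```python
import xml.etree.ElementTree as ET

# Assumption: the pc: prefix maps to the PAGE 2019-07-15 namespace.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PAGE_NS}

GLYPH_XML = f"""
<pc:Glyph xmlns:pc="{PAGE_NS}" id="l1130_word0020_glyph0005">
    <pc:Coords points="1864,3587 1886,3587 1886,3646 1864,3646"/>
    <pc:TextEquiv index="0" conf="0.6683819890022278">
        <pc:Unicode>u</pc:Unicode>
    </pc:TextEquiv>
    <pc:TextEquiv index="1" conf="0.29328230023384094">
        <pc:Unicode>i</pc:Unicode>
    </pc:TextEquiv>
</pc:Glyph>
"""


def text_by_index1(glyph):
    """Hypothetical 'index1' rule: pick the TextEquiv whose index is exactly 1."""
    for equiv in glyph.findall("pc:TextEquiv", NS):
        if equiv.get("index") == "1":
            return equiv.findtext("pc:Unicode", namespaces=NS)
    return None


def text_by_lowest_index(glyph):
    """Rule as worded in the schema: pick the TextEquiv with the lowest index."""
    equivs = glyph.findall("pc:TextEquiv", NS)
    best = min(equivs, key=lambda equiv: int(equiv.get("index")))
    return best.findtext("pc:Unicode", namespaces=NS)


glyph = ET.fromstring(GLYPH_XML.strip())
print(text_by_index1(glyph), text_by_lowest_index(glyph))
```

With the index=1 rule the glyph contributes "i" (hence 'Notaris'), while the lowest-index rule yields "u", which matches the word-level text 'Notarus'.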

bertsky commented 4 years ago

In get_text(), the TextEquiv with index=1 is used if it exists. The way I read the documentation of the index attribute in the PAGE schema, it should use the one with the lowest index:

I concur. Thanks for reporting!

The lowest possible value for index is 0, according to the schema.

Yes, but that does not mean we have to look for index zero. Instead, we should abide by the wording of the schema more closely: taking the smallest index (whatever that may be).

Same goes for set_text BTW.

kba commented 4 years ago

I concur. Thanks for reporting!

Indeed. Cannot remember why we did it this way.

bertsky commented 4 years ago

Also, along with the fix, we should rename page_textequiv_strategy=index1 to page_textequiv_strategy=first. (There are currently no further references to index1 outside of that source file, except for the validate CLI and the respective test.)

mikegerber commented 4 years ago

Yes, but that does not mean we have to look for index zero. Instead, we should abide by the wording of the schema more closely: taking the smallest index (whatever that may be).

Absolutely. In my current implementation in ocrd_calamari there could also be missing index values (due to unrelated reasons), which should be perfectly legal.

bertsky commented 4 years ago

In my current implementation in ocrd_calamari there could also be missing index values (due to unrelated reasons), which should be perfectly legal.

Right, but what if some processors add textequiv with index and some without? Then we can get a mix. Do we only sort by index if all alternatives possess one (and use XML element ordering otherwise), or do we use the smallest index if at least one alternative does?

kba commented 4 years ago

@tboenig Can you remember why we implemented an index1 strategy instead of an index0 strategy?

kba commented 4 years ago

It does say so [in our PAGE specs]:

`@index` of the first (preferred) `<pg:TextEquiv>` must be the value 1.

I'm fairly certain I had a reason for that. Could that be the convention of Aletheia or TRANSKRIBUS?

mikegerber commented 4 years ago

In my current implementation in ocrd_calamari there could also be missing index values (due to unrelated reasons), which should be perfectly legal.

Right, but what if some processors add textequiv with index and some without? Then we can get a mix. Do we only sort by index if all alternatives possess one (and use XML element ordering otherwise), or do we use the smallest index if at least one alternative does?

I would read the schema's description "Used for sort order in case multiple TextEquivs are defined. The text content with the lowest index should be interpreted as the main text content." as saying that index should be used to define the order/precedence if there are multiple TextEquivs. If you're not using it (which seems to be legal), order is undefined.

(To be clear: With "missing index values" in my implementation I meant that there might be index="0" and index="2" but no index="1" in some cases, but there is always a unique index value.)

mikegerber commented 4 years ago

It does say so [in our PAGE specs]:
`@index` of the first (preferred) `<pg:TextEquiv>` must be the value 1.
I'm fairly certain I had a reason for that. Could that be the convention of Aletheia or TRANSKRIBUS?

I have some files here that were created using Aletheia; they only have "solo" TextEquivs with no index attributes.

bertsky commented 4 years ago

I would read the schema's description ... that index should be used to define the order/precedence if there are multiple TextEquivs. If you're not using it (which seems to be legal), order is undefined.

Of course, TextEquivs without index would render the order undefined by the PAGE spec, but I was asking for opinions on what our implementation should prefer under these circumstances. As mentioned, we can easily get a mix of index/non-index textequivs.

but there is always a unique index value

No there is not: That attribute is optional, so the generateDS DOM (correctly) parses this as None when absent.
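
As a sketch of one of the two options raised above (use the smallest index if at least one alternative has one, otherwise fall back to XML element order), selection could look like the following. The `TextEquiv` dataclass is a hypothetical stand-in for the generateDS class, and this is not necessarily the policy that core adopted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextEquiv:
    """Hypothetical stand-in for the generateDS TextEquiv class."""
    text: str
    index: Optional[int] = None  # optional attribute, None when absent


def first_textequiv(equivs: List[TextEquiv]) -> Optional[TextEquiv]:
    """Return the preferred alternative, tolerating gaps and missing indexes."""
    if not equivs:
        return None
    indexed = [equiv for equiv in equivs if equiv.index is not None]
    if indexed:
        # Smallest index wins, whatever its value (0, 1, or anything else).
        return min(indexed, key=lambda equiv: equiv.index)
    # No alternative carries an index: fall back to document order.
    return equivs[0]


gapped = [TextEquiv("i", 2), TextEquiv("u", 0)]
mixed = [TextEquiv("x"), TextEquiv("i", 3), TextEquiv("u", 1)]
unindexed = [TextEquiv("a"), TextEquiv("b")]
print(first_textequiv(gapped).text, first_textequiv(mixed).text, first_textequiv(unindexed).text)
```

Note that under this policy an unindexed alternative never wins against an indexed one, which is a design choice rather than something the PAGE schema prescribes.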

cneud commented 4 years ago

could that be the convention of Aletheia or TRANSKRIBUS?

The convention in Aletheia is top-to-bottom starting from top-left bounding box based on LowLevelTextContainerImpl.java | GeometricObjectPositionComparator.java.

Unless a sequence is explicitly defined via readingOrder, the increment of index* is disregarded.

mikegerber commented 4 years ago

could that be the convention of Aletheia or TRANSKRIBUS?

The convention in Aletheia is top-to-bottom starting from top-left bounding box based on LowLevelTextContainerImpl.java | GeometricObjectPositionComparator.java.

Unless a sequence is explicitly defined via readingOrder, the increment of index* is disregarded.

That's about the order of different segments, here it's about order of multiple TextEquivs for the same segment, e.g. multiple alternative predictions for the same glyph. For example this "u" where ocrd_calamari predicted an "i" alternatively, with less confidence:

<pc:Glyph id="l1130_word0020_glyph0005">
    <pc:Coords points="1864,3587 1886,3587 1886,3646 1864,3646"/>
    <pc:TextEquiv index="0" conf="0.6683819890022278">
        <pc:Unicode>u</pc:Unicode>
    </pc:TextEquiv>
    <pc:TextEquiv index="1" conf="0.29328230023384094">
        <pc:Unicode>i</pc:Unicode>
    </pc:TextEquiv>

    <!-- more indexes omitted -->

</pc:Glyph>

(I would prefer the word "precedence" over "order" as it seems less confusing.)

cneud commented 4 years ago

Well, while for String the index attribute is optional, for GraphemeBaseType it is required with <restriction base="int"> and <minInclusive value="0"/>. IIRC alternative predictions have only ever been considered on the character/glyph/grapheme level.

mikegerber commented 4 years ago

I chose to adhere to the stricter¹ OCR-D convention of starting with 1 for now (https://github.com/OCR-D/ocrd_calamari/commit/0f9c94e7dc4f4577ec1465a1cb0613d310941728).

¹ I also think it is needlessly stricter but I don't care that much to argue about it any longer :)

bertsky commented 4 years ago

It does say so [in our PAGE specs]:
`@index` of the first (preferred) `<pg:TextEquiv>` must be the value 1.
I'm fairly certain I had a reason for that. Could that be the convention of Aletheia or TRANSKRIBUS?

I doubt that for Aletheia: text variants are definitely 0-indexed in prima-core-libs.

As for Transkribus, I could not find usage of TextEquiv/@index in our (various versions of) textual GT anywhere at all. (Not sure if that says anything).

Regarding the question where the idea of index1 (as opposed to index0 or first) originated, here is my reconstruction:

initially, the textual inconsistency freedom/problem of PAGE was overlooked in the spec
errors in the GT prompted @tboenig and me to ask for a spec refinement and core validator: https://github.com/OCR-D/assets/issues/16 – subsequent discussion shifted attention to the problem of textual inconsistency in the presence of alternatives
index1 principle first appeared in the initial formulation in https://github.com/OCR-D/spec/pull/82 and implementation in https://github.com/OCR-D/core/pull/223 – there were discussions on alternative strategies (subsumption check, best check) and on interfaces, but no one challenged index==1

mikegerber commented 4 years ago

I think #432 was closed by mistake?

kba commented 4 years ago

Sorry about that, #432 is now in master.

kba commented 4 years ago

@mikegerber This issue should have been fixed by #432 which is now in master. The initially mentioned error does not happen anymore, though there are still errors, also for that line:

<report>
<!-- ... -->
  <error>INCONSISTENCY in TextLine ID 'l1130' of file 'OUTPUT_00000024': text results 'Der Schnltheiß zu Oberrod, der Wirth Krebs und Hr. Notarus Tribert ſind bereits' != concatenated 'Der Schnltheiß zu Oberrod , der Wirth Krebs und Hr . Notarus Tribert ſind bereits'</error>
<!-- ... -->
</report>

mikegerber commented 4 years ago

  <error>INCONSISTENCY in TextLine ID 'l1130' of file 'OUTPUT_00000024': text results 'Der Schnltheiß zu Oberrod, der Wirth Krebs und Hr. Notarus Tribert ſind bereits' != concatenated 'Der Schnltheiß zu Oberrod , der Wirth Krebs und Hr . Notarus Tribert ſind bereits'</error>

This is an actual problem in the XML (generated with ocrd_calamari before 0.0.6), as the , and . are separate words and therefore concatenate wrongly (= not according to OCR-D PAGE specs).
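
A small illustration of why separate punctuation words trip the check, assuming (as the report suggests) that the validator joins the words of a line with single spaces. The `concatenate` helper and the word lists are made up for this example.

```python
def concatenate(words):
    """Join word texts with single spaces (assumed concatenation rule)."""
    return " ".join(words)


line_text = "Hr. Notarus Tribert"

# Punctuation segmented as its own word, as in the problematic output.
words_separate = ["Hr", ".", "Notarus", "Tribert"]
# Punctuation kept attached to the preceding word.
words_attached = ["Hr.", "Notarus", "Tribert"]

print(repr(concatenate(words_separate)))
print(repr(concatenate(words_attached)))
```

The first list concatenates to 'Hr . Notarus Tribert', which differs from the line text, while the second reproduces the line text exactly.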

kba commented 4 years ago

Thanks, closing then.

mikegerber commented 4 years ago

However, I do still get the "Notaris" error, among others, with master:

<error>INCONSISTENCY in Word ID 'l1130_word0020' of file 'OUTPUT_00000024': text results 'Notarus' != concatenated 'Notaris'</error>

@kba Could you please upload the full report you get from ocrd workspace validate --skip dimension --page-coordinate-consistency off? I might have a problem with my environment and be testing the wrong version.

kba commented 4 years ago

I always make sure to run make install PIP_INSTALL="pip install -e" in core to make sure core has been installed "editable".

ocrd workspace validate --skip dimension --page-coordinate-consistency off

17:32:59.037 INFO ocrd.page_validator - Validating input file 'OCR-D-GT-PAGE_00000024'
17:32:59.481 INFO ocrd.page_validator - Validating input file 'OUTPUT_00000024'
<report valid="false">
  <error>INCONSISTENCY in TextLine ID 'l2159' of file 'OCR-D-GT-PAGE_00000024': text results 'eine ſo große Verwandſaﬀt, daß ſo gar in legibus einem einigen Verbreen⸗ wie der Conſpirationi &' != concatenated 'eine ſo große einigen der Conſpirationi & wie in legibus einem Verwandſaﬀt, daß ſo gar'</error>
  <error>INCONSISTENCY in TextLine ID 'l19' of file 'OUTPUT_00000024': text results '[22]' != concatenated '[ 22 ]'</error>
  <error>INCONSISTENCY in TextLine ID 'l32' of file 'OUTPUT_00000024': text results '[22' != concatenated '[ 22'</error>
  <error>INCONSISTENCY in TextLine ID 'l1250' of file 'OUTPUT_00000024': text results 'ein gleiches vorgegeben, und ſo gar ſehr viele mahle gegen alle menſchliche Moͤglichkeit mit Gewalt tor-' != concatenated 'ein gleiches vorgegeben , und ſo gar ſehr viele mahle gegen alle menſchliche Moͤglichkeit mit Gewalt tor -'</error>
  <error>INCONSISTENCY in TextLine ID 'l108' of file 'OUTPUT_00000024': text results 'ciret worden zu ſeyn, behaupten will, mithin nebſt dem Bredeka, welcher (§. 28. 29.) ſich in allen ſeinen' != concatenated 'ciret worden zu ſeyn , behaupten will , mithin nebſt dem Bredeka , welcher ( § . 28 . 29 . ) ſich in allen ſeinen'</error>
  <error>INCONSISTENCY in TextLine ID 'l212' of file 'OUTPUT_00000024': text results 'Auſſagen wiederſprochen, mit der Pœna talſi um do gewiſſer zu belegen iſt, da' != concatenated 'Auſſagen wiederſprochen , mit der Pœna talſi um do gewiſſer zu belegen iſt , da'</error>
  <error>INCONSISTENCY in TextLine ID 'l294' of file 'OUTPUT_00000024': text results 'ſecund. Fatin. Tit. 9. qu. 6. p . 320.' != concatenated 'ſecund . Fatin . Tit . 9 . qu . 6 . p . 320 .'</error>
  <error>INCONSISTENCY in TextLine ID 'l361' of file 'OUTPUT_00000024': text results 'die Klage ſo wohl als das Zeignuͤß vos falſch und erdichtet muͤßen gehalten werden.' != concatenated 'die Klage ſo wohl als das Zeignuͤß vos falſch und erdichtet muͤßen gehalten werden .'</error>
  <error>INCONSISTENCY in TextLine ID 'l446' of file 'OUTPUT_00000024': text results 'S. 35) So viel die von der Inquiſitin' != concatenated 'S . 35 ) So viel die von der Inquiſitin'</error>
  <error>INCONSISTENCY in TextLine ID 'l2048' of file 'OUTPUT_00000024': text results 'rath mit einer Pœna fiſcali angeſehen worden, und ſolche durch des Hrn. Graffen von Koͤnigsfeld Vor⸗' != concatenated 'rath mit einer Pœna fiſcali angeſehen worden , und ſolche durch des Hrn . Graffen von Koͤnigsfeld Vor ⸗'</error>
  <error>INCONSISTENCY in TextLine ID 'l99' of file 'OUTPUT_00000024': text results 'ſpruch, nur aus Gnaden nachgelaſſen erhalten.' != concatenated 'ſpruch , nur aus Gnaden nachgelaſſen erhalten .'</error>
  <error>INCONSISTENCY in TextLine ID 'l149' of file 'OUTPUT_00000024': text results 'Sondern man hat auch dieſen 4. Wochen lang alle Abend bey der Inquiſitin gantz allein gelaſſen.' != concatenated 'Sondern man hat auch dieſen 4 . Wochen lang alle Abend bey der Inquiſitin gantz allein gelaſſen .'</error>
  <error>INCONSISTENCY in TextLine ID 'l240' of file 'OUTPUT_00000024': text results 'Binnen welcher gantzer Zeit der Schreiber Bredeka beſtaͤndig bey Jhme geweſen, und ſich in' != concatenated 'Binnen welcher gantzer Zeit der Schreiber Bredeka beſtaͤndig bey Jhme geweſen , und ſich in'</error>
  <error>INCONSISTENCY in TextLine ID 'l328' of file 'OUTPUT_00000024': text results 'der am 13ten Octobr. a. c. in fudicio gegen ſeinen geweſenen Hrn. intröducirter Appellation deſſen Bey⸗' != concatenated 'der am 13ten Octobr . a . c . in fudicio gegen ſeinen geweſenen Hrn . intröducirter Appellation deſſen Bey ⸗'</error>
  <error>INCONSISTENCY in TextLine ID 'l431' of file 'OUTPUT_00000024': text results 'raths bedienet hat;' != concatenated 'raths bedienet hat ;'</error>
  <error>INCONSISTENCY in TextLine ID 'l466' of file 'OUTPUT_00000024': text results '.z) Dabenebenſt iſt der Schreiber binnen dieſer gantzen Zeit auf freyem Fuß geblieben, und' != concatenated '. z ) Dabenebenſt iſt der Schreiber binnen dieſer gantzen Zeit auf freyem Fuß geblieben , und'</error>
  <error>INCONSISTENCY in TextLine ID 'l563' of file 'OUTPUT_00000024': text results 'hat nicht nur durch ſeinen Coſuletten, ſondern auch, weilen der Inquiſitii ſelbſten in Jhrem Gefaͤngnuͤß' != concatenated 'hat nicht nur durch ſeinen Coſuletten , ſondern auch , weilen der Inquiſitii ſelbſten in Jhrem Gefaͤngnuͤß'</error>
  <error>INCONSISTENCY in TextLine ID 'l663' of file 'OUTPUT_00000024': text results 'ſo viele Freyheit gelaſſen worden, daß ſie frembden Beſuch von Jhren Anverwandten ohngehindert em⸗' != concatenated 'ſo viele Freyheit gelaſſen worden , daß ſie frembden Beſuch von Jhren Anverwandten ohngehindert em ⸗'</error>
  <error>INCONSISTENCY in TextLine ID 'l761' of file 'OUTPUT_00000024': text results 'pfangen koͤnnen, durch andere Perſonen ſich mit ihr uͤber alles, was Er oder ſie dereinſten zu ſagen hat⸗' != concatenated 'pfangen koͤnnen , durch andere Perſonen ſich mit ihr uͤber alles , was Er oder ſie dereinſten zu ſagen hat ⸗'</error>
  <error>INCONSISTENCY in TextLine ID 'l868' of file 'OUTPUT_00000024': text results 'ten, vereinigen koͤnnen, immaſſen der Hofrath Senckenberg, als dieſer am 1. Octob. das Officium Jcdi-' != concatenated 'ten , vereinigen koͤnnen , immaſſen der Hofrath Senckenberg , als dieſer am 1 . Octob . das Officium Jcdi -'</error>
  <error>INCONSISTENCY in TextLine ID 'l965' of file 'OUTPUT_00000024': text results 'cis gegen ihn zur ſatisfactione publica excitirete, vor ſich aber ratione injuriarum demſelben (eben § præced.' != concatenated 'cis gegen ihn zur ſatisfactione publica excitirete , vor ſich aber ratione injuriarum demſelben ( eben § præced .'</error>
  <error>INCONSISTENCY in TextLine ID 'l1071' of file 'OUTPUT_00000024': text results 'geſagter maſſen) eine Leibes⸗Straͤffe aufzulegen bate, vor allen Dingen, gleich als ob Er ein peinlicher' != concatenated 'geſagter maſſen ) eine Leibes ⸗ Straͤffe aufzulegen bate , vor allen Dingen , gleich als ob Er ein peinlicher'</error>
  <error>INCONSISTENCY in TextLine ID 'l1179' of file 'OUTPUT_00000024': text results 'Anklaͤger waͤre, und ohne indiciis denuneiiret haͤtte,' != concatenated 'Anklaͤger waͤre , und ohne indiciis denuneiiret haͤtte ,'</error>
  <error>INCONSISTENCY in TextLine ID 'l1254' of file 'OUTPUT_00000024': text results 'deauf dieſem Fall inioid. Cr. art. 12. vom peinlichen Klaͤger erforderte' != concatenated 'deauf dieſem Fall inioid . Cr . art . 12 . vom peinlichen Klaͤger erforderte'</error>
  <error>INCONSISTENCY in TextLine ID 'l1326' of file 'OUTPUT_00000024': text results 'Caution zu leiſten, auferleget worden, da man ſich doch ex Actis (vid. §. 31. haͤtte erſehen koͤnnen, daß' != concatenated 'Caution zu leiſten , auferleget worden , da man ſich doch ex Actis ( vid . § . 31 . haͤtte erſehen koͤnnen , daß'</error>
  <error>INCONSISTENCY in TextLine ID 'l1427' of file 'OUTPUT_00000024': text results 'hier von einer ohnzweiffentlichen und offentlichen Miſſethat die Frage obwalte, wobey dem Richter' != concatenated 'hier von einer ohnzweiffentlichen und offentlichen Miſſethat die Frage obwalte , wobey dem Richter'</error>
  <error>INCONSISTENCY in TextLine ID 'l1523' of file 'OUTPUT_00000024': text results 'in O. Cr. art. 16.' != concatenated 'in O . Cr . art . 16 .'</error>
  <error>INCONSISTENCY in TextLine ID 'l1558' of file 'OUTPUT_00000024': text results 'in gantz anderer ex Officio anzuſtellender Proceß vorgeſchrieben wird und allenfalls, wenn uͤber die' != concatenated 'in gantz anderer ex Officio anzuſtellender Proceß vorgeſchrieben wird und allenfalls , wenn uͤber die'</error>
  <error>INCONSISTENCY in TextLine ID 'l1654' of file 'OUTPUT_00000024': text results 'inlufficientia Iidiciorum ein Zweiffel obgewaltet haͤtte,' != concatenated 'inlufficientia Iidiciorum ein Zweiffel obgewaltet haͤtte ,'</error>
  <error>INCONSISTENCY in TextLine ID 'l1722' of file 'OUTPUT_00000024': text results 'ſeeund. O Cr. art. 7.' != concatenated 'ſeeund . O Cr . art . 7 .'</error>
  <error>INCONSISTENCY in TextLine ID 'l1758' of file 'OUTPUT_00000024': text results 'auswaͤrtige Rechtsgelaͤhrte haͤtten muͤſſen befraget werden, anſonſten aber bey der bloßen actione Injuria-' != concatenated 'auswaͤrtige Rechtsgelaͤhrte haͤtten muͤſſen befraget werden , anſonſten aber bey der bloßen actione Injuria -'</error>
  <error>INCONSISTENCY in TextLine ID 'l1857' of file 'OUTPUT_00000024': text results 'rum dem Hofrath Senckenberg die Cautions Leiſtung um do weniger konnte auferleget werden, da ſolche' != concatenated 'rum dem Hofrath Senckenberg die Cautions Leiſtung um do weniger konnte auferleget werden , da ſolche'</error>
  <error>INCONSISTENCY in TextLine ID 'l1956' of file 'OUTPUT_00000024': text results 'auch bey der Inhafftirung der Agricola von Jhm keinesweges ware erfordert worden.' != concatenated 'auch bey der Inhafftirung der Agricola von Jhm keinesweges ware erfordert worden .'</error>
  <error>INCONSISTENCY in TextLine ID 'l2042' of file 'OUTPUT_00000024': text results '§ 34) Zwiſchen dem Crimine falſi und concuſſionis iſt' != concatenated '§ 34 ) Zwiſchen dem Crimine falſi und concuſſionis iſt'</error>
  <error>INCONSISTENCY in TextLine ID 'l2097' of file 'OUTPUT_00000024': text results 'ſec. LAUTERB. Coll. Theot. Pract. Lib. 48. Tit. 10. §. 16.' != concatenated 'ſec . LAUTERB . Coll . Theot . Pract . Lib . 48 . Tit . 10 . § . 16 .'</error>
  <error>INCONSISTENCY in TextLine ID 'l2159' of file 'OUTPUT_00000024': text results 'erne ſo große Verwandſchafft, daß ſo gar in legibus einem einigen Verrechen⸗wie der Conſpirationi &' != concatenated 'erne ſo große Verwandſchafft , daß ſo gar in legibus einem einigen Verrechen ⸗ wie der Conſpirationi &'</error>
  <error>INCONSISTENCY in TextLine ID 'l2259' of file 'OUTPUT_00000024': text results 'ſubornationi Teſtium bald dieſer bald jenet Nahme beygeleget wird.' != concatenated 'ſubornationi Teſtium bald dieſer bald jenet Nahme beygeleget wird .'</error>
  <error>INCONSISTENCY in TextLine ID 'l2330' of file 'OUTPUT_00000024': text results 'L. 2. de concuſſ I. t. der. Cornel. de fall.' != concatenated 'L . 2 . de concuſſ I . t . der . Cornel . de fall .'</error>
  <error>INCONSISTENCY in TextLine ID 'l2384' of file 'OUTPUT_00000024': text results 'Da nun der Inquiſirin dieſes Crien allſchon voͤllig erwieſen worden (. 22.) und dieſelbe, wenn fie auch' != concatenated 'Da nun der Inquiſirin dieſes Crien allſchon voͤllig erwieſen worden ( . 22 . ) und dieſelbe , wenn fie auch'</error>
  <error>INCONSISTENCY in TextLine ID 'l2482' of file 'OUTPUT_00000024': text results 'ohngeſtandenen falls zu einem wahren Zeugnuͤß ſuborniret haͤtte,' != concatenated 'ohngeſtandenen falls zu einem wahren Zeugnuͤß ſuborniret haͤtte ,'</error>
  <error>INCONSISTENCY in TextLine ID 'l2556' of file 'OUTPUT_00000024': text results 'ſec. LATERs. Coll. Theor. Pract. L. 48. T. 10. §. 8.' != concatenated 'ſec . LATERs . Coll . Theor . Pract . L . 48 . T . 10 . § . 8 .'</error>
  <error>INCONSISTENCY in TextLine ID 'l2612' of file 'OUTPUT_00000024': text results 'dennoch mit der pœna falſi, als falſum fieri curans,' != concatenated 'dennoch mit der pœna falſi , als falſum fieri curans ,'</error>
  <error>INCONSISTENCY in TextLine ID 'l2670' of file 'OUTPUT_00000024': text results 'ſec. l. 0. 6. 3. ad L. Corn. de fali.' != concatenated 'ſec . l . 0 . 6 . 3 . ad L . Corn . de fali .'</error>
  <error>INCONSISTENCY in TextLine ID 'l2714' of file 'OUTPUT_00000024': text results 'L.4. 8. C. e. 7 X. de fali.' != concatenated 'L . 4 . 8 . C . e . 7 X . de fali .'</error>
  <error>INCONSISTENCY in TextLine ID 'l25' of file 'OUTPUT_00000024': text results 'muͤßte beleget werden,/ welche dann oben (§. 3i) geſagter maſſen die Straffe der Enthauptung iſt/ wie viel⸗' != concatenated 'muͤßte beleget werden , / welche dann oben ( § . 3i ) geſagter maſſen die Straffe der Enthauptung iſt / wie viel ⸗'</error>
  <error>INCONSISTENCY in TextLine ID 'l2860' of file 'OUTPUT_00000024': text results 'mehr wird derſelben und Jhrem Complici Bredekaw dieſe Straffe angedeyhen muͤſſen, da dieſelbe extra' != concatenated 'mehr wird derſelben und Jhrem Complici Bredekaw dieſe Straffe angedeyhen muͤſſen , da dieſelbe extra'</error>
  <error>INCONSISTENCY in TextLine ID 'l2960' of file 'OUTPUT_00000024': text results 'Judicium beſtaͤndig behauptet, daß ſie der Hofrath Senckenberg mit Gewalt⸗und ſo gar it Piſtolen zu' != concatenated 'Judicium beſtaͤndig behauptet , daß ſie der Hofrath Senckenberg mit Gewalt ⸗ und ſo gar it Piſtolen zu'</error>
  <error>INCONSISTENCY in TextLine ID 'l3060' of file 'OUTPUT_00000024': text results 'ſeinem Willen gezwungen,' != concatenated 'ſeinem Willen gezwungen ,'</error>
  <error>INCONSISTENCY in TextLine ID 'l3102' of file 'OUTPUT_00000024': text results 'Protoc. Inquiſ. fol. 71. b. fol73. b. 82. a. b. fol. 23. a.' != concatenated 'Protoc . Inquiſ . fol . 71 . b . fol73 . b . 82 . a . b . fol . 23 . a .'</error>
  <error>INCONSISTENCY in TextLine ID 'l3168' of file 'OUTPUT_00000024': text results 'auch in Judicio,' != concatenated 'auch in Judicio ,'</error>
  <error>INCONSISTENCY in TextLine ID 'l50' of file 'OUTPUT_00000024': text results 'antzegebene Zeugin belanget, ſo muß zwar, ſo viel Teſt. 1. neml. des aͤltern Hx. Burgermeiſters hoch⸗' != concatenated 'antzegebene Zeugin belanget , ſo muß zwar , ſo viel Teſt . 1 . neml . des aͤltern Hx . Burgermeiſters hoch ⸗'</error>
  <error>INCONSISTENCY in TextLine ID 'l92' of file 'OUTPUT_00000024': text results 'wohlgebl. anbetrifft, der Hofrath Senckenberg zu ſeinem groͤßten Leidweeſen bekennen, daß Er dieſelbe,' != concatenated 'wohlgebl . anbetrifft , der Hofrath Senckenberg zu ſeinem groͤßten Leidweeſen bekennen , daß Er dieſelbe ,'</error>
  <error>INCONSISTENCY in TextLine ID 'l189' of file 'OUTPUT_00000024': text results '(nach Veranlaſſung§. 16. 17. 18. 19.) vor einen Inimicum angeben muͤße, woferne jedoch annoch ein Pro⸗' != concatenated '( nach Veranlaſſung § . 16 . 17 . 18 . 19 . ) vor einen Inimicum angeben muͤße , woferne jedoch annoch ein Pro ⸗'</error>
  <error>INCONSISTENCY in TextLine ID 'l287' of file 'OUTPUT_00000024': text results 'ceß gegen den Hofrath Senckenberg ſtatt haben koͤnnte, und nicht' != concatenated 'ceß gegen den Hofrath Senckenberg ſtatt haben koͤnnte , und nicht'</error>
  <error>INCONSISTENCY in TextLine ID 'l350' of file 'OUTPUT_00000024': text results 'contra Q Cr. art. 100.' != concatenated 'contra Q Cr . art . 100 .'</error>
  <error>INCONSISTENCY in TextLine ID 'l399' of file 'OUTPUT_00000024': text results 'wie ſonſten hier gewoͤhnlich, articuli impertinentes oder dergleichen Interrogatoria zugelaſſen/ auch die von' != concatenated 'wie ſonſten hier gewoͤhnlich , articuli impertinentes oder dergleichen Interrogatoria zugelaſſen / auch die von'</error>
  <error>INCONSISTENCY in TextLine ID 'l577' of file 'OUTPUT_00000024': text results 'ſec. cap. accedens 23. X. de accus.' != concatenated 'ſec . cap . accedens 23 . X . de accus .'</error>
  <error>INCONSISTENCY in TextLine ID 'l625' of file 'OUTPUT_00000024': text results 'nichr zugelaſſen wird, duͤrfften dieſelbe vielleicht um do ehender vernommen werden, weilen alles ohne⸗' != concatenated 'nichr zugelaſſen wird , duͤrfften dieſelbe vielleicht um do ehender vernommen werden , weilen alles ohne ⸗'</error>
  <error>INCONSISTENCY in TextLine ID 'l717' of file 'OUTPUT_00000024': text results 'hin. ex Originaiibus zu erweiſen ſtehet.' != concatenated 'hin . ex Originaiibus zu erweiſen ſtehet .'</error>
  <error>INCONSISTENCY in TextLine ID 'l782' of file 'OUTPUT_00000024': text results '§. 36) Was von dem Bredekaw, der Seitzin und deren Sohn zu halten, iſt oben (s. 25. 26. 27.' != concatenated '§ . 36 ) Was von dem Bredekaw , der Seitzin und deren Sohn zu halten , iſt oben ( s . 25 . 26 . 27 .'</error>
  <error>INCONSISTENCY in TextLine ID 'l875' of file 'OUTPUT_00000024': text results '28.) erinnert worden.' != concatenated '28 . ) erinnert worden .'</error>
  <error>INCONSISTENCY in TextLine ID 'l926' of file 'OUTPUT_00000024': text results 'Mein Laquays Græf darff, wann gegen mich annoch ein Proceß ſtatt hatte, mmerhin verhoͤhret' != concatenated 'Mein Laquays Græf darff , wann gegen mich annoch ein Proceß ſtatt hatte , mmerhin verhoͤhret'</error>
  <error>INCONSISTENCY in TextLine ID 'l1012' of file 'OUTPUT_00000024': text results 'werden.' != concatenated 'werden .'</error>
  <error>INCONSISTENCY in TextLine ID 'l1053' of file 'OUTPUT_00000024': text results 'Die Wagnerin und deren Mann haben allſchon gegen die Inquiſitin ausgeſagt.' != concatenated 'Die Wagnerin und deren Mann haben allſchon gegen die Inquiſitin ausgeſagt .'</error>
  <error>INCONSISTENCY in TextLine ID 'l1130' of file 'OUTPUT_00000024': text results 'Der Schnltheiß zu Oberrod, der Wirth Krebs und Hr. Notarus Tribert ſind bereits' != concatenated 'Der Schnltheiß zu Oberrod , der Wirth Krebs und Hr . Notarus Tribert ſind bereits'</error>
  <error>INCONSISTENCY in TextLine ID 'l1203' of file 'OUTPUT_00000024': text results 'abgehoͤret.' != concatenated 'abgehoͤret .'</error>
  <notice>fileGrp USE does not begin with 'OCR-D-': OUTPUT</notice>
</report>

mikegerber commented 4 years ago

Your results are as expected (only whitespace-related problems from the XML itself).

(My issue was that, unexpectedly, I wasn't calling the version from the virtualenv but the buggy installation in ~/.local instead.) Calling the correct version, I get the same report as yours, so the bug is fixed.

OCR-D / core

page_validator.py produces wrong concatenated text #430