Open wollmers opened 4 years ago
Do you suggest removing every `TextRegion` without text from the PAGE XML files? Ideally Transkribus should not write them in the first place.
This seems to be caused by a Transkribus user deleting the text content instead of deleting the `TextRegion` or changing the `TextRegion` to a `SeparatorRegion`.
IMHO we should keep them and just handle them in the scripts. There are 101 such cases. Deleting them would also require renumbering `readingOrder`.
There is also one missing `TextEquiv`, which needs to be skipped in scripts, i.e. check whether it is defined before using it.
ERROR $TextEquiv not defined: ONB_ibn_19110701_008.tif TextRegion_id=region_1547026084602_43
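A minimal sketch of such a defensive check, in Python with the standard library only. The helper names `local` and `region_text` are made up for illustration and are not taken from the scripts in this repository. It looks only at the region's direct `TextEquiv` child, so a `TextEquiv` inside a `TextLine` is not mistaken for the region-level one, and it compares local tag names so it should work regardless of the PAGE namespace version.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def region_text(region):
    """Return the region-level text.

    Returns None if the region has no direct TextEquiv/Unicode child,
    and "" if the Unicode element exists but is empty.
    """
    for child in region:
        if local(child.tag) != "TextEquiv":
            continue
        for sub in child:
            if local(sub.tag) == "Unicode":
                return sub.text or ""
    return None


if __name__ == "__main__":
    # Hypothetical region without a region-level TextEquiv.
    region = ET.fromstring(
        '<TextRegion id="demo"><Coords points="0,0 1,0 1,1 0,1"/></TextRegion>'
    )
    text = region_text(region)
    if text is None:
        print("skipping region without TextEquiv:", region.get("id"))
```

Returning `None` for "missing" and `""` for "present but empty" keeps the two cases from this thread distinguishable in a report.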
The broken image just looks ugly on my proofreading webpage, because I don't catch the problem there. Good enough for internal work.
Just for the record, I checked one case by hand:
<Page imageFilename="ONB_ibn_19110701_024.tif" imageWidth="3168" imageHeight="4831">
[...]
<ReadingOrder>
<OrderedGroup id="ro_1567003993813" caption="Regions reading order">
[...]
<RegionRefIndexed index="60" regionRef="r_29_1"/>
[...]
<TextRegion type="paragraph" id="r_29_1" custom="readingOrder {index:60;}">
<Coords points="2051,4648 2062,4648 2062,4672 2051,4672"/>
<TextLine id="tl_94" primaryLanguage="German" custom="readingOrder {index:0;}">
<Coords points="2052,4649 2061,4649 2061,4671 2052,4671"/>
<Baseline points="2052,4671 2061,4671"/>
<TextEquiv>
<Unicode/>
</TextEquiv>
<TextStyle fontFamily="Times New Roman" fontSize="6.0" italic="true"/>
</TextLine>
<TextEquiv>
<Unicode></Unicode>
</TextEquiv>
</TextRegion>
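Cases like the one above could be listed automatically. The following is only a sketch, assuming that "empty" means no `Unicode` element anywhere inside the region contains non-whitespace text. The function name `empty_text_regions` is hypothetical.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def empty_text_regions(root):
    """Return the ids of all TextRegion elements that carry no text.

    A region counts as empty when none of its Unicode descendants
    (region level or line level) holds non-whitespace text. Regions
    without any Unicode element are reported as well.
    """
    ids = []
    for el in root.iter():
        if local(el.tag) != "TextRegion":
            continue
        texts = [u.text or "" for u in el.iter() if local(u.tag) == "Unicode"]
        if not any(t.strip() for t in texts):
            ids.append(el.get("id"))
    return ids


if __name__ == "__main__":
    # Reduced version of the hand-checked region, plus a made-up filled one.
    sample = """
    <Page imageFilename="ONB_ibn_19110701_024.tif">
      <TextRegion type="paragraph" id="r_29_1">
        <TextLine id="tl_94"><TextEquiv><Unicode/></TextEquiv></TextLine>
        <TextEquiv><Unicode></Unicode></TextEquiv>
      </TextRegion>
      <TextRegion id="filled">
        <TextEquiv><Unicode>some text</Unicode></TextEquiv>
      </TextRegion>
    </Page>
    """
    print(empty_text_regions(ET.fromstring(sample)))
```

Run over all files, this would give the list of regions to skip, or to check visually against the page image.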
The corresponding area in the page image contains a speckle outside the print space (DE: "Satzspiegel").
`PrintSpace` was not set to anything meaningful by Transkribus; it is the same as the image size:
<Page imageFilename="ONB_ibn_19110701_024.tif" imageWidth="3168" imageHeight="4831">
<PrintSpace>
<Coords points="0,0 3168,0 3168,4831 0,4831"/>
</PrintSpace>
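Whether `PrintSpace` carries any information can be checked by comparing its bounding box with the image dimensions. A sketch under the assumption that `imageWidth` and `imageHeight` are always present on `Page`. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def parse_points(points):
    """Parse a PAGE 'points' string such as '0,0 3168,0' into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def print_space_is_full_image(page):
    """Return True if the PrintSpace bounding box equals the whole image.

    Returns False if PrintSpace (or its Coords) is missing or smaller.
    """
    width = int(page.get("imageWidth"))
    height = int(page.get("imageHeight"))
    for child in page:
        if local(child.tag) != "PrintSpace":
            continue
        coords = next((c for c in child if local(c.tag) == "Coords"), None)
        if coords is None:
            return False
        pts = parse_points(coords.get("points", ""))
        if not pts:
            return False
        xs = [x for x, _ in pts]
        ys = [y for _, y in pts]
        return (min(xs), min(ys), max(xs), max(ys)) == (0, 0, width, height)
    return False
```

If this returns `True`, the print space cannot be used to filter out regions such as the speckle above, so a visual check is still needed.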
I am still skipping them, because deleting would need a visual check in the context of the page image, and that is tedious for several reasons.
The correct way would be to change the `TextRegion` to a `NoiseRegion` and delete the entry in `ReadingOrder`.
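A rough sketch of what that fix could look like in Python. It makes several assumptions that are not confirmed in this thread: that a `NoiseRegion` should keep only its `Coords`, that the `type` and `custom` attributes should be dropped, and that the `readingOrder {index:N;}` entries in the `custom` attributes of the remaining regions must follow the renumbered `RegionRefIndexed` entries. The function name `retag_as_noise` is hypothetical, and the result should be validated against the PAGE schema before it is committed.

```python
import re
import xml.etree.ElementTree as ET

# Matches the leading part of 'readingOrder {index:60;}' up to the number.
_RO_INDEX = re.compile(r"(readingOrder\s*\{\s*index\s*:\s*)\d+")


def local(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def retag_as_noise(root, region_id):
    """Turn one TextRegion into a NoiseRegion and fix the reading order.

    Returns False if no TextRegion with the given id exists.
    """
    region = next(
        (el for el in root.iter()
         if local(el.tag) == "TextRegion" and el.get("id") == region_id),
        None,
    )
    if region is None:
        return False

    # Keep the namespace prefix of the tag, swap only the local name.
    region.tag = region.tag[: -len("TextRegion")] + "NoiseRegion"
    for child in list(region):
        if local(child.tag) != "Coords":  # assumption: keep only Coords
            region.remove(child)
    region.attrib.pop("type", None)
    region.attrib.pop("custom", None)

    # Drop the reference and renumber the remaining entries from 0.
    new_index = {}
    for group in root.iter():
        if local(group.tag) != "OrderedGroup":
            continue
        refs = [r for r in group if local(r.tag) == "RegionRefIndexed"]
        for ref in refs:
            if ref.get("regionRef") == region_id:
                group.remove(ref)
        kept = [r for r in refs if r.get("regionRef") != region_id]
        kept.sort(key=lambda r: int(r.get("index")))
        for i, ref in enumerate(kept):
            ref.set("index", str(i))
            new_index[ref.get("regionRef")] = i

    # Mirror the new indices in the regions' custom attributes.
    for el in root.iter():
        custom = el.get("custom")
        if custom and el.get("id") in new_index:
            idx = str(new_index[el.get("id")])
            el.set("custom", _RO_INDEX.sub(lambda m: m.group(1) + idx, custom))
    return True
```

When writing the tree back with `ElementTree`, calling `ET.register_namespace("", ...)` with the file's PAGE namespace first avoids generated `ns0:` prefixes in the output.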
Shocked at first, I checked the git history to see whether I had deleted them by mistake. No, they never existed.
E. g.
In the XML they are empty:
This does not seem to be an important problem. I just didn't expect to have to check for and handle missing or empty files everywhere.