UB-Mannheim / AustrianNewspapers

NewsEye / READ OCR training dataset from Austrian Newspapers (1864–1911)

Missing line files (txt and png) #16

Open wollmers opened 4 years ago

wollmers commented 4 years ago

Shocked at first, I checked the git history to see whether I had deleted them by mistake. No, they never existed.

E.g.:

ONB_nfp_19110701_006.tif_tl_6.gt.txt
ONB_nfp_19110701_006.tif_tl_6.png

In the XML they are empty:

    <TextRegion type="paragraph" id="r_6_1" custom="readingOrder {index:5;}">
      <Coords points="3235,186 3239,186 3239,190 3235,190"/>
      <TextLine id="tl_6" primaryLanguage="German" custom="readingOrder {index:0;}">
        <Coords points="3236,187 3238,187 3238,189 3236,189"/>
        <Baseline points="3236,189 3238,189"/>
        <TextEquiv>
          <Unicode/>
        </TextEquiv>
        <TextStyle fontFamily="Times New Roman" fontSize="5.0" bold="true" italic="true"/>
      </TextLine>
      <TextEquiv>
        <Unicode></Unicode>
      </TextEquiv>
    </TextRegion>

This does not seem to be an important problem. I just didn't expect to have to check for and handle missing or empty files everywhere.
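A minimal sketch of such a check, assuming the PAGE XML is read with Python's standard `xml.etree.ElementTree`. The helper names `local_name` and `empty_text_lines` are hypothetical, and namespaces are stripped so that the code does not depend on a particular PAGE schema version:

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix so the check works for any PAGE version."""
    return tag.rsplit("}", 1)[-1]


def empty_text_lines(page_xml):
    """Return the ids of TextLine elements whose line-level Unicode text is empty."""
    root = ET.fromstring(page_xml)
    empty_ids = []
    for line in root.iter():
        if local_name(line.tag) != "TextLine":
            continue
        text = ""
        # Look only at the TextEquiv that is a direct child of the line.
        for child in line:
            if local_name(child.tag) == "TextEquiv":
                for uni in child:
                    if local_name(uni.tag) == "Unicode":
                        text = uni.text or ""
        if not text.strip():
            empty_ids.append(line.get("id"))
    return empty_ids
```

For the snippet quoted above, this would report `tl_6`, so a script could skip that line instead of expecting `.gt.txt` and `.png` files for it.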

stweil commented 4 years ago

Do you suggest removing each TextRegion without text from the PAGE XML files? Ideally, Transkribus should not write them in the first place.

wollmers commented 4 years ago

This seems to be caused by a Transkribus user deleting the text content instead of deleting the TextRegion or changing it to a SeparatorRegion.

IMHO we should keep them and just handle them in the scripts. There are 101 such cases. Deleting them would also require renumbering readingOrder.

There is also one missing TextEquiv, which needs to be skipped in scripts, i.e. check whether it is defined before using it.

ERROR $TextEquiv not defined: ONB_ibn_19110701_008.tif TextRegion_id=region_1547026084602_43
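The "check whether it is defined" step could look roughly like this in Python with `xml.etree.ElementTree`. This is only a sketch and not the script that produced the error above; `child_by_local_name` and `region_text` are hypothetical helper names:

```python
import xml.etree.ElementTree as ET


def child_by_local_name(element, name):
    """Return the first direct child with the given tag name, ignoring namespaces."""
    for child in element:
        if child.tag.rsplit("}", 1)[-1] == name:
            return child
    return None


def region_text(region):
    """Return the region's Unicode text, or None if TextEquiv or Unicode is missing."""
    text_equiv = child_by_local_name(region, "TextEquiv")
    # Compare with None explicitly: an Element without children is falsy.
    if text_equiv is None:
        return None
    unicode_el = child_by_local_name(text_equiv, "Unicode")
    if unicode_el is None:
        return None
    return unicode_el.text or ""
```

A caller can then distinguish a missing TextEquiv (`None`) from an empty one (`""`) and skip or log the region accordingly.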

The broken image just looks ugly on my proofreading web page, because I don't catch the problem there. Good enough for internal work.

[Image: screenshot, 2020-07-13 10:30:16]

wollmers commented 3 years ago

Just for the record:

Checked one case by hand:

  <Page imageFilename="ONB_ibn_19110701_024.tif" imageWidth="3168" imageHeight="4831">
[...]
    <ReadingOrder>
      <OrderedGroup id="ro_1567003993813" caption="Regions reading order">
[...]
        <RegionRefIndexed index="60" regionRef="r_29_1"/>
[...]
    <TextRegion type="paragraph" id="r_29_1" custom="readingOrder {index:60;}">
      <Coords points="2051,4648 2062,4648 2062,4672 2051,4672"/>
      <TextLine id="tl_94" primaryLanguage="German" custom="readingOrder {index:0;}">
        <Coords points="2052,4649 2061,4649 2061,4671 2052,4671"/>
        <Baseline points="2052,4671 2061,4671"/>
        <TextEquiv>
          <Unicode/>
        </TextEquiv>
        <TextStyle fontFamily="Times New Roman" fontSize="6.0" italic="true"/>
      </TextLine>
      <TextEquiv>
        <Unicode></Unicode>
      </TextEquiv>
    </TextRegion>

The corresponding area in the page image contains a speckle outside the print space (DE: "Satzspiegel").

Transkribus did not set a real PrintSpace; it is the same as the image size:

  <Page imageFilename="ONB_ibn_19110701_024.tif" imageWidth="3168" imageHeight="4831">
    <PrintSpace>
      <Coords points="0,0 3168,0 3168,4831 0,4831"/>
    </PrintSpace>
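Since PrintSpace covers the whole image here, it cannot be used to filter out such speckles. One possible alternative is a size heuristic on the region's `Coords`. The sketch below is only an illustration: the function names are hypothetical and the 30-pixel threshold is an assumption, not a value derived from the dataset:

```python
def bounding_box_size(points):
    """Return (width, height) of the box spanned by a PAGE 'points' string."""
    xs, ys = [], []
    for pair in points.split():
        x, y = pair.split(",")
        xs.append(int(x))
        ys.append(int(y))
    return max(xs) - min(xs), max(ys) - min(ys)


def looks_like_speckle(points, max_side=30):
    """Heuristic: both sides of the box are at most max_side pixels (assumed threshold)."""
    width, height = bounding_box_size(points)
    return width <= max_side and height <= max_side


# The two empty regions quoted in this thread:
print(bounding_box_size("3235,186 3239,186 3239,190 3235,190"))      # (4, 4)
print(bounding_box_size("2051,4648 2062,4648 2062,4672 2051,4672"))  # (11, 24)
```

Such a filter would only narrow down the candidates; it does not replace the visual check against the page image.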

I am still skipping them, because deleting them would require a visual check in the context of the page image.

But that is tedious for several reasons.

The correct way would be to change the TextRegion to a NoiseRegion and delete its entry in ReadingOrder.
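A rough sketch of that conversion with `xml.etree.ElementTree`; `mark_as_noise` is a hypothetical helper. It assumes that a NoiseRegion should keep only its `id` and `Coords`, and that the `custom` attribute holds nothing but the readingOrder entry. The `index` values of the remaining RegionRefIndexed entries are left untouched, so any renumbering would be a separate step:

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def mark_as_noise(root, region_id):
    """Turn the TextRegion with the given id into a NoiseRegion, in place.

    Also removes the matching RegionRefIndexed entry. Returns True if a
    region was changed.
    """
    changed = False
    # Work on a snapshot so that removing children does not disturb iteration.
    for region in list(root.iter()):
        if local_name(region.tag) == "TextRegion" and region.get("id") == region_id:
            prefix = region.tag[: len(region.tag) - len("TextRegion")]
            region.tag = prefix + "NoiseRegion"   # keep the namespace prefix
            region.attrib.pop("type", None)
            region.attrib.pop("custom", None)     # assumption: only readingOrder
            for child in list(region):
                if local_name(child.tag) != "Coords":
                    region.remove(child)          # drop TextLine, TextEquiv, ...
            changed = True
    for group in list(root.iter()):
        for ref in list(group):
            if (local_name(ref.tag) == "RegionRefIndexed"
                    and ref.get("regionRef") == region_id):
                group.remove(ref)
    return changed
```

For the case above, this would be called as `mark_as_noise(root, "r_29_1")` after the visual check has confirmed that the region is a speckle.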