OCR-D / ocrd_all

Master repository which includes most other OCR-D repositories as submodules
MIT License

empty OCR #412

Closed by jbarth-ubhd 8 months ago

jbarth-ubhd commented 9 months ago

With this workflow:

singocrd ocrd workspace init
singocrd ocrd workspace add -g P_00001 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tiff OCR-D-IMG/00001.tif
singocrd ocrd-sbb-binarize -P model default-2021-03-09 -I OCR-D-IMG -O OCR-D-001
singocrd ocrd-anybaseocr-crop -I OCR-D-001 -O OCR-D-002
singocrd ocrd-olena-binarize -P impl wolf -P k 0.10 -I OCR-D-002 -O OCR-D-003
singocrd ocrd-cis-ocropy-deskew -P level-of-operation page -I OCR-D-003 -O OCR-D-004
singocrd ocrd-tesserocr-segment -P find_tables true -P shrink_polygons true -I OCR-D-004 -O OCR-D-005
singocrd ocrd-calamari-recognize -P checkpoint_dir $HOME/ocrd_models/ocrd-calamari-recognize/qurator-gt4histocr-1.0 -I OCR-D-005 -O OCR-D-OCR

There is no text in OCR-D-OCR*.xml.

All files (see run.sh for workflow and ocrd.log for log):

https://digi.ub.uni-heidelberg.de/diglitData/v/christliche_kunstblaetter1862--08--empty-ocr.zip

bertsky commented 9 months ago

https://digi.ub.uni-heidelberg.de/diglitData/v/christliche_kunstblaetter1862--08--empty-ocr.zip

It says HTTP 403 to me.

jbarth-ubhd commented 9 months ago

Wrong permissions after scp?! ... Please try again.

jbarth-ubhd commented 9 months ago

OCR-D-OCR...xml is missing from the zip archive, therefore I am posting it here:

<?xml version="1.0" encoding="UTF-8"?>
<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd" pcGtsId="OCR-D-OCR_00001.IMG-BIN.IMG-CROP.IMG-DESKEW.IMG-BIN">
    <pc:Metadata>
        <pc:Creator>OCR-D/core 2.63.0</pc:Creator>
        <pc:Created>2024-03-01T12:48:46.945457</pc:Created>
        <pc:LastChange>2024-03-01T12:48:46.945457</pc:LastChange>
        <pc:MetadataItem type="processingStep" name="recognition/text-recognition" value="ocrd-calamari-recognize">
            <pc:Labels externalModel="ocrd-tool" externalId="parameters">
                <pc:Label value="/home/hd/hd_hd/hd_wu120/ocrd_models/ocrd-calamari-recognize/qurator-gt4histocr-1.0" type="checkpoint_dir"/>
                <pc:Label value="confidence_voter_default_ctc" type="voter"/>
                <pc:Label value="line" type="textequiv_level"/>
                <pc:Label value="0.001" type="glyph_conf_cutoff"/>
            </pc:Labels>
            <pc:Labels externalModel="ocrd-tool" externalId="version">
                <pc:Label value="1.0.6 (calamari 1.0.6, tensorflow 2.13.1)" type="ocrd-calamari-recognize"/>
                <pc:Label value="2.63.0" type="ocrd/core"/>
            </pc:Labels>
        </pc:MetadataItem>
    </pc:Metadata>
    <pc:Page imageFilename="OCR-D-005/OCR-D-005_00001.IMG-BIN.IMG-CROP.IMG-DESKEW.IMG-BIN.png" imageWidth="2229" imageHeight="2942"/>
</pc:PcGts>

bertsky commented 9 months ago

That says it all. We are chasing the same bug (regression) that haunts us everywhere now, see https://github.com/OCR-D/ocrd_tesserocr/issues/201. (Last I checked, I could not reproduce though.)
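
To make "empty" concrete, here is a minimal Python sketch that counts text regions, text lines and non-empty Unicode elements in a PAGE document. It is not part of OCR-D. The function name `page_text_summary` and the abbreviated `EMPTY_PAGE` sample are illustrative assumptions. The namespace is the one declared in the file posted above.

```python
import xml.etree.ElementTree as ET

# PAGE namespace as declared in the file posted above (2019-07-15 schema).
PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}


def page_text_summary(page_xml: str) -> dict:
    """Count text regions, text lines and non-empty Unicode elements (hypothetical helper)."""
    root = ET.fromstring(page_xml)
    unicode_nodes = root.findall(".//pc:TextEquiv/pc:Unicode", PAGE_NS)
    return {
        "regions": len(root.findall(".//pc:TextRegion", PAGE_NS)),
        "lines": len(root.findall(".//pc:TextLine", PAGE_NS)),
        "nonempty_text": sum(1 for node in unicode_nodes if (node.text or "").strip()),
    }


# Abbreviated stand-in for the posted file: a Page element without any children.
# The image file name is a placeholder, not the real one.
EMPTY_PAGE = (
    '<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">'
    '<pc:Page imageFilename="example.png" imageWidth="2229" imageHeight="2942"/>'
    "</pc:PcGts>"
)

if __name__ == "__main__":
    print(page_text_summary(EMPTY_PAGE))
```

A result with zero regions means the segmentation never reached this file, so the recognizer had nothing to work on.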

mikegerber commented 9 months ago

This has the same invalid physical structMap we saw elsewhere:

  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="P_00001">
        <mets:fptr FILEID="OCR-D-IMG_00001"/>
      </mets:div>
      <mets:div TYPE="page" ID="P_00001">
        <mets:fptr FILEID="OCR-D-001_00001.IMG-BIN"/>
      </mets:div>
      <mets:div TYPE="page" ID="P_00001">
        <mets:fptr FILEID="OCR-D-002_00001.IMG-BIN.IMG-CROP"/>
      </mets:div>
      <mets:div TYPE="page" ID="P_00001">
        <mets:fptr FILEID="OCR-D-003_00001.IMG-BIN.IMG-CROP-BIN_wolf"/>
      </mets:div>
      <mets:div TYPE="page" ID="P_00001">
        <mets:fptr FILEID="OCR-D-003_00001.IMG-BIN.IMG-CROP"/>
      </mets:div>
      <mets:div TYPE="page" ID="P_00001">
        <mets:fptr FILEID="OCR-D-004_00001.IMG-BIN.IMG-CROP.IMG-DESKEW"/>
      </mets:div>
      <mets:div TYPE="page" ID="P_00001">
        <mets:fptr FILEID="OCR-D-005_00001.IMG-BIN.IMG-CROP.IMG-DESKEW.IMG-BIN"/>
      </mets:div>
      <mets:div TYPE="page" ID="P_00001">
        <mets:fptr FILEID="OCR-D-OCR_00001.IMG-BIN.IMG-CROP.IMG-DESKEW.IMG-BIN"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
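
The snippet repeats `ID="P_00001"` on every page div, although METS IDs must be unique within a document; one would expect a single div per page holding all of its file pointers. A minimal Python sketch for spotting this is below. The function name `duplicate_page_ids` and the abbreviated `BROKEN_METS` sample are illustrative assumptions, not OCR-D API.

```python
import xml.etree.ElementTree as ET
from collections import Counter

METS = "{http://www.loc.gov/METS/}"  # standard METS namespace in Clark notation


def duplicate_page_ids(mets_xml: str) -> dict:
    """Return page IDs that occur on more than one div of the physical structMap."""
    root = ET.fromstring(mets_xml)
    counts = Counter()
    for struct_map in root.iter(METS + "structMap"):
        if struct_map.get("TYPE") != "PHYSICAL":
            continue
        for div in struct_map.iter(METS + "div"):
            if div.get("TYPE") == "page":
                counts[div.get("ID")] += 1
    return {page_id: n for page_id, n in counts.items() if n > 1}


# Abbreviated version of the structMap above, wrapped in a root element.
BROKEN_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="P_00001">
        <mets:fptr FILEID="OCR-D-IMG_00001"/>
      </mets:div>
      <mets:div TYPE="page" ID="P_00001">
        <mets:fptr FILEID="OCR-D-001_00001.IMG-BIN"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""

if __name__ == "__main__":
    print(duplicate_page_ids(BROKEN_METS))
```

An empty result means every page ID appears on exactly one div.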

@jbarth-ubhd Did you produce this with an ocrd/all Docker image?

jbarth-ubhd commented 9 months ago

With this ocrd.sif, built from the ocrd/all maximum Docker image: 8687316992 2024-02-21 15:30:33 +0100 ocrd.sif

mikegerber commented 9 months ago

I was using roughly the same version, I think. I have no experience with Singularity, but I was using the maximum image from a few days ago.

mikegerber commented 9 months ago

That says it all. We are chasing the same bug (regression) that haunts us everywhere now, see OCR-D/ocrd_tesserocr#201. (Last I checked, I could not reproduce though.)

@bertsky Just out of curiosity: What is wrong with that part of the XML?

bertsky commented 9 months ago

@mikegerber

Just out of curiosity: What is wrong with that part of the XML?

The problem is that the original image is referencing the derived image (from deskewing). It's essentially what happens if the METS is broken in the way your snippet shows.

I can reproduce this now – even without workspace add.

bertsky commented 9 months ago

I can now say that it's a caching issue. If I run with OCRD_METS_CACHING=0, then the problem disappears.

The default in the Docker builds is now OCRD_METS_CACHING=1: https://github.com/OCR-D/ocrd_all/blob/5af34e7d18e88147f3300f2f3a2d2bf81cddc880/Dockerfile#L51
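
As a workaround sketch, a processor can be run with the variable overridden for that call only. This assumes the processors are launched from Python; the helper names `env_with_mets_caching_disabled` and `run_without_mets_caching` are hypothetical, and only the variable name and the value 0 come from this thread.

```python
import os
import subprocess


def env_with_mets_caching_disabled(base_env=None) -> dict:
    """Return a copy of the environment with OCRD_METS_CACHING set to 0."""
    env = dict(os.environ if base_env is None else base_env)
    env["OCRD_METS_CACHING"] = "0"  # value reported above to make the problem disappear
    return env


def run_without_mets_caching(command: list) -> None:
    """Run one command (e.g. an OCR-D processor) with METS caching switched off."""
    subprocess.run(command, env=env_with_mets_caching_disabled(), check=True)
```

For example, the deskewing step from the workflow above could be run as `run_without_mets_caching(["ocrd-cis-ocropy-deskew", "-P", "level-of-operation", "page", "-I", "OCR-D-003", "-O", "OCR-D-004"])`. Setting the variable once in the container environment has the same effect for all steps.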

jbarth-ubhd commented 9 months ago

I added OCRD_METS_CACHING=0 to my Singularity ocrd.env, and that helps.

mikegerber commented 9 months ago

This is https://github.com/OCR-D/core/issues/1195

bertsky commented 8 months ago

It required a new core v2.63.3 to appear on PyPI, then a rebuild of ocrd/core and then of ocrd/all:* before this was actually fixed.