OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

Original files being copied? #92

Closed jbarth-ubhd closed 5 years ago

jbarth-ubhd commented 5 years ago

Here is my minimal METS example:

<?xml version="1.0" encoding="UTF-8"?>
<mets:mets xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:vls="http://semantics.de/vls" xmlns:mets="http://www.loc.gov/METS/" xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/version18/mets.xsd">
  <mets:dmdSec ID="dmdSec_0001">
    <mets:mdWrap MDTYPE="MODS">
      <mets:xmlData>
        <mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
          <mods:identifier type="purl">http://www.deutschestextarchiv.de/wundt_grundriss_1896</mods:identifier>
        </mods:mods>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file MIMETYPE="image/jpeg" ID="OCR-D-IMG_0001">
        <mets:FLocat LOCTYPE="OTHER" xlink:href="x/06.jpg" OTHERLOCTYPE="FILE"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="LOGICAL">
    <mets:div TYPE="Monograph" DMDID="dmdSec_0001" ID="loc_0001">
    </mets:div>
  </mets:structMap>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence" ID="physroot">
      <mets:div ID="phys_0001" TYPE="page" DMDID="DMGT_0001" ORDER="1">
        <mets:fptr FILEID="OCR-D-IMG_0001"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>

Why is the file x/06.jpg copied to OCR-D-IMG/OCR-D-IMG_0001.jpg after running ocrd-tesserocr-deskew -I OCR-D-IMG -O OCR-D-DESKEW?

Is this done to prevent naming conflicts later on?

kba commented 5 years ago

Thanks for reporting. No, this is a bug, not intended behavior. Input images shouldn't be copied in this case, since they are already available locally. I'm tracking this in https://github.com/OCR-D/core/issues/342, since the issue is in core.
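
To illustrate the distinction between locally available and remote inputs, here is a minimal sketch in Python. It is not the code in OCR-D core; the helper names is_local and classify_files are hypothetical, and the rule "no URL scheme or a file scheme means local" is an assumption made for illustration only. It reads the mets:FLocat entries of a METS document and reports which files would need to be fetched at all.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespaces as declared in the METS example above.
NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def is_local(href):
    """Assumption: an href with no scheme, or a file scheme, is already local."""
    return urlparse(href).scheme in ("", "file")


def classify_files(mets_xml):
    """Map each mets:file ID to a tuple (href, needs_download)."""
    root = ET.fromstring(mets_xml)
    result = {}
    for file_el in root.iterfind(".//mets:file", NS):
        flocat = file_el.find("mets:FLocat", NS)
        if flocat is None:
            continue  # no location recorded for this file
        href = flocat.get(XLINK_HREF)
        if href is None:
            continue  # FLocat without an xlink:href
        result[file_el.get("ID")] = (href, not is_local(href))
    return result
```

Applied to the METS file above, classify_files would map OCR-D-IMG_0001 to x/06.jpg with needs_download set to False, which is the case where no copy should be necessary. Note that this simple scheme check would misclassify Windows paths with a drive letter, so it is only a starting point.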