Closed mikegerber closed 2 years ago
The "downloaded" images' filenames are made from the mets:file's ID:
<mets:fileGrp USE="OCR-D-IMG">
  <mets:file MIMETYPE="image/tiff" ID="OCR-D-IMG_0001">
    <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/INPUT_0017.tif"/>
  </mets:file>
  <mets:file MIMETYPE="image/tiff" ID="OCR-D-IMG_0002">
    <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/INPUT_0020.tif"/>
  </mets:file>
</mets:fileGrp>
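To make the ID/href relationship easier to see, here is a minimal sketch (not code from core) that reads such a fileGrp with the Python standard library and lists each file's ID next to the path it actually points to. The namespace declarations on the wrapper element and the helper name `list_files` are my additions so that the fragment parses on its own.

```python
import posixpath
import xml.etree.ElementTree as ET

# Standard METS and XLink namespaces; added here so the fragment is parseable.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

METS_SNIPPET = """\
<mets:fileGrp xmlns:mets="http://www.loc.gov/METS/"
              xmlns:xlink="http://www.w3.org/1999/xlink" USE="OCR-D-IMG">
  <mets:file MIMETYPE="image/tiff" ID="OCR-D-IMG_0001">
    <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/INPUT_0017.tif"/>
  </mets:file>
  <mets:file MIMETYPE="image/tiff" ID="OCR-D-IMG_0002">
    <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/INPUT_0020.tif"/>
  </mets:file>
</mets:fileGrp>
"""


def list_files(xml_text):
    """Return (ID, href) pairs for every mets:file in the fragment."""
    root = ET.fromstring(xml_text)
    pairs = []
    for file_el in root.iter("{%s}file" % NS["mets"]):
        flocat = file_el.find("mets:FLocat", NS)
        href = flocat.get("{%s}href" % NS["xlink"])
        pairs.append((file_el.get("ID"), href))
    return pairs


PAIRS = list_files(METS_SNIPPET)
for file_id, href in PAIRS:
    # The basename in the METS differs from the ID, which is the crux here.
    print(file_id, "->", href, "| basename:", posixpath.basename(href))
```

For both files the basename in the METS (INPUT_0017.tif, INPUT_0020.tif) is unrelated to the ID, so any naming scheme based on the ID will necessarily rename them.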
With an old(!) checkout of test/assets I did not get these failures with the new code, so this may be worth investigating.
See also #72.
I think this is caused by a change in assets: https://github.com/OCR-D/assets/commit/b12e5ebc12450bd70e9ec7a9d7afeb48f6201773, which was supposed to fix https://github.com/OCR-D/assets/issues/87, but does not work. Here is a debug log of what actually happens when copying the workspace to a temporary location:
DEBUG ocrd.resolver.workspace_from_url:resolver.py:164 workspace_from_url
mets_basename='mets.xml'
mets_url='/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml'
src_baseurl='/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data'
dst_dir='/tmp/test-ocrd-calamari'
DEBUG ocrd.resolver.download_to_directory:resolver.py:49 directory=|/tmp/test-ocrd-calamari| url=|/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml| basename=|mets.xml| if_exists=|skip| subdir=|None|
DEBUG ocrd.resolver.download_to_directory:resolver.py:99 Copying file '/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml' to '/tmp/test-ocrd-calamari/mets.xml'
DEBUG ocrd.workspace.download_file:workspace.py:142 download_file <OcrdFile fileGrp=OCR-D-IMG ID=OCR-D-IMG_0001, mimetype=image/tiff, url=OCR-D-IMG/INPUT_0017.tif, local_filename=OCR-D-IMG/INPUT_0017.tif]/> [_recursion_count=0]
DEBUG ocrd.resolver.download_to_directory:resolver.py:49 directory=|/tmp/test-ocrd-calamari| url=|OCR-D-IMG/INPUT_0017.tif| basename=|OCR-D-IMG_0001.tif| if_exists=|skip| subdir=|OCR-D-IMG|
DEBUG ocrd.workspace.download_file:workspace.py:158 First run of resolver.download_to_directory(OCR-D-IMG/INPUT_0017.tif) failed, try prepending baseurl '/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data': File path passed as 'url' to download_to_directory does not exist: OCR-D-IMG/INPUT_0017.tif
DEBUG ocrd.workspace.download_file:workspace.py:142 download_file <OcrdFile fileGrp=OCR-D-IMG ID=OCR-D-IMG_0001, mimetype=image/tiff, url=/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/OCR-D-IMG/INPUT_0017.tif, local_filename=OCR-D-IMG/INPUT_0017.tif]/> [_recursion_count=1]
DEBUG ocrd.resolver.download_to_directory:resolver.py:49 directory=|/tmp/test-ocrd-calamari| url=|/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/OCR-D-IMG/INPUT_0017.tif| basename=|OCR-D-IMG_0001.tif| if_exists=|skip| subdir=|OCR-D-IMG|
DEBUG ocrd.resolver.download_to_directory:resolver.py:99 Copying file '/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/OCR-D-IMG/INPUT_0017.tif' to '/tmp/test-ocrd-calamari/OCR-D-IMG/OCR-D-IMG_0001.tif'
So, essentially, Resolver.workspace_from_url undoes the non-standard path names when downloading, and subsequently the @imageFilename reference does not work (again).
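As a rough illustration of what the log shows (this is a sketch of the observed behaviour, not core's actual implementation), compare a hypothetical ID-based naming rule with one that keeps the basename from the URL. The function names and the value used for @imageFilename are assumptions for illustration only.

```python
import posixpath


def id_based_basename(file_id, url):
    """Hypothetical rule: name the local copy after the mets:file ID,
    keeping only the extension of the original URL."""
    ext = posixpath.splitext(url)[1]
    return file_id + ext


def url_based_basename(url):
    """Hypothetical alternative: keep whatever basename the METS uses."""
    return posixpath.basename(url)


url = "OCR-D-IMG/INPUT_0017.tif"
subdir = "OCR-D-IMG"

renamed = posixpath.join(subdir, id_based_basename("OCR-D-IMG_0001", url))
preserved = posixpath.join(subdir, url_based_basename(url))

# Assumed value of @imageFilename in the PAGE file, for illustration.
image_filename = "OCR-D-IMG/INPUT_0017.tif"

print(renamed, "matches @imageFilename:", renamed == image_filename)
print(preserved, "matches @imageFilename:", preserved == image_filename)
```

With the ID-based rule the copy ends up as OCR-D-IMG/OCR-D-IMG_0001.tif, which is the name seen in the last log line, and a reference to the original path no longer resolves inside the cloned workspace.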
@kba I suppose we could fix this in assets by using standard basenames, but it looks more like a bug in core to me.
Relevant parts of test_recognize.py:
METS_KANT = assets.url_of('kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml')
WORKSPACE_DIR = '/tmp/test-ocrd-calamari'
resolver = Resolver()
workspace = resolver.workspace_from_url(METS_KANT, dst_dir=WORKSPACE_DIR)
for imgf in workspace.mets.find_files(fileGrp="OCR-D-IMG"):
    imgf = workspace.download_file(imgf)
    print(imgf)
This clones the workspace from test/assets and doesn't give the correct local filenames:
<OcrdFile fileGrp=OCR-D-IMG ID=OCR-D-IMG_0001, mimetype=image/tiff, url=OCR-D-IMG/OCR-D-IMG_0001.tif, local_filename=OCR-D-IMG/OCR-D-IMG_0001.tif]/>
<OcrdFile fileGrp=OCR-D-IMG ID=OCR-D-IMG_0002, mimetype=image/tiff, url=OCR-D-IMG/OCR-D-IMG_0002.tif, local_filename=OCR-D-IMG/OCR-D-IMG_0002.tif]/>
Since the last update, the tests are broken:
Observations:
The new code from @bertsky's change in https://github.com/OCR-D/ocrd_calamari/commit/1f0252d0d7d1cffe76bc1f3626a536fe84106eff should download OCR-D-IMG/INPUT_0017.tif but doesn't: