OCR-D / ocrd_calamari

Recognize text using Calamari OCR and the OCR-D framework
Apache License 2.0

Tests broken since last update #73

Closed: mikegerber closed this issue 2 years ago

mikegerber commented 2 years ago

Since the last update, the tests are broken:

------------------------------------------------------------------- Captured stderr call --------------------------------------------------------------------
11:00:07.844 INFO processor.CalamariRecognize - INPUT FILE 0 / phys_0001
--------------------------------------------------------------------- Captured log call ---------------------------------------------------------------------
INFO     processor.CalamariRecognize:recognize.py:81 INPUT FILE 0 / phys_0001
================================================================== short test summary info ==================================================================
FAILED test/test_recognize.py::test_recognize - requests.exceptions.MissingSchema: Invalid URL 'OCR-D-IMG/INPUT_0017.tif': No scheme supplied. Perhaps you...
FAILED test/test_recognize.py::test_recognize_should_warn_if_given_rgb_image_and_single_channel_model - requests.exceptions.MissingSchema: Invalid URL 'OC...
FAILED test/test_recognize.py::test_word_segmentation - requests.exceptions.MissingSchema: Invalid URL 'OCR-D-IMG/INPUT_0017.tif': No scheme supplied. Per...
FAILED test/test_recognize.py::test_glyphs - requests.exceptions.MissingSchema: Invalid URL 'OCR-D-IMG/INPUT_0017.tif': No scheme supplied. Perhaps you me...
==================================================================== 4 failed in 16.04s =====================================================================
make: *** [Makefile:77: test] Error 1
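
A note on the error itself: requests raises MissingSchema when it is handed a string without a URL scheme, so the relative path OCR-D-IMG/INPUT_0017.tif apparently ended up being treated as a remote URL. A minimal sketch of how such a value could be classified before any download attempt, using only the standard library. The helper name looks_remote is hypothetical and not part of OCR-D core or requests:

```python
from urllib.parse import urlsplit


def looks_remote(url: str) -> bool:
    """Return True if `url` carries an http(s) scheme.

    Hypothetical helper for illustration only; not part of ocrd or requests.
    """
    return urlsplit(url).scheme in ("http", "https")


# The relative path from the failing tests has no scheme at all.
print(looks_remote("OCR-D-IMG/INPUT_0017.tif"))            # False
# A full URL (example.org is a placeholder) does have one.
print(looks_remote("https://example.org/INPUT_0017.tif"))  # True
```

A value for which this returns False would have to be resolved as a local file (relative to the workspace directory) rather than fetched over HTTP.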

Observations:

The new code from @bertsky's change in https://github.com/OCR-D/ocrd_calamari/commit/1f0252d0d7d1cffe76bc1f3626a536fe84106eff should download OCR-D-IMG/INPUT_0017.tif but doesn't:

% ls /tmp/test-ocrd-calamari/OCR-D-IMG 
OCR-D-IMG_0001.tif  OCR-D-IMG_0002.tif

mikegerber commented 2 years ago

The filenames of the "downloaded" images are derived from each mets:file's ID, not from the xlink:href of its mets:FLocat:

    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file MIMETYPE="image/tiff" ID="OCR-D-IMG_0001">
        <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/INPUT_0017.tif"/>
      </mets:file>
      <mets:file MIMETYPE="image/tiff" ID="OCR-D-IMG_0002">
        <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/INPUT_0020.tif"/>
      </mets:file>
    </mets:fileGrp>
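
To make the mismatch concrete, here is a rough sketch that parses the file group above and prints, for each mets:file, the original href next to the ID-based name. It uses only the standard library. The naming rule (file group directory + ID + original extension) is an assumption inferred from the ls output above, not taken from the core source, and compare_names is a hypothetical helper:

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Same content as the fileGrp above, with namespace declarations added
# so that the fragment can be parsed on its own.
METS_SNIPPET = """
<mets:fileGrp xmlns:mets="http://www.loc.gov/METS/"
              xmlns:xlink="http://www.w3.org/1999/xlink" USE="OCR-D-IMG">
  <mets:file MIMETYPE="image/tiff" ID="OCR-D-IMG_0001">
    <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/INPUT_0017.tif"/>
  </mets:file>
  <mets:file MIMETYPE="image/tiff" ID="OCR-D-IMG_0002">
    <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/INPUT_0020.tif"/>
  </mets:file>
</mets:fileGrp>
"""


def compare_names(file_grp_xml: str):
    """Yield (href, id_based_path) for each mets:file in the group."""
    grp = ET.fromstring(file_grp_xml.strip())
    use = grp.get("USE")
    for f in grp.findall("mets:file", NS):
        href = f.find("mets:FLocat", NS).get("{%s}href" % NS["xlink"])
        # Assumption: the copy keeps the extension and uses the ID as stem.
        suffix = PurePosixPath(href).suffix
        yield href, f"{use}/{f.get('ID')}{suffix}"


for href, copied in compare_names(METS_SNIPPET):
    print(f"{href} -> {copied}")
```

Under that assumption, INPUT_0017.tif maps to OCR-D-IMG_0001.tif and INPUT_0020.tif to OCR-D-IMG_0002.tif, which matches the directory listing.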

mikegerber commented 2 years ago

With an old(!) checkout of test/assets, I did not get these failures with the new code, so this may be worth investigating.

mikegerber commented 2 years ago

Regarding "With an old(!) checkout of test/assets": see also #72.

bertsky commented 2 years ago

I think this is caused by a change in assets: https://github.com/OCR-D/assets/commit/b12e5ebc12450bd70e9ec7a9d7afeb48f6201773, which was supposed to fix https://github.com/OCR-D/assets/issues/87, but does not work. Here is a debug log of what actually happens when copying the workspace to a temporary location:

DEBUG    ocrd.resolver.workspace_from_url:resolver.py:164 workspace_from_url
mets_basename='mets.xml'
mets_url='/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml'
src_baseurl='/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data'
dst_dir='/tmp/test-ocrd-calamari'
DEBUG    ocrd.resolver.download_to_directory:resolver.py:49 directory=|/tmp/test-ocrd-calamari| url=|/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml| basename=|mets.xml| if_exists=|skip| subdir=|None|
DEBUG    ocrd.resolver.download_to_directory:resolver.py:99 Copying file '/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml' to '/tmp/test-ocrd-calamari/mets.xml'
DEBUG    ocrd.workspace.download_file:workspace.py:142 download_file <OcrdFile fileGrp=OCR-D-IMG ID=OCR-D-IMG_0001, mimetype=image/tiff, url=OCR-D-IMG/INPUT_0017.tif, local_filename=OCR-D-IMG/INPUT_0017.tif]/>  [_recursion_count=0]
DEBUG    ocrd.resolver.download_to_directory:resolver.py:49 directory=|/tmp/test-ocrd-calamari| url=|OCR-D-IMG/INPUT_0017.tif| basename=|OCR-D-IMG_0001.tif| if_exists=|skip| subdir=|OCR-D-IMG|
DEBUG    ocrd.workspace.download_file:workspace.py:158 First run of resolver.download_to_directory(OCR-D-IMG/INPUT_0017.tif) failed, try prepending baseurl '/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data': File path passed as 'url' to download_to_directory does not exist: OCR-D-IMG/INPUT_0017.tif
DEBUG    ocrd.workspace.download_file:workspace.py:142 download_file <OcrdFile fileGrp=OCR-D-IMG ID=OCR-D-IMG_0001, mimetype=image/tiff, url=/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/OCR-D-IMG/INPUT_0017.tif, local_filename=OCR-D-IMG/INPUT_0017.tif]/>  [_recursion_count=1]
DEBUG    ocrd.resolver.download_to_directory:resolver.py:49 directory=|/tmp/test-ocrd-calamari| url=|/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/OCR-D-IMG/INPUT_0017.tif| basename=|OCR-D-IMG_0001.tif| if_exists=|skip| subdir=|OCR-D-IMG|
DEBUG    ocrd.resolver.download_to_directory:resolver.py:99 Copying file '/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/OCR-D-IMG/INPUT_0017.tif' to '/tmp/test-ocrd-calamari/OCR-D-IMG/OCR-D-IMG_0001.tif'

So, essentially, Resolver.workspace_from_url undoes the non-standard path names when downloading (the image is copied to OCR-D-IMG/OCR-D-IMG_0001.tif instead of OCR-D-IMG/INPUT_0017.tif), and subsequently the @imageFilename reference does not work (again).
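
The effect can be reproduced without OCR-D at all. The following is a simplified stand-in for the copy step seen in the debug log, not the actual core implementation; copy_with_id_basename and the temporary directory layout are made up for illustration:

```python
import shutil
import tempfile
from pathlib import Path


def copy_with_id_basename(src_base: Path, dst_dir: Path, href: str, file_id: str) -> Path:
    """Copy src_base/href to dst_dir/<subdir>/<file_id><ext>.

    Simplified stand-in mirroring the debug log; not the ocrd implementation.
    """
    src = src_base / href
    dst = dst_dir / Path(href).parent / (file_id + src.suffix)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(src, dst)
    return dst


with tempfile.TemporaryDirectory() as tmp:
    src_base = Path(tmp, "assets", "data")
    dst_dir = Path(tmp, "workspace")
    href = "OCR-D-IMG/INPUT_0017.tif"

    # Create a placeholder source image (an empty file is enough here).
    (src_base / href).parent.mkdir(parents=True)
    (src_base / href).touch()

    copied = copy_with_id_basename(src_base, dst_dir, href, "OCR-D-IMG_0001")

    # The copy exists under the ID-based name ...
    print(copied.relative_to(dst_dir).as_posix(), copied.exists())
    # ... but the original href no longer resolves inside the new workspace.
    print(href, (dst_dir / href).exists())
```

Anything that still refers to the original relative path, such as a PAGE file's @imageFilename, would then point at a file that does not exist in the copied workspace.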

@kba I suppose we could fix this in assets by using standard basenames, but it looks more like a bug in core to me.

mikegerber commented 2 years ago

Relevant parts of test_recognize.py:

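from ocrd.resolver import Resolver

# `assets` is the test-assets helper that test_recognize.py already imports
METS_KANT = assets.url_of('kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml')
WORKSPACE_DIR = '/tmp/test-ocrd-calamari'

# Clone the workspace described by the METS file into WORKSPACE_DIR
resolver = Resolver()
workspace = resolver.workspace_from_url(METS_KANT, dst_dir=WORKSPACE_DIR)

# Download every image in the OCR-D-IMG file group and show the resulting OcrdFile
for imgf in workspace.mets.find_files(fileGrp="OCR-D-IMG"):
    imgf = workspace.download_file(imgf)
    print(imgf)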

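This clones the workspace from test/assets, but the resulting files do not have the correct local filenames: they are named after the mets:file ID instead of the original INPUT_0017.tif and INPUT_0020.tif: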

<OcrdFile fileGrp=OCR-D-IMG ID=OCR-D-IMG_0001, mimetype=image/tiff, url=OCR-D-IMG/OCR-D-IMG_0001.tif, local_filename=OCR-D-IMG/OCR-D-IMG_0001.tif]/> 
<OcrdFile fileGrp=OCR-D-IMG ID=OCR-D-IMG_0002, mimetype=image/tiff, url=OCR-D-IMG/OCR-D-IMG_0002.tif, local_filename=OCR-D-IMG/OCR-D-IMG_0002.tif]/>