cisocrgroup / ocrd_cis

OCR-D python tools
MIT License

ocrd-cis-ocropy-recognize: 'ascii' codec can't decode byte 0xa9 #41

Closed jbarth-ubhd closed 4 years ago

jbarth-ubhd commented 4 years ago

models:

> find . -name *.pyrnn|xargs md5sum
bb90b17321987002afa6b94e650d16fa  ./venv/lib/python3.6/site-packages/ocrd_cis/ocropy/models/fraktur.pyrnn
ef3238cd60cb1c35ede74573c8d14766  ./venv/lib/python3.6/site-packages/ocrd_cis/ocropy/models/fraktur-jze.pyrnn

file: https://digi.ub.uni-heidelberg.de/diglitData/jb/ocropy-test.jpg

command:

> ocrd-make -f crop-anyocr-binarize-page-olena-sauvola-denoise-ocropy-deskew-page-ocropy-segment-tesseract-ocropy-dewarp-ocr-ocropy-tesseract.`mk 
make: Entering directory '/home/jb/workspace/ocrd/ocrd4dwork'
building OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP from OCR-D-SEG-LINE-tesseract-ocropy-DEWARP with pattern rule for ocrd-cis-ocropy-recognize
ocrd workspace remove-group -r OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP 2>/dev/null || true
ocrd-cis-ocropy-recognize -I OCR-D-SEG-LINE-tesseract-ocropy-DEWARP -O OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP -p OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP.json 2>&1 | tee OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP.log && touch -c OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP || { rm -fr OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP.json OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP; exit 1; }
16:39:06.634 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-SEG-LINE-tesseract-ocropy-DEWARP'] output_file_grp=['OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP']
Traceback (most recent call last):
  File "/home/jb/ocrd_all/venv/bin/ocrd-cis-ocropy-recognize", line 8, in <module>
    sys.exit(ocrd_cis_ocropy_recognize())
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd_cis/ocropy/cli.py", line 49, in ocrd_cis_ocropy_recognize
    return ocrd_cli_wrap_processor(OcropyRecognize, *args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd/decorators.py", line 54, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd/processor/base.py", line 57, in run_processor
    processor.process()
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd_cis/ocropy/recognize.py", line 134, in process
    self.network = load_object(self.get_model(), verbose=1)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd_cis/ocropy/ocrolib/common.py", line 459, in load_object
    return unpickler.load()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa9 in position 0: ordinal not in range(128)
Makefile:304: recipe for target 'OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP' failed
make: *** [OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP] Error 1
make: Leaving directory '/home/jb/workspace/ocrd/ocrd4dwork'

bertsky commented 4 years ago

Thanks for reporting!

I believe this is an artifact of incomplete Python 2-3 porting. You can avoid it by leaving the file in gzip-compressed form (with .gz extension).

IMO, the uncompressed case needs to use the same latin1 encoding as the gzip-compressed case when unpickling.
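
To illustrate the encoding issue: the following is a minimal sketch, not the actual ocrolib code. It assumes the model is an ordinary Python 2 pickle; the helper name `load_py2_pickle` and the tiny hand-written pickle `PY2_PICKLE` are made up for the demonstration. Python 3 decodes Python 2 byte strings as ASCII by default, which fails on bytes like 0xa9, whereas passing `encoding="latin1"` accepts any byte value (the Python documentation also names it as required for NumPy arrays pickled by Python 2).

```python
import gzip
import pickle


def load_py2_pickle(path):
    """Unpickle a Python 2 pickle, gzip-compressed or not (hypothetical helper).

    Both branches pass the same encoding, so the plain file no longer
    falls back to the default ASCII decoding.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        return pickle.load(stream, encoding="latin1")


# Hand-written protocol-2 pickle of the Python 2 byte string '\xa9'
# (illustrative data, not taken from a real model file).
PY2_PICKLE = b"\x80\x02U\x01\xa9q\x00."

try:
    pickle.loads(PY2_PICKLE)  # default encoding is ASCII
except UnicodeDecodeError as err:
    print("default encoding fails:", err)

# With latin1, every byte maps to a code point, so decoding cannot fail.
print(repr(pickle.loads(PY2_PICKLE, encoding="latin1")))
```

Whether the real `load_object` can be fixed this simply depends on how it constructs its unpickler, which is not shown in this thread.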

jbarth-ubhd commented 4 years ago

Tried it, but then ocrd-cis-ocropy-recognize does not find the *.pyrnn.gz

bertsky commented 4 years ago

That's odd. Relative paths should be searched:

  1. in __file__'s directory, e.g. venv/lib/python3.6/site-packages/ocrd_cis/ocropy
  2. in __file__'s models subdirectory, e.g. venv/lib/python3.6/site-packages/ocrd_cis/ocropy/models
  3. in any of the directories mentioned in ocrolib.ocropus_find_file:

    Result of searching $fname is the first existing in:
    
        * $base/$fname
        * $base/$fname.gz
        * $base/model/$fname
        * $base/model/$fname.gz
        * $base/data/$fname
        * $base/data/$fname.gz
        * $base/gui/$fname
        * $base/gui/$fname.gz   # if gz
    
    $base can be any of these base paths:
        * `$OCROPUS_DATA` environment variable
        * current working directory
        * ../../../../share/ocropus from this file's install location
        * `/usr/local/share/ocropus`
        * `$PREFIX/share/ocropus` ($PREFIX being the Python installation
           prefix, usually `/usr`)

3 probably won't help you, because the CWD is the OCR-D workspace directory in the processor's context, and you probably never installed ocropus itself.

So, you should stick with 1 or 2, in the .gz form (until we have patched the uncompressed case).

Perhaps you forgot to also add the .gz suffix in the makefile/parameter file?
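
The search order described above can be sketched roughly as follows. This is an illustration only, not the code in ocrd_cis or ocrolib: the function names `candidate_paths` and `find_model`, and the `module_dir` and `bases` parameters, are hypothetical. It also assumes that steps 1 and 2 look for exactly the name given in the parameter file, while only step 3 tries the additional .gz variant.

```python
import os


def candidate_paths(fname, module_dir, bases):
    """Return candidate locations for a relative model name, in search order."""
    candidates = [
        os.path.join(module_dir, fname),            # 1. the module's directory
        os.path.join(module_dir, "models", fname),  # 2. its models subdirectory
    ]
    for base in bases:                              # 3. ocropus-style base paths
        for sub in ("", "model", "data", "gui"):
            for suffix in ("", ".gz"):
                candidates.append(os.path.join(base, sub, fname + suffix))
    return candidates


def find_model(fname, module_dir, bases):
    """Return the first existing candidate, or None if nothing matches."""
    for path in candidate_paths(fname, module_dir, bases):
        if os.path.exists(path):
            return path
    return None
```

Under that assumption, a file stored as `models/fraktur.pyrnn.gz` is only found when the parameter file asks for `fraktur.pyrnn.gz`; asking for `fraktur.pyrnn` checks steps 1 and 2 for the uncompressed name only.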