OCR-D / ocrd_calamari

Recognize text using Calamari OCR and the OCR-D framework
Apache License 2.0

AttributeError: module 'numpy' has no attribute 'str'. #87

Closed mikegerber closed 10 months ago

mikegerber commented 1 year ago

Using

I get this error:

18:00:28.245 INFO processor.CalamariRecognize - About to recognize 27 lines of region 'r13'
/home/b-mg106/.pyenv/versions/3.9.16/envs/ocrd_calamari/lib/python3.9/site-packages/calamari_ocr/ocr/backends/tensorflow_backend/tensorflow_model.py:338: FutureWarning: In the future `np.str` will be defined as the corresponding NumPy scalar.
  [x / 255, len_x, np.zeros((len(x), 1), dtype=np.str)],
Traceback (most recent call last):
  File "/home/b-mg106/.pyenv/versions/ocrd_calamari/bin/ocrd-calamari-recognize", line 33, in <module>
    sys.exit(load_entry_point('ocrd-calamari', 'console_scripts', 'ocrd-calamari-recognize')())
  File "/home/b-mg106/.pyenv/versions/3.9.16/envs/ocrd_calamari/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/b-mg106/.pyenv/versions/3.9.16/envs/ocrd_calamari/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/b-mg106/.pyenv/versions/3.9.16/envs/ocrd_calamari/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/b-mg106/.pyenv/versions/3.9.16/envs/ocrd_calamari/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/b-mg106/devel/ocrd_calamari/ocrd_calamari/cli.py", line 13, in ocrd_calamari_recognize
    return ocrd_cli_wrap_processor(CalamariRecognize, *args, **kwargs)
  File "/home/b-mg106/.pyenv/versions/3.9.16/envs/ocrd_calamari/lib/python3.9/site-packages/ocrd/decorators/__init__.py", line 117, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/b-mg106/.pyenv/versions/3.9.16/envs/ocrd_calamari/lib/python3.9/site-packages/ocrd/processor/helpers.py", line 121, in run_processor
    processor.process()
  File "/home/b-mg106/devel/ocrd_calamari/ocrd_calamari/recognize.py", line 121, in process
    for line, line_coords, raw_results in zip(textlines, line_coordss, raw_results_all):
  File "/home/b-mg106/.pyenv/versions/3.9.16/envs/ocrd_calamari/lib/python3.9/site-packages/calamari_ocr/ocr/predictor.py", line 250, in predict_raw
    for result in zip(*prediction):
  File "/home/b-mg106/.pyenv/versions/3.9.16/envs/ocrd_calamari/lib/python3.9/site-packages/calamari_ocr/ocr/predictor.py", line 166, in predict_raw
    for p, ip in zip(self.network.predict_raw(input_images), input_params):
  File "/home/b-mg106/.pyenv/versions/3.9.16/envs/ocrd_calamari/lib/python3.9/site-packages/calamari_ocr/ocr/backends/model_interface.py", line 62, in predict_raw
    for r in self.predict_raw_batch(*self.zero_padding(x)):
  File "/home/b-mg106/.pyenv/versions/3.9.16/envs/ocrd_calamari/lib/python3.9/site-packages/calamari_ocr/ocr/backends/tensorflow_backend/tensorflow_model.py", line 338, in predict_raw_batch
    [x / 255, len_x, np.zeros((len(x), 1), dtype=np.str)],
  File "/home/b-mg106/.pyenv/versions/3.9.16/envs/ocrd_calamari/lib/python3.9/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'str'.
`np.str` was a deprecated alias for the builtin `str`. To avoid this error in existing code, use `str` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.str_` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

mikegerber commented 1 year ago

The deprecation of np.str (and similar) has expired in NumPy 1.24.0; i.e. it now throws an error.
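A minimal sketch of the replacement that the error message itself suggests, modelled on the failing expression `np.zeros((len(x), 1), dtype=np.str)` from the traceback. The helper name `empty_text_column` is hypothetical and only serves the illustration; this is not necessarily how Calamari fixed it upstream.

```python
import numpy as np


def empty_text_column(batch_size: int) -> np.ndarray:
    """Return a (batch_size, 1) array of empty strings.

    Hypothetical helper: it mirrors the shape of the failing call in the
    traceback, but uses the builtin `str` instead of the removed `np.str`.
    """
    # `dtype=str` is what `dtype=np.str` always meant, since `np.str` was
    # only an alias for the builtin. `np.str_` would be the NumPy scalar type.
    return np.zeros((batch_size, 1), dtype=str)


col = empty_text_column(3)
print(col.shape, col.dtype)
```

Since `np.str` was only an alias for the builtin, this spelling behaves the same on NumPy releases before and after 1.24.0.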

Until we can update to Calamari 2, I'll require NumPy < 1.24.0 as a workaround.
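As a rough illustration of what the pin guards against, here is a sketch that checks whether a given NumPy version string falls at or after 1.24, where the aliases are gone. The function names are hypothetical, and the parsing is deliberately simplistic (it only looks at the leading digits of the first three components); the actual workaround is just a `numpy < 1.24.0` requirement.

```python
from importlib import metadata


def parse_release(version: str) -> tuple:
    """Turn a version such as '1.23.5' or '1.20.0rc1' into (1, 23, 5) / (1, 20, 0)."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as 'rc1'
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def has_removed_aliases(version: str) -> bool:
    """True if this NumPy version no longer provides `np.str` and friends."""
    return parse_release(version) >= (1, 24)


try:
    installed = metadata.version("numpy")
except metadata.PackageNotFoundError:
    installed = None  # NumPy is not installed in this environment

if installed is not None and has_removed_aliases(installed):
    print(f"numpy {installed}: code using np.str will raise AttributeError")
```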

mikegerber commented 1 year ago

Managed to get it to work, using numpy 1.23.5. CircleCI is still failing.

mikegerber commented 1 year ago

The filename of the created file seems to have changed in OCR-D, so the tests fail. I'm correcting it. (This is unrelated to the NumPy problem, but it illustrates why scheduled tests seem to be useful.)

bertsky commented 1 year ago

Note: on calamari/1.0 branch, this had already been fixed – perhaps we just need another release?

mikegerber commented 1 year ago

> Note: on calamari/1.0 branch, this had already been fixed – perhaps we just need another release?

Thanks for making me aware of this open issue again! I will look into it!

Is there an urgent issue in ocrd_all with no working workaround?

bertsky commented 1 year ago

> > Note: on calamari/1.0 branch, this had already been fixed – perhaps we just need another release?

> Thanks for making me aware of this open issue again! I will look into it!

Also note https://github.com/Calamari-OCR/calamari/pull/341, which hopefully will be enough to make ocrd_calamari 1.0.5 (before the workarounds) work out of the box.

> Is there an urgent issue in ocrd_all with no working workaround?

I would say that 2.x support is quite urgent (because most/best models are trained on 2.x). Given that Calamari 2.x now has good native PAGE support, this should actually be easy IIUC.

We have 2 workarounds for that:

  1. extracting line pairs via ocrd-segment-extract-lines, running the 2.x calamari-predict on them, and then re-importing with ocrd-segment-replace-text

    ocrd-segment-extract-lines -I $IGRP -O LINES
    calamari-predict --pipeline.num_processes 4 --checkpoint /path/to/\*.json --data.images "LINES/*.png"
    ocrd-segment-replace-text -I $IGRP -O $OGRP -P file_glob "LINES/*.pred.txt"
  2. running the 2.x calamari-predict on the PAGE files directly and then reimporting the resulting PAGE files into the METS via bulk-add

    calamari-predict --checkpoint /path/to/deep3_lsh4/\*.json --data PageXML --data.xml_files "$IGRP/*.xml" --data.images "$IMGGRP/*.png" --data.output_glyphs True --data.max_glyph_alternatives 5 --data.output_confidences True
    ocrd workspace find -m application/vnd.prima.page+xml -G $IGRP -k page_id -k file_id -k url |
      while read page_id file_id url; do
        out=${url%.xml}.pred.xml
        file_id=${file_id//$IGRP/$OGRP}
        url=${url//$IGRP/$OGRP}
        url=${url//pred.}
        mv $out $url
        echo $page_id $file_id $url
      done |
      ocrd workspace bulk-add -r '(?P<pageid>.*) (?P<fileid>.*) (?P<url>.*)' -G $OGRP -g '{{ pageid }}' -i '{{ fileid }}' -S '{{ url }}' -

But in both cases we lose any information below the line level, including confidence (and we get no model provenance here). Also, with these recipes we cannot use the regular specialised workflow formats.
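To make the renaming in the second recipe easier to follow, here is a sketch of the same string manipulations in Python, together with the regular expression that is handed to `ocrd workspace bulk-add -r`. The function name `plan_reimport` and the example file group names are made up for illustration, and the sketch assumes the group names contain no shell glob characters (the shell substitutions would treat those as patterns).

```python
import re

# Same pattern that the recipe passes to `ocrd workspace bulk-add -r`.
BULK_ADD_PATTERN = re.compile(r"(?P<pageid>.*) (?P<fileid>.*) (?P<url>.*)")


def plan_reimport(page_id, file_id, url, igrp, ogrp):
    """Return (path written by calamari-predict, line fed to bulk-add)."""
    # out=${url%.xml}.pred.xml
    stem = url[: -len(".xml")] if url.endswith(".xml") else url
    predicted = stem + ".pred.xml"
    # file_id=${file_id//$IGRP/$OGRP}
    new_file_id = file_id.replace(igrp, ogrp)
    # url=${url//$IGRP/$OGRP}; url=${url//pred.}
    new_url = url.replace(igrp, ogrp).replace("pred.", "")
    return predicted, f"{page_id} {new_file_id} {new_url}"


# Hypothetical page and file group names, for illustration only.
predicted, line = plan_reimport(
    "PHYS_0001", "OCR-D-SEG_0001", "OCR-D-SEG/OCR-D-SEG_0001.xml",
    igrp="OCR-D-SEG", ogrp="OCR-D-OCR",
)
fields = BULK_ADD_PATTERN.fullmatch(line).groupdict()
print(predicted)
print(fields)
```

The shell loop additionally moves each predicted file to its new URL before the line is passed on; the sketch only computes the names.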

mikegerber commented 1 year ago

> I would say that 2.x support is quite urgent (because most/best models are trained on 2.x). Given that Calamari 2.x now has good native PAGE support, this should actually be easy IIUC.

Moving this to #61.

bertsky commented 1 year ago

> > Thanks for making me aware of this open issue again! I will look into it!

> Also note Calamari-OCR/calamari#341, which hopefully will be enough to make ocrd_calamari 1.0.5 (before the workarounds) work out of the box.

This has now happened: there is a new calamari-ocr==1.0.6 which already takes care of the NumPy and Protobuf problems, so you could revert ocrd_calamari to basically what we had before these workarounds and release an ocrd_calamari==1.0.6.post1 or so.

mikegerber commented 11 months ago

Removing the numpy workaround now as it seems to be fixed in calamari-ocr indeed. I get another warning though using Python 3.11 and the latest numpy... Need to investigate.

❯ make test
[ ... ]
========================================== warnings summary ==========================================
../../.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/numpy/core/getlimits.py:542
  /home/b-mg106/.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/numpy/core/getlimits.py:542: UserWarning: Signature b'\x00\xd0\xcc\xcc\xcc\xcc\xcc\xcc\xfb\xbf\x00\x00\x00\x00\x00\x00' for <class 'numpy.longdouble'> does not match any known type: falling back to type probe function.
  This warnings indicates broken support for the dtype!
    machar = _get_machar(dtype)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================== 4 passed, 1 warning in 267.15s (0:04:27) ==============================

mikegerber commented 10 months ago

This issue can be closed. There's