Closed mikegerber closed 10 months ago
The deprecation of np.str (and similar) has expired in NumPy 1.24.0; i.e. it now throws an error.
Until we can update to Calamari 2, I'll require NumPy < 1.24.0 as a workaround.
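For illustration, the pin could also be mirrored by an early runtime check, so that an affected environment fails with a readable message instead of an AttributeError deep inside Calamari. This is only a minimal sketch: the helper names (`release_tuple`, `numpy_is_too_new`, `check_installed_numpy`) are hypothetical and not part of ocrd_calamari, and the only fact taken from this thread is the 1.24.0 boundary.

```python
import re
from importlib import metadata

# First NumPy release in which np.str (and similar aliases) no longer exist.
FIRST_BROKEN = (1, 24)


def release_tuple(version):
    """Return (major, minor) parsed from a version string such as '1.23.5'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def numpy_is_too_new(version):
    """True if this NumPy version has already dropped np.str and similar."""
    return release_tuple(version) >= FIRST_BROKEN


def check_installed_numpy():
    """Raise early if the installed NumPy is affected (hypothetical helper)."""
    # metadata.version raises PackageNotFoundError if NumPy is not installed.
    version = metadata.version("numpy")
    if numpy_is_too_new(version):
        raise RuntimeError(
            f"NumPy {version} found, but NumPy < 1.24.0 is required"
        )
    return version
```

The check reads the installed version from package metadata rather than importing NumPy, so it stays cheap and works even if the import itself would fail.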
Managed to get it to work, using numpy 1.23.5. CircleCI is still failing.
The filename of the created file seems to have changed in OCR-D, so the tests fail. I'm correcting it. (This is unrelated to the NumPy problem, but it illustrates why scheduled tests seem to be useful.)
Note: on the calamari/1.0 branch, this had already been fixed – perhaps we just need another release?
Thanks for making me aware of this open issue again! I will look into it!
Is there an urgent issue in ocrd_all with no working workaround?
Also note https://github.com/Calamari-OCR/calamari/pull/341, which hopefully will be enough to make ocrd_calamari 1.0.5 (before the workarounds) work out of the box.
I would say that 2.x support is quite urgent (because most/best models are trained on 2.x). Given that Calamari 2.x now has good native PAGE support, this should actually be easy IIUC.
We have 2 workarounds for that:

1. extracting line pairs via ocrd-segment-extract-lines, running the 2.x calamari-predict on them, and then re-importing with ocrd-segment-replace-text

    ocrd-segment-extract-lines -I $IGRP -O LINES
    calamari-predict --pipeline.num_processes 4 --checkpoint /path/to/\*.json --data.images "LINES/*.png"
    ocrd-segment-replace-text -I $IGRP -O $OGRP -P file_glob "LINES/*.pred.txt"

2. running the 2.x calamari-predict on the PAGE files directly and then reimporting the resulting PAGE files into the METS via bulk-add

    calamari-predict --checkpoint /path/to/deep3_lsh4/\*.json --data PageXML --data.xml_files "$IGRP/*.xml" --data.images "$IMGGRP/*.png" --data.output_glyphs True --data.max_glyph_alternatives 5 --data.output_confidences True
    ocrd workspace find -m application/vnd.prima.page+xml -G $IGRP -k page_id -k file_id -k url |
    while read page_id file_id url; do
        out=${url%.xml}.pred.xml
        file_id=${file_id//$IGRP/$OGRP}
        url=${url//$IGRP/$OGRP}
        url=${url//pred.}
        mv $out $url
        echo $page_id $file_id $url
    done |
    ocrd workspace bulk-add -r '(?P<pageid>.*) (?P<fileid>.*) (?P<url>.*)' -G $OGRP -g '{{ pageid }}' -i '{{ fileid }}' -S '{{ url }}' -
But in both cases we lose any information below the line level, including confidence, and we get no model provenance here. Also, with these recipes we cannot use the regular specialised workflow formats.
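If it helps anyone scripting this, the first recipe (line extraction, prediction, text re-import) could be wrapped roughly as follows. This is just a sketch: the function names and parameters are made up for illustration, and the command lines only use the options already shown in the recipe. Because no shell is involved, the glob patterns are passed through literally, which matches the quoted and escaped patterns in the shell version.

```python
import subprocess


def line_workaround_commands(igrp, ogrp, checkpoint_glob,
                             lines_grp="LINES", num_processes=4):
    """Build the three commands of the line-based workaround as argv lists.

    All parameter names are hypothetical; the options themselves are the
    ones used in the shell recipe.
    """
    extract = ["ocrd-segment-extract-lines", "-I", igrp, "-O", lines_grp]
    predict = [
        "calamari-predict",
        "--pipeline.num_processes", str(num_processes),
        "--checkpoint", checkpoint_glob,
        "--data.images", f"{lines_grp}/*.png",
    ]
    replace = [
        "ocrd-segment-replace-text", "-I", igrp, "-O", ogrp,
        "-P", "file_glob", f"{lines_grp}/*.pred.txt",
    ]
    return [extract, predict, replace]


def run_all(commands):
    """Run the commands in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_all(line_workaround_commands(igrp, ogrp, checkpoint_glob))` inside the workspace directory would then run the three steps in sequence. The limitations mentioned above (nothing below the line level, no model provenance) apply just the same.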
Moving this to #61.
Also note Calamari-OCR/calamari#341, which hopefully will be enough to make ocrd_calamari 1.0.5 (before the workarounds) work out of the box.
This has now happened: there is a new calamari-ocr==1.0.6 which already takes care of the NumPy and Protobuf problems, so you could rewrite ocrd_calamari to basically what we had before these workarounds and release an ocrd_calamari==1.0.6.post1 or so.
Removing the numpy workaround now as it seems to be fixed in calamari-ocr indeed. I get another warning though using Python 3.11 and the latest numpy... Need to investigate.
❯ make test
[ ... ]
========================================== warnings summary ==========================================
../../.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/numpy/core/getlimits.py:542
/home/b-mg106/.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/numpy/core/getlimits.py:542: UserWarning: Signature b'\x00\xd0\xcc\xcc\xcc\xcc\xcc\xcc\xfb\xbf\x00\x00\x00\x00\x00\x00' for <class 'numpy.longdouble'> does not match any known type: falling back to type probe function.
This warnings indicates broken support for the dtype!
machar = _get_machar(dtype)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================== 4 passed, 1 warning in 267.15s (0:04:27) ==============================
This issue can be closed. There's
Using
I get this error: