Calamari-OCR / calamari

Line based ATR Engine based on OCRopy
Apache License 2.0

calamari/1.0: Fix 1.0.x models for Python 3.11 #348

Open mikegerber opened 10 months ago

mikegerber commented 10 months ago

We have old 1.0.x models that wouldn't run using the Calamari 1.0.x branch on Python 3.11, as the replacements use regexen now considered invalid in Python 3.11:

re.error: global flags not at the start of the expression at position 3

E.g. in our 0.ckpt.json:

            {
              "old": "\\s+(?u)",
              "new": " ",
              "regex": true
            },
            {
              "old": "\\n(?u)",
              "regex": true
            },
            {
              "old": "^\\s+(?u)",
              "regex": true
            },
            {
              "old": "\\s+$(?u)",
              "regex": true
            }

The global (?u) regex flag needs to go in front. This script fixes it: https://github.com/OCR-D/ocrd_calamari/blob/master/ocrd_calamari/fix_calamari1_model.py
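For illustration, the idea behind that fix can be sketched roughly as follows. This is not the linked script, just a minimal sketch under assumptions: the helper name `move_global_flags_to_front` is made up here, and it only handles plain global flag groups like `(?u)`. It would also rewrite such a group if it appeared inside a character class, which none of the patterns above do.

```python
import re

# Matches a global inline flag group such as (?u), (?i) or (?ms).
# Scoped groups like (?i:...) or named groups like (?P<name>...) do not match.
_GLOBAL_FLAGS = re.compile(r"\(\?([aiLmsux]+)\)")


def move_global_flags_to_front(pattern: str) -> str:
    """Return `pattern` with all global inline flags moved to the start.

    Hypothetical helper for illustration; not part of Calamari or ocrd_calamari.
    """
    flags = "".join(_GLOBAL_FLAGS.findall(pattern))
    if not flags:
        return pattern  # nothing to move
    body = _GLOBAL_FLAGS.sub("", pattern)
    # Drop repeated flag letters while keeping first-seen order.
    unique = "".join(dict.fromkeys(flags))
    return f"(?{unique}){body}"


fixed = move_global_flags_to_front(r"\s+(?u)")
compiled = re.compile(fixed)  # compiles on Python 3.11 once the flag leads
```

Patterns that already start with the flag, or that have no global flag at all, come back unchanged, so running this twice on the same model should be harmless.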

The question is if you want this "upgrading" procedure to go into the 1.0 branch's model loading code?

(I haven't checked any other 1.0 models, but I am somewhat sure that these replacements weren't customized by us and came from Calamari itself.)

mikegerber commented 10 months ago

Related issue in ocrd_calamari is here: https://github.com/OCR-D/ocrd_calamari/issues/91

andbue commented 10 months ago

Hi @mikegerber, I've just made two commits to the 1.0 branch: the first is trying to fix the regex problem and the second to make all the tests run without warning. Could you please test if this works with ocrd_calamari?

mikegerber commented 10 months ago

> Hi @mikegerber, I've just made two commits to the 1.0 branch: the first is trying to fix the regex problem and the second to make all the tests run without warning. Could you please test if this works with ocrd_calamari?

Unfortunately this only got called for the default parameters, not the ones read from the model on disk. I've had another look and opened PR #349. That PR fixes the issue for me!
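To make the "read from the model on disk" case concrete, here is a rough sketch of applying the fix to parameters after they have been parsed from a checkpoint JSON. It is not the code from PR #349. The function names are hypothetical, and the JSON layout in the demo (a top-level `replacements` list) is an assumption for illustration only, which is why the walk is recursive rather than tied to a specific path.

```python
import json
import re

# Global inline flag group such as (?u); scoped groups like (?i:...) do not match.
_GLOBAL_FLAGS = re.compile(r"\(\?([aiLmsux]+)\)")


def fix_pattern(pattern: str) -> str:
    """Move global inline flags to the start of the pattern (hypothetical helper)."""
    flags = "".join(dict.fromkeys("".join(_GLOBAL_FLAGS.findall(pattern))))
    if not flags:
        return pattern
    return f"(?{flags}){_GLOBAL_FLAGS.sub('', pattern)}"


def fix_replacements(node) -> int:
    """Recursively fix every dict that has "regex": true and a string "old".

    Modifies `node` in place and returns the number of patterns changed.
    """
    changed = 0
    if isinstance(node, dict):
        if node.get("regex") is True and isinstance(node.get("old"), str):
            fixed = fix_pattern(node["old"])
            if fixed != node["old"]:
                node["old"] = fixed
                changed += 1
        for value in node.values():
            changed += fix_replacements(value)
    elif isinstance(node, list):
        for item in node:
            changed += fix_replacements(item)
    return changed


# Demo on an assumed, simplified structure (not the real 0.ckpt.json layout).
checkpoint = json.loads(
    r'''{"replacements": [
          {"old": "\\s+(?u)", "new": " ", "regex": true},
          {"old": "abc", "new": "x", "regex": false}
        ]}'''
)
n_fixed = fix_replacements(checkpoint)
```

Running the walk on the loaded parameters, rather than only on the defaults, is what covers old models whose patterns were serialized with the trailing flag.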

mikegerber commented 10 months ago

(I have 2 other issues with 1.0.x - more NumPy noise and another small issue with noise in the output. If you want to release another 1.0.x version maybe wait a little bit, I still need to investigate if it's Calamari or ocrd_calamari.)

andbue commented 10 months ago

Thanks a lot – I didn't have an old model so I was just guessing where to fix the regexes... If you have any other suggestions, I'll gladly include them in the 1.0.7 release!

mikegerber commented 10 months ago

Yeah I should have linked our historic model so you can reproduce :)

If you need a working model for 1.0 in the future: https://qurator-data.de/calamari-models/GT4HistOCR/2019-12-11T11_10+0100/model.tar.xz

(only for old prints/Fraktur)

mikegerber commented 10 months ago

> If you have any other suggestions, I'll gladly include them in the 1.0.7 release!

I'll try to debug today!

mikegerber commented 10 months ago

The other issues:

So I think you could release 1.0.7 when #350 is merged and this issue can be closed too :)

mikegerber commented 10 months ago

Never mind, there are still lots of DeprecationWarnings I'd like to take a look at first (other than the ones @andbue thankfully already fixed)

bertsky commented 1 month ago

Sorry @mikegerber – I did not see #350 when releasing 1.0.7.

I'll merge that soon. So perhaps we should do another 1.0.8 ...

More than anything else we urgently need to backport the recently added support for TF SavedModel format to all the older branches – because HDF5 models stop working across the Python 3.8 / 3.9 boundary IIRC. The main problem with that is we cannot just increase the version tag of the older models retroactively (as was done with 5→6 in master). I have discussed this with @andbue and he is inclined to implement the auto-conversion without version update there.

bertsky commented 1 month ago

> I'll merge that soon. So perhaps we should do another 1.0.8 ...

done.

I suggest we keep this issue open to track progress with the SavedModel format conversion in 1.x (and the other older branches).