OCR-D / ocrd_calamari

Recognize text using Calamari OCR and the OCR-D framework
Apache License 2.0

Calamari 2.2 #61

Open mikegerber opened 3 years ago

mikegerber commented 3 years ago

Calamari 2.0 is out.

I don't see benefits from updating the dependency, other than staying up to date/compatible.

kba commented 3 years ago

@maxnth @andbue @chwick can answer this more accurately but there were quite a few refactorings and performance enhancements since the last 1.x release v1.0.5. It is not time-critical to upgrade as soon as possible but I think it would be good to get started assessing what the benefits are, what has changed and how to proceed with adapting.

mikegerber commented 3 years ago

#54 is possibly related

andbue commented 3 years ago

Hi, I don't think it should be too complicated to update ocrd_calamari to 2.0. The whole preprocessing stuff is loaded according to the definition in the model, so I was wrong to assume you're somehow circumventing it – sorry about that! The cleanest and probably most efficient way would be to implement a custom DataReader class that somehow handles the workspace data. I'm not sure, however, if your data classes can be pickled and sent to worker threads without problems.
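
Whether workspace-derived objects survive pickling can be checked up front with a round trip through `pickle`. A minimal sketch, assuming a hypothetical `LineSample` data class that stands in for whatever the wrapper would hand to a reader; neither the class nor the helper is part of Calamari or OCR-D:

```python
import pickle
from dataclasses import dataclass


@dataclass
class LineSample:
    """Hypothetical stand-in for one text line passed to a data reader."""
    line_id: str
    image_bytes: bytes  # raw image data kept as plain bytes


def is_picklable(obj) -> bool:
    """Return True if obj survives a pickle round trip unchanged."""
    try:
        return pickle.loads(pickle.dumps(obj)) == obj
    except Exception:
        # pickle raises different exception types (PicklingError,
        # TypeError, AttributeError) depending on what fails
        return False


sample = LineSample(line_id="line_0001", image_bytes=b"\x89PNG")
print(is_picklable(sample))       # a plain data class
print(is_picklable(lambda x: x))  # lambdas are a typical failure case
```

Objects that hold open file handles, locks or similar resources tend to fail this check, so it may be simpler to pass identifiers and paths to the workers and load the data there.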

If it helps, you could have a look at my client that caches preprocessed lines in a hdf5 file. I wrote a reader and included it in the DataReaderFactory here. Prediction happens here. Other than setting up the dataset and the removal of the preprocessing (which you should avoid), this is in most parts taken from predict.py anyway.

bertsky commented 3 years ago

Just wanted to note that Calamari 2 depends on tfaip which requires Python >= 3.7 – which would remove support for the default Python version on OCR-D's (still) default target Ubuntu 18.

mikegerber commented 2 years ago

I have opened https://github.com/Calamari-OCR/calamari/issues/304 because Calamari 2.1.x depended on TF 2.4.x (PyPI-incompatible with Python 3.9...), but @andbue already updated Calamari 2.2(!) to remove this restriction. 👍

With this TF version hell I think I'll first update the test rigging to test on all Python versions 3.7-3.9 (maybe even 3.10).

mikegerber commented 2 years ago

Heads up: I'm working on this

bertsky commented 2 years ago

@mikegerber, do you already have something you could share (as a feature branch)? I guess there are multiple API changes to cope with, plus perhaps a need to deal with model migration?

stefanCCS commented 2 years ago

Hi everybody, is there any update on this, now that OCR-D supports Python 3.7?

mikegerber commented 2 years ago

Sorry for not keeping anyone up to date: I plan to work on this further in the coming week!

bertsky commented 2 years ago

Do you have any news for us @mikegerber? Have you looked at the native PAGE-XML output of Calamari 2 – is it re/usable?

mikegerber commented 2 years ago

Sorry... I have been neglecting this. I will try to finish this soon after my vacation. PRs are welcome though, if they come in the meantime.

mikegerber commented 1 year ago

A combination of bad time management, serious illness (for months!) and the resulting backlog led to more delay...

mikegerber commented 1 year ago

Blocked by #84 (the GPL issues), as it seems. 🙄

mikegerber commented 1 year ago

@bertsky in #87 (https://github.com/OCR-D/ocrd_calamari/issues/87#issuecomment-1680546997):

> I would say that 2.x support is quite urgent (because most/best models are trained on 2.x). Given that Calamari 2.x now has good native PAGE support, this should actually be easy IIUC.
>
> We have 2 workarounds for that:
>
> 1. extracting line pairs via ocrd-segment-extract-lines, running the 2.x calamari-predict on them, and then re-importing with ocrd-segment-replace-text
>    ```
>     ocrd-segment-extract-lines -I $IGRP -O LINES
>     calamari-predict --pipeline.num_processes 4 --checkpoint /path/to/\*.json --data.images "LINES/*.png"
>     ocrd-segment-replace-text -I $IGRP -O $OGRP -P file_glob "LINES/*.pred.txt"
>    ```
>
> 2. running the 2.x calamari-predict on the PAGE files directly and then reimporting the resulting PAGE files into the METS via bulk-add
>    ```
>     calamari-predict --checkpoint /path/to/deep3_lsh4/\*.json --data PageXML --data.xml_files "$IGRP/*.xml" --data.images "$IMGGRP/*.png" --data.output_glyphs True --data.max_glyph_alternatives 5 --data.output_confidences True
>     ocrd workspace find -m application/vnd.prima.page+xml -G $IGRP -k page_id -k file_id -k url | while read page_id file_id url; do out=${url%.xml}.pred.xml; file_id=${file_id//$IGRP/$OGRP}; url=${url//$IGRP/$OGRP}; url=${url//pred.}; mv $out $url; echo $page_id $file_id $url; done | ocrd workspace bulk-add -r '(?P<pageid>.*) (?P<fileid>.*) (?P<url>.*)' -G $OGRP -g '{{ pageid }}' -i '{{ fileid }}' -S '{{ url }}' -
>    ```
>
> But in both cases we lose any information below the line level, including confidence (and we get no model provenance here). Also, with these recipes we cannot use the regular specialised workflow formats.
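
The file renaming in the second recipe is dense, so here is the same string logic spelled out. A sketch in Python, assuming plain file group names without glob characters; the function name `remap_prediction` and the example values are made up for illustration:

```python
def remap_prediction(file_id: str, url: str, igrp: str, ogrp: str):
    """Mirror the parameter expansions in the bulk-add one-liner.

    Returns (out, new_file_id, new_url), where `out` is the prediction
    file that the recipe expects to find next to the input PAGE file.
    """
    # out=${url%.xml}.pred.xml : strip a trailing ".xml", append ".pred.xml"
    stem = url[:-len(".xml")] if url.endswith(".xml") else url
    out = stem + ".pred.xml"
    # file_id=${file_id//$IGRP/$OGRP} : replace every occurrence
    new_file_id = file_id.replace(igrp, ogrp)
    # url=${url//$IGRP/$OGRP}; url=${url//pred.} : replace the group name,
    # then delete every "pred." substring
    new_url = url.replace(igrp, ogrp).replace("pred.", "")
    return out, new_file_id, new_url


print(remap_prediction("OCR-D-SEG_0001", "OCR-D-SEG/OCR-D-SEG_0001.xml",
                       "OCR-D-SEG", "OCR-D-OCR"))
```

The recipe then moves `out` to `new_url` and registers `new_file_id` under the output file group.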

This is all interesting, but does Calamari 2.x now have a valid license? Otherwise this is still blocked and I will not work on this.

bertsky commented 1 year ago

> This is all interesting, but does Calamari 2.x now have a valid license? Otherwise this is still blocked and I will not work on this.

I don't get why the switch to GPL would be such a blocker. Is it for you personally or by requirement?

Anyway, if you give me a definite answer then I could decide if I want to take over from here.

mikegerber commented 1 year ago

There are multiple issues:

IF Calamari 2.x

  1. CAN relicense to GPL (see https://github.com/Calamari-OCR/calamari/issues/3)
  2. and they DO relicense to GPL

THEN we could relicense to GPL. IF my employer agrees. (Personally I would agree to this. Not sure if that even matters if I agree.)

(Calamari 1.x is fine as it does not use tfaip.)

mikegerber commented 1 year ago

I'll discuss abandoning maintainership of ocrd_calamari with @cneud, but this is going to wait until at least October (I have major surgery in September).

The situation with Calamari 2.x is such that I won't use it (and a future ocrd_calamari for 2.x) due to the unresolved licensing problems; I can't legally use it.

bertsky commented 1 year ago

> The situation with Calamari 2.x is such that I won't use it (and a future ocrd_calamari for 2.x) due to the unresolved licensing problems; I can't legally use it.

Just to be as precise as possible here: you cannot use it as long as there are legal inconsistencies, or as soon as GPL kicks in?

mikegerber commented 1 year ago

> > The situation with Calamari 2.x is such that I won't use it (and a future ocrd_calamari for 2.x) due to the unresolved licensing problems; I can't legally use it.
>
> Just to be as precise as possible here: you cannot use it as long as there are legal inconsistencies, or as soon as GPL kicks in?

I'm not a lawyer, and perhaps we should discuss this (after my health stuff :)) in a video call soon, but this is how I see the situation:

a. Calamari 2.x's license is invalid. It simply can't have an Apache license while using the GPL library tfaip.*

b. Therefore - I believe - I can't use it as a user (or depend on it, as a developer, in my "own" project ocrd_calamari).

In the hypothetical situation of Calamari going GPL**, I personally do not have a problem with a GPL'ed ocrd_calamari. There's some potentially blocking red tape involved (my employer and all contributors must agree), but I - as one of the main contributors - would do it.

* This is clear IMHO, and it doesn't matter that it's Python and not using ld.

** If it can, given legacy licensing stuff from Kraken(?) or whatever the issue was.

mikegerber commented 1 year ago

(Deliberately avoiding terms like "enforcing", I think that was used in the wrong way in discussions.

I'd also like to stress that I do not care that much about licenses, it just seems to be a serious and show-stopping "bug" that there's this problem. As long as it's open source it doesn't matter to me, except that I would avoid GPL for library code, because it would cause exactly this kind of legal situation.)

mikegerber commented 1 year ago

Another way around it: Isolate GPLy code. If you just os.system('gpl-binary') then you also have no problem, when gpl-binary is GPL. Maybe worth checking, but it will be a pain to maintain properly.
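
To make the isolation idea concrete: the wrapper would only build a command line and launch the tool as a separate process, without ever importing it. A sketch under assumptions: it reuses the `calamari-predict` flags from the recipes quoted earlier in this thread, the helper names are hypothetical, and whether process separation actually settles the licensing question is not something the code can answer.

```python
import subprocess
from typing import List


def build_predict_command(checkpoint_glob: str, images_glob: str) -> List[str]:
    """Assemble the calamari-predict call as an argument list."""
    return [
        "calamari-predict",
        "--checkpoint", checkpoint_glob,
        "--data.images", images_glob,
    ]


def run_predict(checkpoint_glob: str, images_glob: str) -> int:
    """Run the external tool in its own process and return its exit code."""
    cmd = build_predict_command(checkpoint_glob, images_glob)
    # No shell is involved, so the glob patterns reach the tool unexpanded,
    # like the quoted/escaped patterns in the shell recipes.
    return subprocess.run(cmd).returncode
```

Using `subprocess.run` with an argument list instead of `os.system` avoids shell quoting problems, but the maintenance cost mentioned above remains: results have to be exchanged through files, and errors through exit codes.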

mikegerber commented 1 year ago

> > The situation with Calamari 2.x is such that I won't use it (and a future ocrd_calamari for 2.x) due to the unresolved licensing problems; I can't legally use it.
>
> Just to be as precise as possible here: you cannot use it as long as there are legal inconsistencies, or as soon as GPL kicks in?

Short answer: Because the license is invalid. If it were GPL, it would be possible for us (the developers of ocrd_calamari) to update, IF we move to GPL too.

bertsky commented 1 month ago

Thanks for being so precise and sharing your concerns!

I suggest we try to convince @andbue and @chreul to go GPL with Calamari-OCR, and then proceed with the OCR-D wrapper for 2.x here.

Native PAGE-XML support (via dataset type for input and output) does help, but I'm not sure how we can guarantee OCR-D's incremental annotation principle – we must not throw away information, even if it's irrelevant to the OCR. Also complicating the matter is the fact that OCR-D requires using AlternativeImage on all hierarchy levels and adhering to @orientation etc.

Can you please elaborate on the state of your migration (esp. around these issues) so far (or back when you were working on it)?

mikegerber commented 1 month ago

> Thanks for being so precise and sharing your concerns!
>
> I suggest we try to convince @andbue and @chreul to go GPL with Calamari-OCR, and then proceed with the OCR-D wrapper for 2.x here.

There seems to be another issue with that: https://github.com/Calamari-OCR/calamari/issues/3

mikegerber commented 1 month ago

> Can you please elaborate on the state of your migration (esp. around these issues) so far (or back when you were working on it)?

This is 100% blocked by these licensing issues, I will not work on it further until these are resolved.

bertsky commented 1 month ago

> There seems to be another issue with that: Calamari-OCR/calamari#3

Like I said on that thread:

> I also don't think the licensing deviation from Ocropy is of concern. Calamari by being GPLed cannot in any way violate Apache'd old Ocropy.

So it's not another issue AFAICS.

> This is 100% blocked by these licensing issues, I will not work on it further until these are resolved.

I understood that much, but I would really like to know how far you got so far. (It would help in gauging what's the best way to proceed currently – bringing TF SavedModel format to old Calamari versions vs. bringing OCR-D to 2.x next.)