Open mikegerber opened 3 years ago
@maxnth @andbue @chwick can answer this more accurately but there were quite a few refactorings and performance enhancements since the last 1.x release v1.0.5. It is not time-critical to upgrade as soon as possible but I think it would be good to get started assessing what the benefits are, what has changed and how to proceed with adapting.
Hi, I don't think it should be too complicated to update ocrd_calamari to 2.0. The whole preprocessing stuff is loaded according to the definition in the model, so I was wrong to assume you're somehow circumventing it – sorry about that! The cleanest and probably most efficient way would be to implement a custom DataReader class that somehow handles the workspace data. I'm not sure, however, if your data classes can be pickled and sent to worker threads without problems.
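One cheap way to find out whether a given object would survive being sent to a worker is a pickle round trip. The following is only a minimal sketch using the standard library; the sample objects are hypothetical stand-ins and not ocrd_calamari or Calamari data classes.

```python
import pickle
import threading


def is_picklable(obj) -> bool:
    """Return True if obj survives a pickle round trip, False otherwise."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        # pickle raises PicklingError, TypeError or AttributeError depending on the object
        return False


# Hypothetical per-line record made of plain built-in types: expected to pickle fine.
plain_record = {"line_id": "l1", "pixels": b"\x00\x01"}

# Hypothetical record holding a lock, a typical example of unpicklable state.
record_with_lock = {"line_id": "l1", "lock": threading.Lock()}

plain_ok = is_picklable(plain_record)
lock_ok = is_picklable(record_with_lock)
```

If a real data class fails such a check, the usual way out is to pass only plain data (IDs, file paths, arrays) to the workers and rebuild the richer objects on the other side.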
If it helps, you could have a look at my client that caches preprocessed lines in an HDF5 file. I wrote a reader and included it in the DataReaderFactory here. Prediction happens here. Other than setting up the dataset and removing the preprocessing (which you should avoid), this is for the most part taken from predict.py anyway.
Just wanted to note that Calamari 2 depends on tfaip which requires Python >= 3.7 – which would remove support for the default Python version on OCR-D's (still) default target Ubuntu 18.
I have opened https://github.com/Calamari-OCR/calamari/issues/304 because Calamari 2.1.x depended on TF 2.4.x (PyPI-incompatible with Python 3.9...), but @andbue already updated Calamari 2.2(!) to remove this restriction. 👍
With this TF version hell I think I'll first update the test rigging to test on all Python versions 3.7-3.9 (maybe even 3.10).
Heads up: I'm working on this
@mikegerber, do you already have something you could share (as a feature branch)? I guess there are multiple API changes to cope with, plus perhaps a need to deal with model migration?
Hi everybody, any update concerning this, as OCR-D now supports Python 3.7?
Sorry for not keeping anyone up to date: I plan to work on this further in the coming week!
Do you have any news for us @mikegerber? Have you looked at the native PAGE-XML output of Calamari 2 – is it re/usable?
Sorry... I have been neglecting this. I'll try to finish this soon after my vacation. PRs welcome though, if they come in the meantime.
A combination of bad time management, serious illness (for months!) and the resulting backlog led to more delay...
Blocked by #84 (the GPL issues), as it seems. 🙄
@bertsky in #87 (https://github.com/OCR-D/ocrd_calamari/issues/87#issuecomment-1680546997):
I would say that 2.x support is quite urgent (because most/best models are trained on 2.x). Given that Calamari 2.x now has good native PAGE support, this should actually be easy IIUC.
We have 2 workarounds for that:
1. extracting line pairs via ocrd-segment-extract-lines, running the 2.x calamari-predict on them, and then re-importing with ocrd-segment-replace-text
```
ocrd-segment-extract-lines -I $IGRP -O LINES
calamari-predict --pipeline.num_processes 4 --checkpoint /path/to/\*.json --data.images "LINES/*.png"
ocrd-segment-replace-text -I $IGRP -O $OGRP -P file_glob "LINES/*.pred.txt"
```
2. running the 2.x calamari-predict on the PAGE files directly and then reimporting the resulting PAGE files into the METS via bulk-add
```
calamari-predict --checkpoint /path/to/deep3_lsh4/\*.json --data PageXML --data.xml_files "$IGRP/*.xml" --data.images "$IMGGRP/*.png" --data.output_glyphs True --data.max_glyph_alternatives 5 --data.output_confidences True
ocrd workspace find -m application/vnd.prima.page+xml -G $IGRP -k page_id -k file_id -k url | \
  while read page_id file_id url; do
    out=${url%.xml}.pred.xml
    file_id=${file_id//$IGRP/$OGRP}
    url=${url//$IGRP/$OGRP}
    url=${url//pred.}
    mv $out $url
    echo $page_id $file_id $url
  done | \
  ocrd workspace bulk-add -r '(?P<pageid>.*) (?P<fileid>.*) (?P<url>.*)' -G $OGRP -g '{{ pageid }}' -i '{{ fileid }}' -S '{{ url }}' -
```
But in both cases we lose any information below the line level, including confidence, and we get no model provenance. Also, with these recipes we cannot use the regular specialised workflow formats.
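For readers less used to bash parameter expansion, the renaming done inside the `while read` loop of the second recipe can be sketched in Python roughly as follows. This is an illustration only: the function name and the file group names in the example call are made up, and plain string replacement only matches the bash behaviour as long as the group names contain no glob characters.

```python
def rewrite_entry(file_id, url, igrp, ogrp):
    """Mimic the loop's parameter expansions for one (file_id, url) pair."""
    # out=${url%.xml}.pred.xml : strip a trailing ".xml", then append ".pred.xml"
    stem = url[:-len(".xml")] if url.endswith(".xml") else url
    out = stem + ".pred.xml"
    # file_id=${file_id//$IGRP/$OGRP} : replace every occurrence of the input group name
    new_file_id = file_id.replace(igrp, ogrp)
    # url=${url//$IGRP/$OGRP}; url=${url//pred.} : same for the URL, then drop any "pred."
    new_url = url.replace(igrp, ogrp).replace("pred.", "")
    return out, new_file_id, new_url


# Hypothetical file group names, purely for illustration.
out, new_file_id, new_url = rewrite_entry(
    "OCR-D-SEG_0001", "OCR-D-SEG/OCR-D-SEG_0001.xml", "OCR-D-SEG", "OCR-D-OCR"
)
```

Here `out` is the file that calamari-predict wrote next to the input, and `new_url` is where the loop moves it before registering it under the output group.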
This is all interesting, but does Calamari 2.x now have a valid license? Otherwise this is still blocked and I will not work on this.
> This is all interesting, but does Calamari 2.x now have a valid license? Otherwise this is still blocked and I will not work on this.
I don't get why the switch to GPL would be such a blocker. Is it for you personally or by requirement?
Anyway, if you give me a definite answer then I could decide if I want to take over from here.
There are multiple issues:
IF Calamari 2.x
THEN we could relicense to GPL. IF my employer agrees. (Personally I would agree to this. Not sure if that even matters if I agree.)
(Calamari 1.x is fine as it does not use tfaip.)
I'll discuss abandoning maintainership of ocrd_calamari with @cneud, but this is going to wait until at least October (I have major surgery in September).
The situation with Calamari 2.x is such that I won't use it (and a future ocrd_calamari for 2.x) due to the unresolved licensing problems; I can't legally use it.
> The situation with Calamari 2.x is such that I won't use it (and a future ocrd_calamari for 2.x) due to the unresolved licensing problems; I can't legally use it.
Just to be as precise as possible here: you cannot use it as long as there are legal inconsistencies, or as soon as GPL kicks in?
> > The situation with Calamari 2.x is such that I won't use it (and a future ocrd_calamari for 2.x) due to the unresolved licensing problems; I can't legally use it.
> Just to be as precise as possible here: you cannot use it as long as there are legal inconsistencies, or as soon as GPL kicks in?
I'm not a lawyer, and perhaps we should discuss this (after my health stuff :)) in a video call soon, but this is how I see the situation:
a. Calamari 2.x's license is invalid. It simply can't have an Apache license while using the GPL library tfaip.*
b. Therefore - I believe - I can't use it as a user (or depend on it, as a developer, in my "own" project ocrd_calamari)
In the hypothetical situation of Calamari going GPL**, I personally do not have a problem with a GPL'ed ocrd_calamari. There's some potentially blocking red tape involved (my employer and all contributors must agree), but I - as one of the main contributors - would do it.
* This is clear IMHO, and it doesn't matter that it's Python and not using ld
** If it can, given legacy licensing stuff from Kraken(?) or whatever the issue was
(Deliberately avoiding terms like "enforcing"; I think that was used in the wrong way in discussions.
I'd also like to stress that I do not care that much about licenses, it just seems to be a serious and show-stopping "bug" that there's this problem. As long as it's open source it doesn't matter to me, except that I would avoid GPL for library code, because it would cause exactly this kind of legal situation.)
Another way around it: Isolate GPLy code. If you just os.system('gpl-binary') then you also have no problem, when gpl-binary is GPL. Maybe worth checking, but it will be a pain to maintain properly.
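Technically, the isolation idea could look something like the sketch below, which talks to a separate program only through its command line, stdin and stdout. Note that `gpl-binary` is just the placeholder name from the comment above; to keep the sketch runnable, a child Python process stands in for it. Whether this kind of separation actually settles the licensing question is a legal matter the code cannot answer.

```python
import subprocess
import sys


def run_isolated(argv, stdin_text=""):
    """Run a separate program, feed it text on stdin and return its stdout.

    Raises subprocess.CalledProcessError if the program exits non-zero.
    """
    result = subprocess.run(
        argv, input=stdin_text, capture_output=True, text=True, check=True
    )
    return result.stdout


# Stand-in for the hypothetical 'gpl-binary': a child Python process
# that simply upper-cases whatever it reads from stdin.
stand_in = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
output = run_isolated(stand_in, "line text")
```

Compared with `os.system`, `subprocess.run` avoids the shell and makes it easier to capture output and detect failures, but the maintenance pain mentioned above remains: all data has to cross the process boundary as files or text.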
> > The situation with Calamari 2.x is such that I won't use it (and a future ocrd_calamari for 2.x) due to the unresolved licensing problems; I can't legally use it.
> Just to be as precise as possible here: you cannot use it as long as there are legal inconsistencies, or as soon as GPL kicks in?
Short answer: Because the license is invalid. If it were GPL, we (the developers of ocrd_calamari) could update, provided we move to GPL too.
Thanks for being so precise and sharing your concerns!
I suggest we try to convince @andbue and @chreul to go GPL with Calamari-OCR, and then proceed with the OCR-D wrapper for 2.x here.
Native PAGE-XML support (via dataset type for input and output) does help, but I'm not sure how we can guarantee OCR-D's incremental annotation principle: we must not throw away information, even if it's irrelevant to the OCR. Also complicating the matter is the fact that OCR-D requires using AlternativeImage on all hierarchy levels, adhering to @orientation, etc.
Can you please elaborate on the state of your migration (esp. around these issues) so far (or back when you were working on it)?
> Thanks for being so precise and sharing your concerns!
> I suggest we try to convince @andbue and @chreul to go GPL with Calamari-OCR, and then proceed with the OCR-D wrapper for 2.x here.
There seems to be another issue with that: https://github.com/Calamari-OCR/calamari/issues/3
> Can you please elaborate on the state of your migration (esp. around these issues) so far (or back when you were working on it)?
This is 100% blocked by these licensing issues; I will not work on it further until they are resolved.
> There seems to be another issue with that: Calamari-OCR/calamari#3
Like I said on that thread:
> I also don't think the licensing deviation from Ocropy is of concern. Calamari by being GPLed cannot in any way violate Apache'd old Ocropy.
So it's not another issue AFAICS.
> This is 100% blocked by these licensing issues; I will not work on it further until they are resolved.
I understood that much, but I would really like to know how far you got. (It would help in gauging the best way to proceed currently: bringing the TF SavedModel format to old Calamari versions vs. bringing OCR-D to 2.x next.)
Calamari 2.0 is out.
I don't see benefits from updating the dependency, other than staying up to date/compatible.