jbarth-ubhd opened 1 year ago
Hard to tell from a diff tool that I don't know and data I cannot see. Looks like in `ocrd-tesserocr` two lines are duplicated.
Binarization will have an impact, yes – both on segmentation and recognition. (For recognition, we don't currently pass the raw images, because we don't know what the model "wants". The only way to ensure recognition on raw images is either to have no binarization in the workflow at all, or to remove the respective annotation in the fileGrp that is used as input for recognition, for example via `ocrd-page-transform -P xsl page-remove-alternativeimages -P xslt-params "-s which=binarized"`.)
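If you don't have the XSLT at hand, the same cleanup can be sketched in plain Python. This is a hypothetical helper, not part of any OCR-D module; the PAGE namespace version is an assumption (adjust it to your files), and matching `which` against the `comments` attribute mirrors the `-s which=binarized` parameter above:

```python
import xml.etree.ElementTree as ET

# PAGE-XML namespace (version assumed; adjust to the one your files declare)
NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

def remove_alternative_images(page_xml: str, which: str = "binarized") -> str:
    """Drop every AlternativeImage whose @comments mentions `which`,
    so that recognition falls back to the raw image."""
    root = ET.fromstring(page_xml)
    # AlternativeImage can occur on Page as well as on regions/lines/words
    for parent in root.iter():
        for child in list(parent):
            if (child.tag == f"{{{NS}}}AlternativeImage"
                    and which in child.get("comments", "")):
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")
```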
Mind that `ocrd-tesserocr-segment` plus `ocrd-tesserocr-recognize` is not recommended, as it needlessly throws away internal information of the Tesseract layout analysis. (You can do segmentation and recognition in one pass with `ocrd-tesserocr-recognize`.)
Standalone Tesseract is another beast entirely. It always uses the raw image for recognition. Also, you can now choose some new adaptive binarization via `-c thresholding_method=1` (or `2` for Sauvola), which comes with its own parameters (`thresholding_window_size`, `thresholding_kfactor`, `thresholding_tile_size`, `thresholding_score_fraction`).
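To see what a window size and k factor actually control, here is a minimal NumPy sketch of Sauvola's formula. The function name, defaults, and implementation are illustrative only (not Tesseract's internals, which operate on tiles and use faster integral-image statistics):

```python
import numpy as np

def sauvola_threshold(img: np.ndarray, window: int = 25,
                      k: float = 0.35, R: float = 128.0) -> np.ndarray:
    """Per-pixel Sauvola threshold T = m * (1 + k * (s / R - 1)),
    where m and s are the mean and standard deviation inside a
    window x window neighborhood. A pixel counts as background
    (paper) if its gray value exceeds T."""
    pad = window // 2
    padded = np.pad(img.astype(float), pad, mode="reflect")
    h, w = img.shape
    T = np.empty((h, w))
    # naive sliding window for clarity; real implementations use integral images
    for y in range(h):
        for x in range(w):
            win = padded[y:y + window, x:x + window]
            T[y, x] = win.mean() * (1 + k * (win.std() / R - 1))
    return T
```

Intuitively, a larger window smooths the local statistics (fewer speckles, but thin strokes may get lost), while a larger k lowers the threshold in low-contrast regions, classifying more pixels as background.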
Ok,

```
ocrd-tesserocr-recognize -I OCR-D-IMG -O OCR-D-OCR6 -P segmentation_level region -P textequiv_level word -P find_tables true -P overwrite_segments true -P model Fraktur_GT4HistOCR
```

gives exactly the same results as tesseract 5.3.0 (for this example).
Ok, and for completeness, `tesserocr-segment` + calamari + qurator-gt4histocr 1.0:
thanks @jbarth-ubhd for checking thoroughly!
BTW, if you want to try any of the better Calamari 2 models here and there (probably also here), you currently have to switch to Calamari 2 on the standalone CLI. (In an OCR-D workflow, this can be integrated by first exporting line images with `ocrd-segment-extract-lines -I SEG -O LINES -P output-types '["text"]'`, then predicting with `calamari-predict --checkpoint path/to/best.ckpt.json --data.pred_extension .pred.txt --data.images "LINES/*.png"`, and finally importing the text back via `ocrd-segment-replace-text -I SEG -O OCR -P file_glob "LINES/*.pred.txt"`.)
Step 2 does not work (all pip modules installed without any conflict):

```
> calamari-predict --checkpoint /home/jb/calamari-models-v2/gt4histocr/*.ckpt* --data.images OCR-D-LINES/*.png 2>&1 | egrep -vi 'libnv|cuda|Nvidia' | fold -s -w 110
2023-03-02 10:39:28.644398: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO 2023-03-02 10:39:30,011 tfaip.device.device_config: Setting up device config DeviceConfigParams(gpus=None, gpu_auto_tune=False, gpu_memory=None, soft_device_placement=True, dist_strategy=<DistributionStrategy.DEFAULT: 'default'>)
INFO 2023-03-02 10:39:30,011 calamari_ocr.ocr.savedmodel.sa: Checkpoint version 5 is up-to-date.
INFO 2023-03-02 10:39:30,021 calamari_ocr.ocr.savedmodel.sa: Checkpoint version 5 is up-to-date.
INFO 2023-03-02 10:39:30,025 calamari_ocr.ocr.savedmodel.sa: Checkpoint version 5 is up-to-date.
INFO 2023-03-02 10:39:30,028 calamari_ocr.ocr.savedmodel.sa: Checkpoint version 5 is up-to-date.
INFO 2023-03-02 10:39:30,032 calamari_ocr.ocr.savedmodel.sa: Checkpoint version 5 is up-to-date.
INFO 2023-03-02 10:39:30,054 tfaip.device.device_config: Setting up device config DeviceConfigParams(gpus=None, gpu_auto_tune=False, gpu_memory=None, soft_device_placement=True, dist_strategy=<DistributionStrategy.DEFAULT: 'default'>)
CRITICAL 2023-03-02 10:39:30,061 tfaip.util.logging: Uncaught exception
Traceback (most recent call last):
  File "/home/jb/.local/bin/calamari-predict", line 8, in <module>
    sys.exit(main())
  File "/home/jb/.local/lib/python3.8/site-packages/calamari_ocr/scripts/predict.py", line 191, in main
    run(args.root)
  File "/home/jb/.local/lib/python3.8/site-packages/calamari_ocr/scripts/predict.py", line 119, in run
    predictor = MultiPredictor.from_paths(
  File "/home/jb/.local/lib/python3.8/site-packages/calamari_ocr/ocr/predict/predictor.py", line 53, in from_paths
    multi_predictor = super(MultiPredictor, cls).from_paths(
  File "/home/jb/.local/lib/python3.8/site-packages/tfaip/predict/multimodelpredictor.py", line 107, in from_paths
    models = [
  File "/home/jb/.local/lib/python3.8/site-packages/tfaip/predict/multimodelpredictor.py", line 108, in <listcomp>
    keras.models.load_model(model, compile=False, custom_objects=scenario.model_cls().all_custom_objects())
  File "/home/jb/.local/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/jb/.local/lib/python3.8/site-packages/keras/utils/generic_utils.py", line 103, in func_load
    code = marshal.loads(raw_code)
ValueError: bad marshal data (unknown type code)
```
BTW, the md5sums of all the `*.json` files are identical?

```
jb@pers16:~/calamari-models-v2/gt4histocr> md5sum *
dbcf154171b4d98eea43eebfeb808d2f  0.ckpt.h5
c7c6af20930f84bf2f677e8713f25752  0.ckpt.json
95cc6d142e33f7ac1a3eb44413a71d03  1.ckpt.h5
c7c6af20930f84bf2f677e8713f25752  1.ckpt.json
ed09d330c603958e3c89a0b46218420c  2.ckpt.h5
c7c6af20930f84bf2f677e8713f25752  2.ckpt.json
ec1c9457824c1679e1b4cc2d49343b43  3.ckpt.h5
c7c6af20930f84bf2f677e8713f25752  3.ckpt.json
8c7b560f08625b3f01974199f8a5921a  4.ckpt.h5
c7c6af20930f84bf2f677e8713f25752  4.ckpt.json
```
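Identical `*.ckpt.json` hashes are not necessarily alarming: presumably the JSON holds the network/processing configuration shared by the five cross-fold voters, while the per-fold weights live in the `.h5` files (an assumption about the Calamari checkpoint layout, not verified). To double-check which files really differ, the `md5sum` run above can be reproduced in Python:

```python
import hashlib
from pathlib import Path

def md5_of(path: Path) -> str:
    """Hex MD5 digest of a file, equivalent to one md5sum column above."""
    return hashlib.md5(path.read_bytes()).hexdigest()
```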
But the error message `ValueError: bad marshal data (unknown type code)` is the same for deep3...
`tesseract 5.3.0 -l Fraktur_GT4HistOCR` on the manually cropped https://digi.ub.uni-heidelberg.de/diglitData/v/04-manual-crop.tif, but with perspective not corrected (very minimal perspective distortion).
Note that there are more errors than in the first OCR comparison image at https://github.com/OCR-D/ocrd_all/issues/346#issue-1605237072, although the base image is almost the same.
> But the error message `ValueError: bad marshal data (unknown type code)` is the same for deep3...
Ouch. With Python >= 3.8 we are now heavily hit by https://github.com/Calamari-OCR/calamari/issues/78. The solution is to convert the models from HDF5 to SavedModel format, but you need a Python+TF version under which the model still loads in the first place. As a workaround, you can try Python 3.7 or 3.6.
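For context on why the error reads "bad marshal data": Keras's `func_load` (visible at the bottom of the traceback) deserializes Python bytecode that was `marshal`-dumped into the model file, and marshal output is only guaranteed readable by the same Python version that wrote it. A minimal round-trip within a single interpreter shows the mechanism:

```python
import marshal
import types

def plus_one(x):
    return x + 1

# what effectively gets stored in the model: raw marshaled bytecode
raw = marshal.dumps(plus_one.__code__)

# loading only works under the same Python version that produced `raw`;
# bytes written by e.g. Python 3.6 raise under 3.8:
#   ValueError: bad marshal data (unknown type code)
code = marshal.loads(raw)
restored = types.FunctionType(code, globals())
print(restored(41))  # -> 42
```

This is why converting the model under a Python+TF combination that can still load it is the only real fix.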
> on manually cropped [...] but perspective not corrected (very minimal perspective distortion)
>
> Note that there are more errors than in the first OCR comparison image here
Hard to tell. Tesseract layout analysis (LA) is very buggy (I would even say fragile), and the legacy code has not been touched (maintained) for years...
> Quote from Stefan Weil: »We have clear evidence that it is extremely important to have line images for recognition which are similar to those used for training.«
Yes, that's obviously true. But as a user you have no way of knowing what the model expects (raw or binarized, and which kind of binarization). There's no model metadata in Tesseract. (And in Calamari, it could be stored in the model metadata, but the trainer does not do that.)
The model's publisher (in this case, @stweil) must document what the model was trained on (both what kind of material and in what digital form). The Tesseract models from Mannheim are usually documented on the tesstrain wiki. Their Kraken models, however, point to the wiki pages of the respective GT repos.
I'm just wondering a bit about different recognition results using tesseract 5.3.0 versus OCR-D with `ocrd-olena-binarize && ocrd-tesserocr-segment`.

Original TIF: https://digi.ub.uni-heidelberg.de/diglitData/v/heidelberg1592_-_04manual.tif

Result using `tesseract 5.3.0 -l Fraktur_GT4Hist...` (right column = ground truth): [image]

Result using tesserocr-segment and calamari-recognize (fraktur_historical 1.0) with OCR-D: [image]

Result using tesserocr-segment and tesserocr-recognize (Fraktur_GT4Hist...) with OCR-D: [image]

It seems that the OCR-D "tesserocr" segmentation is somewhat different from the OCR-D segmentation (perhaps because of olena-binarize?), but I can't find any big change to line/region segmentation etc. in the Tesseract changelog over the last year.