OCR-D / ocrd_all

Master repository which includes most other OCR-D repositories as submodules
MIT License

segmentation: tesseract5.3.0 vs ocrd/all:2022-08-15 #346

Open jbarth-ubhd opened 1 year ago

jbarth-ubhd commented 1 year ago

I'm just wondering a bit about different recognition results using tesseract5.3.0 and OCR-D with ocrd-olena-binarize && ocrd-tesserocr-segment.

Original TIF: https://digi.ub.uni-heidelberg.de/diglitData/v/heidelberg1592_-_04manual.tif

Result using tesseract5.3.0 -l Fraktur_GT4Hist... (right column = ground truth): [screenshot]

and using tesserocr-segment and calamari-recognize (fraktur_historical1.0) with OCR-D: [screenshot]

and using tesserocr-segment and tesserocr-recognize (Fraktur_GT4Hist...) with OCR-D: [screenshot]

It seems that the OCR-D tesserocr segmentation is somewhat different from standalone tesseract5.3.0 segmentation (perhaps because of olena-binarize?), but I can't find any big change to line/region segmentation etc. in the Tesseract changelog over the last year.

bertsky commented 1 year ago

Hard to tell from a diff tool I don't know and data I cannot see. It looks like two lines are duplicated in the ocrd-tesserocr output.

Binarization will have an impact, yes – both on segmentation and recognition. (For recognition, we don't currently pass the raw images, because we don't know what the model "wants". The only way to ensure recognition on raw images is either to have no binarization in the workflow at all, or to remove the respective annotation in the fileGrp that is used as input for recognition, for example via ocrd-page-transform -P xsl page-remove-alternativeimages -P xslt-params "-s which=binarized".)
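As an illustration of what that removal step does (not the actual ocrd-page-transform implementation), here is a minimal stdlib sketch that strips AlternativeImage entries marked as binarized from a PAGE-XML document. The namespace URI follows the PAGE 2019 schema; the demo fragment and file names are made up.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

def remove_binarized_images(xml_string):
    """Drop every AlternativeImage whose @comments mentions 'binarized'."""
    root = ET.fromstring(xml_string)
    for parent in root.iter():
        for child in list(parent):
            if (child.tag == f"{{{PAGE_NS}}}AlternativeImage"
                    and "binarized" in child.get("comments", "")):
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")

# tiny synthetic PAGE fragment for demonstration (invented file names)
demo = f"""<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="p1.tif">
    <AlternativeImage filename="p1.bin.png" comments="binarized"/>
    <AlternativeImage filename="p1.crop.png" comments="cropped"/>
  </Page>
</PcGts>"""

cleaned = remove_binarized_images(demo)
```

Other derived images (here the cropped one) survive; only the binarized annotation is removed, so a downstream recognizer falls back to the remaining image.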

Mind that ocrd-tesserocr-segment plus ocrd-tesserocr-recognize is not recommended as it needlessly throws away internal information of the Tesseract layout analysis. (You can do segmentation and recognition in one pass with ocrd-tesserocr-recognize.)

Standalone Tesseract is another beast entirely. It always uses the raw image for recognition. Also, you can now choose a new adaptive binarization via -c thresholding_method=1 (or 2 for Sauvola). It also comes with its own parameters (thresholding_window_size, thresholding_kfactor, thresholding_tile_size, thresholding_score_fraction).
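For reference, a standalone invocation using the Sauvola variant mentioned above might look like this (input.tif and out are placeholder names; the k factor shown is just an example value, not a recommendation):

```shell
# Sauvola adaptive binarization (thresholding_method=2) on the standalone CLI;
# thresholding_kfactor tunes the Sauvola k parameter
tesseract input.tif out \
  -c thresholding_method=2 \
  -c thresholding_kfactor=0.34 \
  -l Fraktur_GT4HistOCR
```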

jbarth-ubhd commented 1 year ago

OK, ocrd-tesserocr-recognize -I OCR-D-IMG -O OCR-D-OCR6 -P segmentation_level region -P textequiv_level word -P find_tables true -P overwrite_segments true -P model Fraktur_GT4HistOCR gives exactly the same results as tesseract5.3.0 (for this example).

jbarth-ubhd commented 1 year ago

OK, and for completeness, tesserocr-segment + calamari + qurator-gt4histocr1.0:

[screenshot]

bertsky commented 1 year ago

thanks @jbarth-ubhd for checking thoroughly!

BTW, if you want to try any of the better Calamari 2 models here and there (probably also here), you currently have to switch to Calamari 2 on the standalone CLI. (In an OCR-D workflow, this can be integrated by first exporting line images with ocrd-segment-extract-lines -I SEG -O LINES -P output-types '["text"]', then predicting with calamari-predict --checkpoint path/to/best.ckpt.json --data.pred_extension .pred.txt --data.images "LINES/*.png", and finally importing the text back via ocrd-segment-replace-text -I SEG -O OCR -P file_glob "LINES/*.pred.txt".)
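Put together, that detour might look like the following small script. The commands are the ones quoted above; the fileGrp names (SEG, LINES, OCR) and the checkpoint path are placeholders to be adapted to the actual workspace.

```shell
# 1. export line images (and text) from the segmentation fileGrp
ocrd-segment-extract-lines -I SEG -O LINES -P output-types '["text"]'

# 2. predict with a standalone Calamari 2 checkpoint (placeholder path)
calamari-predict --checkpoint path/to/best.ckpt.json \
    --data.pred_extension .pred.txt \
    --data.images "LINES/*.png"

# 3. re-import the predicted text into the PAGE annotation
ocrd-segment-replace-text -I SEG -O OCR -P file_glob "LINES/*.pred.txt"
```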

jbarth-ubhd commented 1 year ago

Step 2 does not work (all pip modules installed without any conflict):

> calamari-predict --checkpoint /home/jb/calamari-models-v2/gt4histocr/*.ckpt* --data.images OCR-D-LINES/*.png 
2>&1|egrep -vi 'libnv|cuda|Nvidia'|fold -s -w 110

2023-03-02 10:39:28.644398: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is 
optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in 
performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO     2023-03-02 10:39:30,011     tfaip.device.device_config: Setting up device config 
DeviceConfigParams(gpus=None, gpu_auto_tune=False, gpu_memory=None, soft_device_placement=True, 
dist_strategy=<DistributionStrategy.DEFAULT: 'default'>)
INFO     2023-03-02 10:39:30,011 calamari_ocr.ocr.savedmodel.sa: Checkpoint version 5 is up-to-date.
INFO     2023-03-02 10:39:30,021 calamari_ocr.ocr.savedmodel.sa: Checkpoint version 5 is up-to-date.
INFO     2023-03-02 10:39:30,025 calamari_ocr.ocr.savedmodel.sa: Checkpoint version 5 is up-to-date.
INFO     2023-03-02 10:39:30,028 calamari_ocr.ocr.savedmodel.sa: Checkpoint version 5 is up-to-date.
INFO     2023-03-02 10:39:30,032 calamari_ocr.ocr.savedmodel.sa: Checkpoint version 5 is up-to-date.
INFO     2023-03-02 10:39:30,054     tfaip.device.device_config: Setting up device config 
DeviceConfigParams(gpus=None, gpu_auto_tune=False, gpu_memory=None, soft_device_placement=True, 
dist_strategy=<DistributionStrategy.DEFAULT: 'default'>)
CRITICAL 2023-03-02 10:39:30,061             tfaip.util.logging: Uncaught exception
Traceback (most recent call last):
  File "/home/jb/.local/bin/calamari-predict", line 8, in <module>
    sys.exit(main())
  File "/home/jb/.local/lib/python3.8/site-packages/calamari_ocr/scripts/predict.py", line 191, in main
    run(args.root)
  File "/home/jb/.local/lib/python3.8/site-packages/calamari_ocr/scripts/predict.py", line 119, in run
    predictor = MultiPredictor.from_paths(
  File "/home/jb/.local/lib/python3.8/site-packages/calamari_ocr/ocr/predict/predictor.py", line 53, in 
from_paths
    multi_predictor = super(MultiPredictor, cls).from_paths(
  File "/home/jb/.local/lib/python3.8/site-packages/tfaip/predict/multimodelpredictor.py", line 107, in 
from_paths
    models = [
  File "/home/jb/.local/lib/python3.8/site-packages/tfaip/predict/multimodelpredictor.py", line 108, in 
<listcomp>
    keras.models.load_model(model, compile=False, custom_objects=scenario.model_cls().all_custom_objects())
  File "/home/jb/.local/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/jb/.local/lib/python3.8/site-packages/keras/utils/generic_utils.py", line 103, in func_load
    code = marshal.loads(raw_code)
ValueError: bad marshal data (unknown type code)

BTW, the md5sums of all *.json files are the same?

jb@pers16:~/calamari-models-v2/gt4histocr> md5sum *
dbcf154171b4d98eea43eebfeb808d2f  0.ckpt.h5
c7c6af20930f84bf2f677e8713f25752  0.ckpt.json
95cc6d142e33f7ac1a3eb44413a71d03  1.ckpt.h5
c7c6af20930f84bf2f677e8713f25752  1.ckpt.json
ed09d330c603958e3c89a0b46218420c  2.ckpt.h5
c7c6af20930f84bf2f677e8713f25752  2.ckpt.json
ec1c9457824c1679e1b4cc2d49343b43  3.ckpt.h5
c7c6af20930f84bf2f677e8713f25752  3.ckpt.json
8c7b560f08625b3f01974199f8a5921a  4.ckpt.h5
c7c6af20930f84bf2f677e8713f25752  4.ckpt.json
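Grouping files by checksum, as md5sum does above, can be sketched with the standard library alone. This is a generic duplicate-finder, not Calamari-specific; the demo writes throwaway files rather than touching the actual checkpoints.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path

def group_by_md5(paths):
    """Map md5 digest -> list of files with that digest."""
    groups = defaultdict(list)
    for p in paths:
        digest = hashlib.md5(Path(p).read_bytes()).hexdigest()
        groups[digest].append(str(p))
    return dict(groups)

# demo with throwaway files: the two identical .json files share a digest
tmp = Path(tempfile.mkdtemp())
for name, content in [("0.ckpt.json", b"same"),
                      ("1.ckpt.json", b"same"),
                      ("0.ckpt.h5", b"other")]:
    (tmp / name).write_bytes(content)

groups = group_by_md5(sorted(tmp.iterdir()))
dupes = [files for files in groups.values() if len(files) > 1]
```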

But the error message ValueError: bad marshal data (unknown type code) is the same for deep3...

jbarth-ubhd commented 1 year ago

tesseract5.3.0 -l Fraktur_GT4HistOCR, run on the manually cropped https://digi.ub.uni-heidelberg.de/diglitData/v/04-manual-crop.tif (perspective not corrected, but the perspective distortion is very minimal):

Note that there are more errors here than in the first OCR comparison image at https://github.com/OCR-D/ocrd_all/issues/346#issue-1605237072, even though the base image is almost the same.

[screenshot]

bertsky commented 1 year ago

> But the error message ValueError: bad marshal data (unknown type code) is the same for deep3...

Ouch. With Python >= 3.8 we are now heavily hit by https://github.com/Calamari-OCR/calamari/issues/78. The solution is to convert the models from HDF5 to the SavedModel format, but for that you need a Python+TF version where the model still loads in the first place. As a workaround, you can try Python 3.7 or 3.6.
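The error itself comes from Python's marshal module, which Keras uses here to deserialize lambda bytecode; marshal data is only valid for the Python version that wrote it, so a stream from another version (or any corrupt stream) fails exactly like in the traceback above. A minimal stdlib demonstration:

```python
import marshal

# marshal round-trips code objects, but only within the same Python version
code = compile("x = 1 + 1", "<demo>", "exec")
assert marshal.loads(marshal.dumps(code)) is not None

# an unknown type code in the stream raises the same ValueError
# seen in the Calamari traceback above
try:
    marshal.loads(b"\xfe\x00\x00\x00")
    raised = False
    message = ""
except ValueError as e:
    raised = True
    message = str(e)
```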

bertsky commented 1 year ago

> on manually cropped [...] but perspective not corrected (very minimal perspective distortion)

> Note that there are more errors than in the first ocr-comparison-Image here

Hard to tell. Tesseract layout analysis is very buggy (I would even say fragile), and the legacy code has not been touched (maintained) for years...

jbarth-ubhd commented 1 year ago

Quote from Stefan Weil: »We have clear evidence that it is extremely important to have line images for recognition which are similar to those used for training.«

bertsky commented 1 year ago

> ok ocrd-tesserocr-recognize -I OCR-D-IMG -O OCR-D-OCR6 -P segmentation_level region -P textequiv_level word -P find_tables true -P overwrite_segments true -P model Fraktur_GT4HistOCR gives exactly the same results as tesseract5.3.0 (for this example).

> quote from Stefan Weil: »We have clear evidence that it is extremely important to have line images for recognition which are similar to those used for training.«

Yes, that's obviously true. But as a user you have no way of knowing what the model expects (raw or bin, what kind of bin). There's no model metadata in Tesseract. (And in Calamari, it could be stored in the model metadata, but the trainer does not do that.)

The model's publisher (in this case, @stweil) must document what the model was trained on (both what kind of material and in what digital form). The Tesseract models from Mannheim are usually documented on the tesstrain wiki. Their Kraken models, however, point to the wiki pages of the respective GT repos.