ocrd-tesserocr-crop: 22.5h processing time

jbarth-ubhd commented 6 months ago

Processing the image in OCR-D-IMG in https://digi.ub.uni-heidelberg.de/diglitData/v/valentini1714bd2_-_0000036v_aqv_Tabula_ROEM_0007.zip took about 22.5h @ Core i7-4790 CPU 3.60GHz — workflow see below & run-docker.sh:

docker-ocrd ocrd workspace init
docker-ocrd ocrd workspace add -g P_00001 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tiff OCR-D-IMG/00001.tif

docker-ocrd ocrd-olena-binarize -P impl wolf -P k 0.10 -I OCR-D-IMG -O OCR-D-001
docker-ocrd ocrd-tesserocr-crop -I OCR-D-001 -O OCR-D-002
docker-ocrd ocrd-olena-binarize -P impl wolf -P k 0.10 -I OCR-D-002 -O OCR-D-003
docker-ocrd ocrd-cis-ocropy-deskew -P level-of-operation page -I OCR-D-003 -O OCR-D-004
docker-ocrd ocrd-tesserocr-recognize -P find_tables true -P segmentation_level region -P textequiv_level word -P model frak2021 -I OCR-D-004 -O OCR-D-OCR

jb@xxx:~/valentini1714bd2/0000036v_aqv_Tabula_ROEM_0007> ls -1rtd OCR-D* | awk '{printf "echo %s:\nls -l %s\n",$1,$1}'|bash |grep -v insg
OCR-D-IMG:
-rwxrwx--- 1 jb jb 196697460 Apr  5 12:28 00001.tif
OCR-D-001:
-rw-r--r-- 1 root jb 3972601 Apr  5 12:56 OCR-D-001_00001-BIN_wolf.png
-rw-r--r-- 1 root jb    1113 Apr  5 12:56 OCR-D-001_00001.xml
OCR-D-002:
-rw-r--r-- 1 root jb 3876965 Apr  6 11:25 OCR-D-002_00001.IMG-CROP.png
-rw-r--r-- 1 root jb    2009 Apr  6 11:25 OCR-D-002_00001.xml
OCR-D-003:
-rw-r--r-- 1 root jb    2309 Apr  6 11:25 OCR-D-003_00001.xml
-rw-r--r-- 1 root jb 3972601 Apr  6 11:25 OCR-D-IMG_00001-BIN_wolf.png
OCR-D-004:
-rw-r--r-- 1 root jb 3876965 Apr  6 11:27 OCR-D-004_00001.IMG-DESKEW.png
-rw-r--r-- 1 root jb    3260 Apr  6 11:27 OCR-D-004_00001.xml
OCR-D-OCR:
-rw-r--r-- 1 root jb 6028333 Apr  6 11:36 OCR-D-OCR_00001.IMG-BIN.png
-rw-r--r-- 1 root jb  929386 Apr  6 11:36 OCR-D-OCR_00001.xml

bertsky commented 6 months ago

Thanks @jbarth-ubhd for the detailed report!

Well, this is an extreme case to begin with: a huge image (65 MP) containing drawings with lots of fine strokes. Tesseract itself has no cropping; we only emulate it (as the processor's description says) by trying to find text regions. And Tesseract is quite prone to hallucinating text in such line drawings (since it was written for contemporary documents). Since it also likes to draw a full-sized image region all over the canvas as soon as there is a visible page frame, one needs to use sparse text mode, which is usually faster but extremely slow in this case. The OCR-D wrapper cannot do much about that, I'm afraid (notice that in your log, the huge delay happens between calling the Tesseract API and starting to process its results).

Here is what the sparse text layout analysis result looks like on the raw image: [image: tesscli_sparsetext]

In my case, this took nearly 40h to compute. Clearly, most of these text regions are false positives.

Perhaps what one can do is downsample the image to a reasonable resolution (say 200 DPI). But then all follow-up calculations (coordinates, derived images) have to compensate for that. (I have done this in ocrd_detectron2 once before.)
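The compensation mentioned above amounts to dividing every coordinate found on the downsampled image by the zoom factor that was applied. A minimal sketch in plain Python, assuming regions are lists of (x, y) pixel tuples; the name `upscale_polygon` is hypothetical and not part of OCR-D core or ocrd_detectron2:

```python
def upscale_polygon(points, zoom):
    """Map polygon points from a downsampled image back to the original.

    `zoom` is the factor applied when downsampling (e.g. 0.25 for a
    25% scale), so each coordinate is divided by it and rounded to
    the nearest pixel.
    """
    if not 0 < zoom <= 1:
        raise ValueError("zoom must be in (0, 1]")
    return [(round(x / zoom), round(y / zoom)) for x, y in points]


# A region detected at 25% scale, expressed in original pixel coordinates
region_small = [(10, 20), (110, 20), (110, 70), (10, 70)]
region_full = upscale_polygon(region_small, 0.25)
print(region_full)  # [(40, 80), (440, 80), (440, 280), (40, 280)]
```

Derived images (crops, binarized versions) would need the same bookkeeping, which is why this is more than a one-line change in a processor.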

bertsky commented 6 months ago

BTW, running on the binarized image (as in your workflow) takes even longer (77h), because the wolf binarization cannot cope with the black border (which it inverts), so even more false positives are found: [image: tesscli_bin_sparsetext]

So, as a rule: if the image might still have black borders when you binarize, do not use wolf; use sauvola or sbb instead.

Downsampling by 4 (convert -scale 25%) to 195 DPI does help: processing time is cut to just a few seconds each, and the results are equally (un)usable: [image: tesscli25%_sparsetext]

[image: tesscli25%_bin_sparsetext]
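For reference, the arithmetic behind that factor can be sketched as follows. This assumes a source resolution of 780 DPI (inferred from "by 4 to 195 DPI", not stated explicitly) and the rough 65 MP pixel count; `downsample_factor` is a hypothetical helper, not an existing OCR-D function:

```python
import math


def downsample_factor(source_dpi, target_dpi):
    """Smallest integer factor that brings source_dpi to at most target_dpi."""
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("DPI values must be positive")
    return max(1, math.ceil(source_dpi / target_dpi))


# Assumed 780 DPI source, aiming for roughly 200 DPI
factor = downsample_factor(780, 200)
print(factor)                      # 4
print(780 / factor)                # 195.0 (resulting DPI)
print(65e6 / factor ** 2 / 1e6)    # 4.0625 (megapixels left of ~65 MP)
```

The pixel count shrinks with the square of the factor, so the layout analysis has about 1/16 of the original pixels to work through.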

So, don't use ocrd-tesserocr-crop on material which has next to no text (use ocrd-anybaseocr-crop or eynollah instead).

Also, don't run it on overly large images. Downsample before importing, as we cannot expect processors to do that themselves for now.

Perhaps we should open an issue in core for the general scenario of early downsampling (as a derived image) and then re-using that image instead of the original (with adapted coordinate system), which will in turn depend on PAGE being extended with AlternativeImage scale attributes, though.

bertsky commented 6 months ago

Perhaps we should open an issue in core for the general scenario of early downsampling (as a derived image) and then re-using that image instead of the original (with adapted coordinate system), which will in turn depend on PAGE being extended with AlternativeImage scale attributes, though.

@kba what's your opinion on this?

OCR-D / ocrd_tesserocr #206