Status: Open · opened by GerHobbelt 3 months ago
PLUS assign them a numeric index for referencing from the PROCESS texts. (Think of the two red mask images we currently produce for pass 1 and pass 2 recognition: it would be very nice to reference them from the generated text/HTML for easier perusal.)
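A minimal sketch of what such an index could look like: a registry that hands out a stable numeric index for every debug image written during a run, so the generated text/HTML can say "image #2" and link to it. All names (`DebugImageRegistry`, `Register`, `RefFor`) are hypothetical, not existing Tesseract API.

```cpp
#include <string>
#include <vector>

// Hypothetical registry: every debug image written during a run gets a
// stable 1-based numeric index, usable as a cross-reference from the
// generated text/HTML report. Names are illustrative only.
class DebugImageRegistry {
 public:
  // Register an image path; returns its 1-based index.
  int Register(const std::string& path) {
    paths_.push_back(path);
    return static_cast<int>(paths_.size());
  }

  const std::string& PathFor(int index) const { return paths_.at(index - 1); }

  // Markdown-style reference that the generated report could embed.
  std::string RefFor(int index) const {
    return "[image #" + std::to_string(index) + "](" + PathFor(index) + ")";
  }

 private:
  std::vector<std::string> paths_;
};
```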
Example of > 8GByte consumption:
```
# 8+ GByte usage!
--loglevel ALL -l eng --psm 11 --oem 3 --tessdata-dir E:\ocr/tessdata_best -c debug_file=tessdata_best-PSM11-OEM3-TH0-USIZE-2700x13366-debug.log -c thresholding_method=0 -c document_title=tessdata_best-PSM11-OEM3-TH0-1001-000-0003-b-leveled E:\ocr\sizes/D_RUN_data-1001-000-0003-b-leveled/DERIVSRC-1001-000-0003-b-leveled-USIZE-2700x13366.webp tessdata_best-PSM11-OEM3-TH0-USIZE-2700x13366 hocr txt tsv wordstrbox E:\ocr\sizes/tess_run_01_D_RUN.conf
```
from https://github.com/GerHobbelt/tesseract-bulk-testing-fun-with-assertion-failures
PLUS do this in a thread pool, i.e. take it off the critical execution path, as it turns out producing WebP takes a heck of a lot of time. Great, tiny, lossless images, but boy, does encoding them guzzle CPU cycles!
PNG is produced faster, as are 99%-quality JPEGs, but all of these formats require nontrivial amounts of CPU time, which can be off-loaded to other cores, at least when processing single runs.
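The off-loading idea above can be sketched with `std::async`: the recognizer thread hands the pixel buffer to a worker and continues immediately, and the outstanding writes are drained once at end of run. `EncodeAndWrite` and `QueueImageWrite` are hypothetical stand-ins for the real encode/write call (e.g. Leptonica's WebP/PNG writers), not existing API.

```cpp
#include <future>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the actual image encode + write-to-disk step;
// real code would compress `pixels` and write the result to `path`.
inline size_t EncodeAndWrite(std::string path,
                             std::vector<unsigned char> pixels) {
  (void)path;            // no actual I/O in this sketch
  return pixels.size();  // bytes handled, as a stand-in result
}

// Fire-and-forget from the recognizer's point of view: the encode runs on
// another thread; the caller collects the futures after the run, so the
// expensive compression never sits on the critical path.
inline std::future<size_t> QueueImageWrite(std::string path,
                                           std::vector<unsigned char> pixels) {
  return std::async(std::launch::async, EncodeAndWrite,
                    std::move(path), std::move(pixels));
}
```

A real implementation would use a bounded pool rather than unbounded `std::async` tasks, to cap memory held by queued pixel buffers.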
For tesseract batch processing we lose either way, so there we need to find a tolerable optimum output format (compression time vs. consumed disk space). Compressing afterwards is an alternative we may also want to consider, i.e. writing the images fast and completely offloading the minification to external means.
PLUS measure their construction + write-to-disk timing for the performance-report statistics; preferably as identifiable stats, i.e. which image cost how much?
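The per-image timing could be as simple as wrapping each write in a `std::chrono` timer keyed by the image path, so the report can name the expensive images individually. `ImageWriteStats` and `TimedWrite` are hypothetical names for this sketch.

```cpp
#include <chrono>
#include <map>
#include <string>

// Hypothetical per-image timing collector: records how many milliseconds
// each debug image cost to construct and write, keyed by path, so the
// performance report can answer "which image cost how much?".
struct ImageWriteStats {
  std::map<std::string, double> ms_per_image;  // path -> milliseconds

  template <typename WriteFn>
  void TimedWrite(const std::string& path, WriteFn&& write) {
    auto t0 = std::chrono::steady_clock::now();
    write();  // construction + write-to-disk of this one image
    auto t1 = std::chrono::steady_clock::now();
    ms_per_image[path] =
        std::chrono::duration<double, std::milli>(t1 - t0).count();
  }
};
```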