Status: Open · opened by GerHobbelt 3 months ago
PLUS assign them a numeric index for referencing from the PROCESS texts. (Think of the two red mask images we currently produce for pass 1 and pass 2 recognition: it would be very nice to reference them from the generated text/HTML for easier perusal.)
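A minimal sketch of what such an index could look like: a registry that hands out a stable numeric index for every debug image written during a run, so the generated text/HTML can say "image #2" and link to it. All names (`DebugImageRegistry`, `Register`, `RefFor`) are hypothetical, not existing Tesseract API.

```cpp
#include <string>
#include <vector>

// Hypothetical registry: every debug image written during a run gets a
// stable 1-based numeric index, usable as a cross-reference from the
// generated text/HTML report. Names are illustrative only.
class DebugImageRegistry {
 public:
  // Register an image path; returns its 1-based index.
  int Register(const std::string& path) {
    paths_.push_back(path);
    return static_cast<int>(paths_.size());
  }

  const std::string& PathFor(int index) const { return paths_.at(index - 1); }

  // Markdown-style reference that the generated report could embed.
  std::string RefFor(int index) const {
    return "[image #" + std::to_string(index) + "](" + PathFor(index) + ")";
  }

 private:
  std::vector<std::string> paths_;
};
```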
Example of > 8GByte consumption:
```
# 8+ GByte usage!
--loglevel ALL -l eng --psm 11 --oem 3 --tessdata-dir E:\ocr/tessdata_best -c debug_file=tessdata_best-PSM11-OEM3-TH0-USIZE-2700x13366-debug.log -c thresholding_method=0 -c document_title=tessdata_best-PSM11-OEM3-TH0-1001-000-0003-b-leveled E:\ocr\sizes/D_RUN_data-1001-000-0003-b-leveled/DERIVSRC-1001-000-0003-b-leveled-USIZE-2700x13366.webp tessdata_best-PSM11-OEM3-TH0-USIZE-2700x13366 hocr txt tsv wordstrbox E:\ocr\sizes/tess_run_01_D_RUN.conf
```
from https://github.com/GerHobbelt/tesseract-bulk-testing-fun-with-assertion-failures
PLUS do this in a thread pool, i.e. take it off the critical execution path, as it turns out producing WebP takes a heck of a lot of time. Great, tiny, lossless images, but boy, does encoding them guzzle CPU cycles!
PNG is produced faster, as are 99%-quality JPEGs, but all of these formats require nontrivial amounts of CPU time, which can be off-loaded to other cores, at least when processing single runs.
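The off-loading idea above can be sketched with `std::async`: the recognizer thread hands the pixel buffer to a worker and continues immediately, and the outstanding writes are drained once at end of run. `EncodeAndWrite` and `QueueImageWrite` are hypothetical stand-ins for the real encode/write call (e.g. Leptonica's WebP/PNG writers), not existing API.

```cpp
#include <future>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the actual image encode + write-to-disk step;
// real code would compress `pixels` and write the result to `path`.
inline size_t EncodeAndWrite(std::string path,
                             std::vector<unsigned char> pixels) {
  (void)path;            // no actual I/O in this sketch
  return pixels.size();  // bytes handled, as a stand-in result
}

// Fire-and-forget from the recognizer's point of view: the encode runs on
// another thread; the caller collects the futures after the run, so the
// expensive compression never sits on the critical path.
inline std::future<size_t> QueueImageWrite(std::string path,
                                           std::vector<unsigned char> pixels) {
  return std::async(std::launch::async, EncodeAndWrite,
                    std::move(path), std::move(pixels));
}
```

A real implementation would use a bounded pool rather than unbounded `std::async` tasks, to cap memory held by queued pixel buffers.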
For tesseract batch processing we lose either way, so there we need to find a tolerable optimum output format (compression time vs. consumed disk space). Compressing afterwards is an alternative we may also want to consider, i.e. writing the images fast and completely offloading the minification to external means.
PLUS measure their construction + write-to-disk timing for the performance-report statistics; preferably as identifiable stats, i.e. which image cost how much?
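The per-image timing could be as simple as wrapping each write in a `std::chrono` timer keyed by the image path, so the report can name the expensive images individually. `ImageWriteStats` and `TimedWrite` are hypothetical names for this sketch.

```cpp
#include <chrono>
#include <map>
#include <string>

// Hypothetical per-image timing collector: records how many milliseconds
// each debug image cost to construct and write, keyed by path, so the
// performance report can answer "which image cost how much?".
struct ImageWriteStats {
  std::map<std::string, double> ms_per_image;  // path -> milliseconds

  template <typename WriteFn>
  void TimedWrite(const std::string& path, WriteFn&& write) {
    auto t0 = std::chrono::steady_clock::now();
    write();  // construction + write-to-disk of this one image
    auto t1 = std::chrono::steady_clock::now();
    ms_per_image[path] =
        std::chrono::duration<double, std::milli>(t1 - t0).count();
  }
};
```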