Open chenziliang0725 opened 6 months ago
There could be ways to do this, at least for tesseract. I'll take a look.
Currently the code runs the Tesseract executable with a list of page images. Tesseract then processes those images one by one, which takes some time.
zotero-ocr could accelerate the recognition by running several parallel Tesseract processes, but that would increase the complexity because it would require an additional processing step to combine the results of the different Tesseract processes.
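To make the extra combining step concrete, here is a minimal sketch of the parallel-process idea in Python (not the actual zotero-ocr code, which is a Zotero plugin). It assumes a `tesseract` binary on the PATH and plain-text output via the `stdout` output base. The helper names `split_into_chunks`, `ocr_page_with_tesseract` and `ocr_pages_parallel` are hypothetical. Combining PDF or hOCR output would need more work than the simple concatenation shown here.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous chunks, preserving order."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def ocr_page_with_tesseract(image_path):
    """Run one Tesseract process on one image and return the recognized text.

    Assumption: `tesseract` is on the PATH; `stdout` as the output base makes
    it print the text instead of writing a file.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_pages_parallel(image_paths, workers=4, ocr_page=ocr_page_with_tesseract):
    """OCR pages with several workers and return the texts in page order."""
    chunks = split_into_chunks(image_paths, workers)

    def run_chunk(chunk):
        return [ocr_page(path) for path in chunk]

    # Threads suffice here: each one mostly waits on an external process.
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        per_chunk = list(pool.map(run_chunk, chunks))  # map keeps chunk order

    # The combining step: flatten the per-chunk results back into one list.
    return [text for chunk in per_chunk for text in chunk]
```

Because the chunks are contiguous and `pool.map` returns results in submission order, the combined list matches the original page order regardless of which worker finishes first. The `ocr_page` parameter exists only so the sketch can be tried without Tesseract installed.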
I think it would be easier to add reasonable multithreading to the Tesseract code itself. The current multithreading in Tesseract is not helpful, but multithreading at the page level would bring a large benefit.
I agree. My plan was to investigate the current Tesseract situation before writing any code here, so thanks for this input.
The way I see it, the ideal situation for us would be if someone implemented this Tesseract issue https://github.com/tesseract-ocr/tesseract/issues/3750 . Then we'd get the functionality at a minimal cost (maybe just adding a preference for the number of threads).
Sadly, the Tesseract issue is 2 years old with no activity in sight, so I'm not really confident it will happen soon. I don't have the proper skill set to contribute on that side, unfortunately. However, I think the added complexity to implement this within the Zotero-OCR code might be manageable... I'd like to try.
Tesseract and poppler currently process pages only one at a time. When there are dozens of pages, it works slowly. Can we add multithreaded processing?