UB-Mannheim / zotero-ocr

Zotero Plugin for OCR
GNU Affero General Public License v3.0

Increase multithreading processing capability #72

Open chenziliang0725 opened 6 months ago

chenziliang0725 commented 6 months ago

Tesseract and poppler currently process pages only one at a time. When a document has dozens of pages, this works slowly. Can we increase the multithreading processing capability?

aborel commented 6 months ago

There could be ways to do this, at least for tesseract. I'll take a look.

stweil commented 6 months ago

Currently the code runs the Tesseract executable with a list of page images. Tesseract then processes those images one by one, which takes some time.

zotero-ocr could accelerate the recognition by running several parallel Tesseract processes, but that would increase the complexity because it would require an additional processing step to combine the results of the different Tesseract processes.
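To illustrate the idea of parallel Tesseract processes plus a combining step, here is a minimal sketch in Python. It is not zotero-ocr code: the function names `run_tesseract` and `ocr_pages_parallel`, the default of 4 workers, and the choice of plain-text output joined with form feeds are all assumptions made for the example. It starts one `tesseract` process per page image and merges the text in page order.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_tesseract(image_path, lang="eng"):
    """OCR a single page image with the tesseract CLI and return its text.

    Assumes a `tesseract` executable is on PATH; passing "stdout" as the
    output base makes tesseract print the recognized text instead of
    writing a file.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", lang],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def ocr_pages_parallel(image_paths, ocr_one=run_tesseract, workers=4):
    """Run `ocr_one` on several pages at once and combine the results.

    Threads are sufficient here because the real work happens in the
    child processes. `pool.map` returns results in input order, so the
    pages stay in sequence regardless of which process finishes first.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(ocr_one, image_paths))
    # Hypothetical combining step: join page texts with form feeds.
    return "\f".join(texts)
```

The combining step is trivial here only because the output is plain text. Merging per-page PDF or hOCR output would need a real merge step, which is the extra complexity mentioned above.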

I think it would be easier to add reasonable multithreading to the Tesseract code itself. The current multithreading in Tesseract is not helpful, but multithreading on the page level would have a large benefit.

aborel commented 6 months ago

I agree. My plan was to investigate the current Tesseract situation before writing any code here, so thanks for this input.

aborel commented 6 months ago

The way I see it, the ideal situation for us would be if someone implemented this Tesseract issue: https://github.com/tesseract-ocr/tesseract/issues/3750 Then we'd get the functionality at a minimal cost (maybe just adding a preference for the number of threads).

Sadly, the Tesseract issue is 2 years old with no activity in sight, so I'm not really confident it will happen soon. I don't have the proper skill set to contribute on that side, unfortunately. However, I think the added complexity to implement this within the Zotero-OCR code might be manageable... I'd like to try.