cfculhane / AnkiOCR

Anki Addon to create searchable text from images in notes, using Tesseract OCR
MIT License

Faster batch OCR via ThreadPoolExecutor #7

Closed phu54321 closed 3 years ago

phu54321 commented 4 years ago

This speeds up OCR via multithreaded tesseract instances. Useful for batch-OCRing hundreds of notes.

I personally have about 10,000 notes to OCR, so this fix was necessary.

phu54321 commented 3 years ago

This still isn't optimal, as it spawns a new tesseract instance for each image. Ideally it should group images into chunks that are OCRed in a single tesseract run. Tesseract can OCR multiple images at once.

If you're OCRing 100+ images, the addon could split them into segments of 10 images and apply tesseract to each segment.
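The segmenting suggested above could be sketched with a small helper (hypothetical, not part of the addon; the name `chunk` and the paths are illustrative):

```python
# Hypothetical helper (not in the addon): split image paths into
# fixed-size segments so each tesseract run handles one segment.
def chunk(items, size=10):
    return [items[i:i + size] for i in range(0, len(items), size)]

paths = [f"img_{n}.png" for n in range(105)]
batches = chunk(paths, 10)
# 105 paths -> 11 segments: ten of 10 images and one of 5
```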

cfculhane commented 3 years ago

Thanks for looking into this, I'll have a crack at batching them today or tomorrow and push a change out.


cfculhane commented 3 years ago

In the middle of a big refactor which I think will allow this approach. The aim is to untangle the processing step, which is currently in one function OCR.ocr_process(), into three steps, allowing easier batching and debugging. The steps would be:

  1. Collect all notes via a query to the collection, and build a list of notes to process, represented as a list of NoteImages dataclasses (which will have a hierarchy of NoteImages -> FieldImages -> OCRImage, all lightweight dataclasses).
  2. Pass this into a process_notes() method, allowing it to be batched/split up, multithreaded, etc.
  3. After processing is complete, or even concurrently with the above asynchronously, add these modified NoteImages back into the collection and save the updates to the db.
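The dataclass hierarchy described in step 1 might look something like this (a sketch only; the class names come from the comment above, but the field names and types are assumptions):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OCRImage:
    img_path: str
    text: str = ""  # filled in during the OCR step

@dataclass
class FieldImages:
    field_name: str
    images: List[OCRImage] = field(default_factory=list)

@dataclass
class NoteImages:
    note_id: int
    fields: List[FieldImages] = field(default_factory=list)
```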

Thoughts on this?

I'm also writing some basic unit tests to ease development pain and to flesh the project out a bit more.

phu54321 commented 3 years ago

That'd be great. I haven't modified the overall structure, to keep the code consistent with the previous version, but if you're in for a structural change, that'd of course be better :)

cfculhane commented 3 years ago

Looking more into this: as you noted, Tesseract can already consume a list of images to batch-process. The PyTesseract library has an issue that discusses speed: https://github.com/madmaze/pytesseract/issues/261

I then compared the speed of sending each image path to pytesseract sequentially, sending a list of paths to tesseract in a text file, and the final option: using multiprocessing.Pool to spawn multiple tesseract processes. For a small number of images (say 5) the overhead from multiprocessing is not worth it, but over lots of images (e.g. 200) the difference was significant. For 200 images:

- time_iter = 171 s (each image path sent sequentially to tesseract)
- time_batched_input = 139.3 s (input.txt sent to tesseract)
- time_multiprocessing = 57 s
- time_multiprocessing_batched = 43.982 s (sending batches of 10 images to each worker)
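The "input.txt sent to tesseract" option could be sketched like this (a hedged sketch; the helper names `write_batch_list`/`ocr_batch` are hypothetical, but tesseract's CLI does accept a text file listing image paths in place of a single image):

```python
import subprocess
import tempfile

def write_batch_list(image_paths):
    # Tesseract accepts a plain-text file listing one image path per
    # line in place of a single image argument, OCRing them in one run.
    f = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
    f.write("\n".join(image_paths))
    f.close()
    return f.name

def ocr_batch(image_paths, out_base):
    list_file = write_batch_list(image_paths)
    subprocess.run(["tesseract", list_file, out_base], check=True)
    return out_base + ".txt"  # tesseract writes the combined text here
```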

Note that tesseract already uses multithreading, but doesn't appear to use multiple cores. Also, using as many workers as there are cores in your machine locked things up pretty well, with 100% utilisation on all cores!

phu54321 commented 3 years ago

My current PR utilizes ThreadPool, which acts like multiprocessing but with far less overhead (Python thread based). Note that the current performance bottleneck is the OCR process itself, which is not in Python, so the GIL doesn't have much effect.
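The thread-pool approach described here could be sketched as follows (an illustration, not the PR's actual code; `run_cmd`/`run_parallel` are hypothetical names):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_cmd(cmd):
    # Python releases the GIL while blocked waiting on an external
    # process, so plain threads overlap the tesseract runs without
    # the pickling/startup overhead of multiprocessing.
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def run_parallel(cmds, max_workers=3):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_cmd, cmds))

# For OCR, each command would look like ["tesseract", img_path, "stdout"]
```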

cfculhane commented 3 years ago

Even better then; the results with ThreadPool are faster than mp.Pool, which surprised me. Due to the refactoring changes I won't merge this PR, but rest assured the next version will use ThreadPool with a user-configurable MAX_THREADS, as it tends to hog all of the CPU haha.

Final results, showing 4.4x improvement:

NUM PROCESSES/THREADS : 3
BATCH_SIZE : 10
Number of images = 200
'gen_batched_txts' completed in 0.008 s
Generated 20 batches of max 10 images
'seq_iter' completed in 170.1169 s
'big_batch' completed in 132.7926 s
'mp_pool' completed in 63.4509 s
'mp_pool_batched' completed in 57.1407 s
'thread_pool_exec' completed in 48.9472 s
'process_pool_exec' completed in 41.051 s
'process_pool_exec_batched' completed in 38.7359 s
'thread_pool_exec_batched' completed in 38.5955 s

Thanks again for raising this PR!

phu54321 commented 3 years ago

Thanks for having time :)

cfculhane commented 3 years ago

Closing, dev branch now uses ThreadPoolExecutor, see https://github.com/cfculhane/AnkiOCR/commit/ecc36c8e87f1e971c4c2f6a2956850e86e863ae0

thiswillbeyourgithub commented 3 years ago

Hi, sorry to reopen this,

Reading this thread made me wonder something. I really don't want to sound like an a**, but aren't you both confusing multithreading and multiprocessing?

I actually learned today that I had been mistaking the two for a while, so I figured maybe I'm not alone.

I spent the whole morning trying different scripts, comparing different implementations and settings (not related to OCR), and it seems to me that multiprocessing is more interesting in this case than multithreading, don't you think? Especially if tesseract already uses multiple threads on its own.

@cfculhane That would explain why I didn't get 6 cores at 100% when I set num_thread to 6 some time ago... That's so obvious now.

I would think this might be related to why Anki often freezes during execution.

I gathered quite a lot of snippets from this morning if you are interested in taking a look.

Again, I really don't want to appear ungrateful or anything, especially given that you two are obviously better coders.

phu54321 commented 3 years ago

That's a very legitimate question, actually.

I spent the whole morning trying different script for myself comparing different implementation and different settings and (not related to OCR)

You should try it with OCR. The OCR is not in Python, and this is important: the limitations of multithreading in Python don't apply here. Python uses multiprocessing instead of multithreading because of the GIL (see the Python docs quoted below). Since we're not using Python for the OCRing, we don't need multiprocessing.

multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.

Non-Python languages often use multithreading instead of multiprocessing; it's worth reading up on the distinction.
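The point about the GIL can be seen in a toy demonstration (hypothetical; `sleep` stands in for a tesseract run, since CPython releases the GIL while blocked on an external process):

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

def external_work():
    subprocess.run(["sleep", "0.5"])  # stand-in for one tesseract run

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(external_work) for _ in range(4)]
    for fut in futures:
        fut.result()
elapsed = time.perf_counter() - start
# The four 0.5 s waits overlap across threads, so elapsed stays near
# 0.5 s instead of the ~2 s a sequential loop would take.
```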

thiswillbeyourgithub commented 3 years ago

That is extremely useful to me. Thank you very much.