deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Speed up `PDFToTextOCRConverter` process #4257

Closed: bilgeyucel closed this issue 1 year ago

bilgeyucel commented 1 year ago

Is your feature request related to a problem? Please describe.
`PDFToTextOCRConverter` is slow. Converting one file takes approximately 40 seconds. We can try to find a way to speed it up.

Describe the solution you'd like
Inspired by #4226, multiprocessing might be a way to speed up the conversion of multiple PDFs.

Describe alternatives you've considered
N/A

Additional context
Related discussion: #4232

danielbichuetti commented 1 year ago

It will probably speed up processing a lot, limited only by the hardware the converter runs on. The best solution for pytesseract would be multiprocessing, similar to #4226, with total isolation of the object, as there are issues when sharing any objects from it.

Tesseract can be hardware intensive, so maybe it would be better to let the user turn multiprocessing on, the opposite of #4226.
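As a rough illustration of the idea above (opt-in multiprocessing where each worker is fully isolated), here is a minimal sketch. It is not the Haystack implementation: `ocr_file`, `convert_files` and the `use_multiprocessing` flag are hypothetical names, and the OCR step is a stand-in so the snippet runs without Tesseract installed. Only plain strings cross the process boundary, so no OCR object is ever shared.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, List, Optional


def ocr_file(path: str) -> str:
    """Hypothetical stand-in for the per-file OCR step.

    A real worker would import pytesseract and create every OCR-related
    object here, inside the worker process, so nothing is shared.
    """
    return f"text extracted from {path}"


def convert_files(
    paths: List[str],
    worker: Callable[[str], str] = ocr_file,
    use_multiprocessing: bool = False,  # opt-in, off by default
    max_workers: Optional[int] = None,  # None lets the pool pick a default
) -> List[str]:
    """Convert each file, optionally in a pool of isolated processes."""
    if not use_multiprocessing:
        # Sequential path: safest on constrained hardware.
        return [worker(path) for path in paths]
    # Parallel path: only file paths go in and only text comes back.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, paths))
```

When `use_multiprocessing=True`, the call should sit under an `if __name__ == "__main__":` guard on platforms that start worker processes by spawning, and `worker` has to be a module-level function so it can be pickled.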

I'm doing some experiments with PaddleOCR too.

danielbichuetti commented 1 year ago

OK, tests are complete. I found a way (based on #4226) to let the user choose between running OCR on the full page, or letting Haystack select the images, OCR only those, and merge the result with the text. Full OCR is a bit faster because it now uses multiprocessing too. When automatic mode is True, it's even faster.

The test document is textual; only the first page has pictures:
Full mode: 1000 pages - 6m 48.8s
Automatic mode: 1000 pages - 7.1s
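For scale, the two timings above can be put side by side. This is plain arithmetic on the reported numbers; the variable names are only for illustration.

```python
# Timings reported above for the 1000-page test document.
pages = 1000
full_mode_s = 6 * 60 + 48.8   # 6m 48.8s expressed in seconds
automatic_mode_s = 7.1

speedup = full_mode_s / automatic_mode_s

print(f"full mode:      {full_mode_s / pages * 1000:.1f} ms per page")
print(f"automatic mode: {automatic_mode_s / pages * 1000:.1f} ms per page")
print(f"speedup:        {speedup:.1f}x")
```

That is roughly a 58x difference on this document. Since only its first page has pictures, the gap would presumably be smaller for PDFs with more images.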

This technique provides great results when users have mixed PDFs.
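To make the difference between the two modes concrete, here is a minimal sketch of the per-page decision as described above. It is an assumption-laden illustration, not the actual converter code: the `Page` model, `convert_page`, and the `automatic` flag are hypothetical, and the OCR engine is passed in as a plain callable.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Page:
    """Hypothetical page model: a text layer plus embedded images."""
    text: str = ""
    images: List[bytes] = field(default_factory=list)
    rendered: bytes = b""  # full-page raster, only used by full OCR


def convert_page(
    page: Page,
    ocr: Callable[[bytes], str],
    automatic: bool = True,
) -> str:
    """Return the page text using full OCR or the automatic mode."""
    if not automatic:
        # Full mode: OCR the whole rendered page, ignoring the text layer.
        return ocr(page.rendered)
    # Automatic mode: keep the existing text layer and OCR only the
    # embedded images, then merge both into one string.
    parts = [page.text] + [ocr(image) for image in page.images]
    return "\n".join(part for part in parts if part)
```

In this sketch a text-only page triggers no OCR call at all in automatic mode, which would explain why a mostly textual document converts so much faster.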

I'll integrate it in the coming days.