documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.io/docsplit/

Add parallel processing to OCR text extraction of full documents #124

Open ntodd opened 9 years ago

ntodd commented 9 years ago

Leverage GNU Parallel to OCR multiple pages at once. If Parallel is installed, a full-document extraction generates an image for each page and then runs one tesseract process per available core. If Parallel is not installed, or only a subset of pages is requested, the old behavior is used. This speeds up OCR processing significantly on multi-core machines.

With a bit more work, this could be leveraged by the other OCR code paths.
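
The fallback described above could be sketched roughly as follows. This is not the code from the pull request, only an illustration of the idea: the helper names `executable_on_path?` and `ocr_command` are hypothetical, and the page image filenames are made up. It assumes GNU Parallel's `{}` and `{.}` replacement strings (the input, and the input without its extension) and tesseract's `tesseract IMAGE OUTPUTBASE` invocation. The sketch only builds the command string and does not run it.

```ruby
require 'shellwords'

# Hypothetical helper: true if an executable named `cmd` is found on PATH.
def executable_on_path?(cmd)
  ENV.fetch('PATH', '').split(File::PATH_SEPARATOR).any? do |dir|
    path = File.join(dir, cmd)
    File.file?(path) && File.executable?(path)
  end
end

# Hypothetical helper: build the shell command that OCRs the given page images.
# With GNU Parallel available, one command fans the pages out across cores
# (Parallel runs one job per CPU core by default). Otherwise the pages are
# processed one after another, as in the old behavior.
def ocr_command(image_paths, parallel: executable_on_path?('parallel'))
  if parallel
    files = image_paths.map { |p| Shellwords.escape(p) }.join(' ')
    # {} is the image path, {.} is the same path without its extension.
    "parallel tesseract {} {.} ::: #{files}"
  else
    image_paths.map do |p|
      base = p.sub(/\.[^.\/]+\z/, '') # strip the extension for the output base
      "tesseract #{Shellwords.escape(p)} #{Shellwords.escape(base)}"
    end.join(' && ')
  end
end
```

Calling `ocr_command(['page_1.tif', 'page_2.tif'])` picks the parallel form only when `parallel` is found on PATH; passing `parallel: false` forces the sequential form, which is the path a page-subset extraction would take under the behavior described above.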

deuxshaish commented 9 years ago

I like this a lot. Will test and observe. Thanks for the commit.

pickhardt commented 1 year ago

This is a great idea.