Enable PDF input for tesseract engine

deajan / pmOCR

A wrapper for tesseract / abbyyOCR11 ocr4linux finereader cli that can perform batch operations or monitor a directory and launch an OCR conversion on file activity

BSD 3-Clause "New" or "Revised" License

Enable PDF input for tesseract engine #3

Closed mhelff closed 8 years ago

mhelff commented 9 years ago

Hi, this time it's a bit more complex...

Since tesseract cannot take PDF files as input, they have to be converted to TIFF first. Ghostscript must be installed to use this feature. I also unwound the huge OCR one-liner, for tesseract only. I don't know whether you'll like the while loop + function instead; I'm open to comments.
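For readers without the diff at hand, here is a minimal sketch of the idea described above: convert PDFs to a temporary TIFF with Ghostscript, then run tesseract from a function called in a while loop. It is not the code from this PR. The function names (`pdf_to_tiff`, `ocr_file`, `ocr_dir`), the `tiffgray` device, the 300 dpi default, and the list of image extensions are all assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the PR's actual code. Requires gs and tesseract.

# Rasterize a PDF into a (possibly multi-page) grayscale TIFF.
# The tiffgray device and the 300 dpi default are assumptions.
pdf_to_tiff() {
    local pdf="$1" tiff="$2" dpi="${3:-300}"
    gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r"$dpi" \
       -sOutputFile="$tiff" "$pdf"
}

# OCR a single file. PDFs go through a temporary TIFF first;
# anything else is handed to tesseract directly.
# tesseract writes its text output to "<output_base>.txt".
ocr_file() {
    local input="$1"
    local output_base="${input%.*}"
    local workdir status
    case "$input" in
        *.[pP][dD][fF])
            workdir="$(mktemp -d)" || return 1
            pdf_to_tiff "$input" "$workdir/pages.tif" &&
                tesseract "$workdir/pages.tif" "$output_base"
            status=$?
            rm -rf "$workdir"   # always drop the intermediate TIFF
            return "$status"
            ;;
        *)
            tesseract "$input" "$output_base"
            ;;
    esac
}

# Walk a directory and OCR every matching file, one at a time.
# NUL-delimited names keep paths with spaces or newlines intact.
ocr_dir() {
    local dir="$1" f
    find "$dir" -type f \( -iname '*.pdf' -o -iname '*.tif' \
        -o -iname '*.tiff' -o -iname '*.png' -o -iname '*.jpg' \) -print0 |
    while IFS= read -r -d '' f; do
        # </dev/null so the tools cannot swallow the file list on stdin
        ocr_file "$f" </dev/null
    done
}
```

Calling `ocr_dir /path/to/scans` (a placeholder path) would leave a `.txt` file next to each input. Splitting the work into a function per step is what makes the PDF branch easy to add, compared with a single `find ... -exec` one-liner.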

deajan commented 9 years ago

Thanks for the PR, I'll merge it next week (I must refactor the ugly one-liner anyway). Btw, I had bad results using -r300x300 and had to use -density 300 -units pixelsperinch instead to get consistent TIFFs, because some apps would show the resulting image bigger than it should be.
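A sketch of what that alternative might look like, assuming the -density and -units options quoted above are the ImageMagick ones (the comment does not name the tool). The function name `pdf_to_tiff_im` and the 300 dpi default are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: assumes ImageMagick's convert is installed and
# that it is the tool the -density / -units options refer to.

# Rasterize a PDF at the given density and record the resolution unit
# in the TIFF, so viewers agree on the physical image size.
pdf_to_tiff_im() {
    local pdf="$1" tiff="$2" dpi="${3:-300}"
    # -density must come before the input file to affect rasterization.
    convert -density "$dpi" -units PixelsPerInch "$pdf" "$tiff"
}
```

Usage would be `pdf_to_tiff_im input.pdf output.tif` (placeholder file names).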

deajan commented 8 years ago

Totally forgot about this PR...