deanmalmgren / textract

extract text from any document. no muss. no fuss.

http://textract.readthedocs.io

MIT License

3.86k stars 592 forks

pdf parser: chain pdftotext/pdfminer + tesseract #77

Open deanmalmgren opened 9 years ago

deanmalmgren commented 9 years ago

@pudo proposed this idea in https://github.com/deanmalmgren/textract/pull/66#issuecomment-54709071 and I wanted to be sure to capture it before I forget.

With the way that the pdf parser currently works, you have to know beforehand whether the pdf is a scanned image or whether it has embedded text. This is inconvenient for end users. A better option would be:

textract some_pdf.pdf              # try to extract embedded text first. if that fails, try OCR
textract -m tesseract scanned.pdf  # do OCR
textract -m pdftotext embedded.pdf # do text extraction with pdftotext utility
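
For the default case, something has to decide that the embedded-text attempt "failed". Here is a rough sketch of one possible check, assuming that a scanned pdf yields only whitespace and form feeds from a text extractor. The names `has_usable_text` and `min_chars` are hypothetical and not part of textract, and the threshold is a guess that would need tuning.

```python
def has_usable_text(text, min_chars=20):
    """Return True if extracted text looks like real content.

    Hypothetical helper, not part of textract. Assumes that output made up
    only of whitespace (spaces, newlines, form feeds) means the pdf has no
    embedded text and should be sent to OCR instead.
    """
    # str.split() with no arguments splits on any whitespace, form feeds included.
    visible = "".join(text.split())
    return len(visible) >= min_chars
```

If this returns False for the extractor output, the default invocation would move on to OCR; an explicit `-m` choice would skip the check entirely.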

deanmalmgren commented 9 years ago

Now that I'm thinking about it, this is also related to #50, #51, and #52, whose goal is to offer some viable Python alternatives to the existing command-line implementations in case someone can't install all of the required system packages.

In general, it would be great to come up with some easy and clear ways to have reliable fallback behavior to make textract as easy to use as possible. One way this could be implemented is by having an ordered list of methods to try (ordered by likely text extraction fidelity), where things naturally fall back to other methods when the "best guess" doesn't work. I'm sure other programs have thought about this behavior quite a bit; any suggestions out there?
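
To make the ordered-list idea concrete, here is a minimal sketch and not an actual textract implementation. Each method is a `(name, extractor)` pair, tried in order until one returns non-empty text. `extract_with_fallback`, `ExtractionError`, and the `only` argument (standing in for `-m`) are all hypothetical names, and the extractors are assumed to be plain callables that take a path and return a string.

```python
class ExtractionError(Exception):
    """Raised when no method in the chain produces any text."""


def extract_with_fallback(path, methods, only=None):
    """Try each (name, extractor) pair in order.

    Returns (name, text) for the first method that yields non-empty text.
    `only` restricts the chain to one named method, like `-m` on the
    command line. Hypothetical sketch, not textract's API.
    """
    if only is not None:
        methods = [(name, fn) for name, fn in methods if name == only]
        if not methods:
            raise ValueError(f"unknown method: {only!r}")

    failures = []
    for name, extractor in methods:
        try:
            text = extractor(path)
        except Exception as exc:  # e.g. missing system package, parse error
            failures.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        failures.append(f"{name}: no text found")

    # Every method failed or came back empty; report why.
    raise ExtractionError("; ".join(failures))
```

A method whose system package is missing just raises and the next one gets a turn, which would also cover the pure-Python alternatives: put them later in the list as fallbacks.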

Ninoninoninonino commented 7 years ago

Is there any news on this?

deanmalmgren commented 7 years ago

No, there isn't. Please feel free to put together a PR if this would be useful for you, though; contributions welcome @Ninoninoninonino