Open deanmalmgren opened 9 years ago
Now that I'm thinking about it, this is also related to #50, #51, and #52, whose goal is to offer some viable Python alternatives to the existing command line implementations in case someone can't install all of the required system packages.
In general, it would be great to come up with some easy and clear ways to have reliable fallback behavior to make textract as easy to use as possible. One way this could be implemented is by having an ordered list of methods to try (ordered by likely text extraction fidelity), where things naturally fall back to other methods when the "best guess" doesn't work. I'm sure other programs have thought about this behavior quite a bit; any suggestions out there?
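A minimal sketch of the ordered-fallback idea, assuming each extraction method is a plain callable that takes a file path and returns text. The names `extract_with_fallback` and `ExtractionError` are hypothetical and not part of textract; treating an empty result as a failure is also an assumption.

```python
from typing import Callable, Iterable, List


class ExtractionError(Exception):
    """Hypothetical error raised when every method in the list fails."""


def extract_with_fallback(path: str, methods: Iterable[Callable[[str], str]]) -> str:
    """Try each method in order (best expected fidelity first).

    A method counts as failed if it raises or returns only whitespace;
    the first non-empty result wins.
    """
    failures: List[str] = []
    for method in methods:
        name = getattr(method, "__name__", repr(method))
        try:
            text = method(path)
        except Exception as exc:  # e.g. a missing system package
            failures.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return text
        failures.append(f"{name}: returned no text")
    raise ExtractionError(
        f"all methods failed for {path!r}: " + "; ".join(failures)
    )
```

Collecting the per-method failures means that when nothing works, the final error still says which methods were tried and why each one was skipped.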
Is there any news on this?
No, there isn't. Please feel free to put together a PR if this would be useful for you, though; contributions welcome @Ninoninoninonino
With the way that the PDF parser currently works, you have to know beforehand whether the PDF is a scanned image or whether it has embedded text. This is inconvenient for end users. A better option would be: