deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

Increases dpi for tesseract extraction of pdfs #281

Closed. parry-do closed this 5 years ago

parry-do commented 5 years ago

Tesseract recommends a minimum of 300 dpi, but pdftoppm defaults to 150 dpi. I saw poor accuracy on some documents, and increasing the dpi fixed the issue. This could be exposed as a keyword argument, but the Tesseract-recommended setting seems like a better default. Love this package, it's a lifesaver.
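
For readers who want to see what the change amounts to, here is a minimal sketch of rasterizing a PDF at 300 dpi before handing the pages to tesseract. It is not textract's actual implementation: the helper names `build_pdftoppm_command` and `rasterize_pdf`, the `page` output prefix, and the choice of PNG output are all assumptions made for illustration. The only pdftoppm options used are `-r` (resolution) and `-png`.

```python
import subprocess
from pathlib import Path

# Minimum resolution cited in this PR for good Tesseract accuracy.
TESSERACT_MIN_DPI = 300


def build_pdftoppm_command(pdf_path, output_prefix, dpi=TESSERACT_MIN_DPI):
    """Return the pdftoppm argument list that renders a PDF to PNG pages.

    Hypothetical helper, not part of textract.
    """
    if dpi <= 0:
        raise ValueError("dpi must be a positive integer")
    # -r sets both the X and Y resolution; without it pdftoppm uses 150.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(output_prefix)]


def rasterize_pdf(pdf_path, output_dir, dpi=TESSERACT_MIN_DPI):
    """Run pdftoppm and return the generated page images in sorted order.

    Hypothetical helper; requires the pdftoppm binary on PATH.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    prefix = output_dir / "page"
    subprocess.run(build_pdftoppm_command(pdf_path, prefix, dpi), check=True)
    # pdftoppm names its output <prefix>-<page number>.png
    return sorted(output_dir.glob("page-*.png"))
```

Calling `rasterize_pdf("scan.pdf", "pages")` would then produce images that can be passed to tesseract one at a time. Keeping `dpi` as a keyword argument lets callers lower it for speed while 300 stays the default.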

jpweytjens commented 5 years ago

The pdf parser will get a dpi keyword with a default value of 300 in the upcoming update. Thanks for the PR!