deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License
3.89k stars 599 forks

Error extracting text from pdf #396

Closed (sirwentemi closed this issue 3 years ago)

sirwentemi commented 3 years ago

Describe the bug
I need help figuring out how to resolve this error, which appears when I run the code below:

text = textract.process("C:/Users/house-phase/Desktop/topic-modelling/pdf/JC.pdf", method='pdfminer')

[image]


jpweytjens commented 3 years ago

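Windows paths normally contain backslashes, and those are what get misinterpreted in a regular Python string: sequences such as \t or \U are treated as escapes. Forward slashes, as in the path above, are read literally. You can either use a raw string such as r"C:\Users\house-phase\Desktop\topic-modelling\pdf\JC.pdf" when the path is written with backslashes, or the pathlib library to handle the separators for you.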

import pathlib
import textract

# pathlib handles the path separators for the current platform
file = pathlib.Path("C:/Users/house-phase/Desktop/topic-modelling/pdf/JC.pdf")
text = textract.process(str(file), method='pdfminer')
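For reference, here is a minimal sketch of how Python treats the two slash styles in string literals, using the path from this issue. It uses `PureWindowsPath` so it runs on any OS. The `broken` example path and the `extract_pdf_text` helper are hypothetical names added for illustration, and the helper assumes textract is installed and that the file exists.

```python
from pathlib import PureWindowsPath

# Raw string: backslashes are kept literally, with no escape processing.
raw = r"C:\Users\house-phase\Desktop\topic-modelling\pdf\JC.pdf"

# Forward slashes need no escaping in a normal string literal.
forward = "C:/Users/house-phase/Desktop/topic-modelling/pdf/JC.pdf"

# PureWindowsPath accepts both separators and compares the parsed parts.
same = PureWindowsPath(raw) == PureWindowsPath(forward)

# A normal literal turns "\t" into a tab character, silently changing the path.
broken = "pdf\topic.txt"  # hypothetical path, for illustration only


def extract_pdf_text(path):
    """Hypothetical helper: pass a normalized path string to textract."""
    import textract  # imported lazily so the checks above run without it

    return textract.process(str(path), method="pdfminer")
```

If the two spellings compare equal, the separators are not the cause of the failure, and the error shown in the screenshot would need to be looked at directly.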