Error extracting text from pdf

deanmalmgren / textract

extract text from any document. no muss. no fuss.

http://textract.readthedocs.io

MIT License

Error extracting text from pdf #396

Status: closed by sirwentemi 3 years ago

sirwentemi commented 3 years ago

Describe the bug

I need help figuring out how to resolve this error, which occurs when I run the code below:

text = textract.process("C:/Users/house-phase/Desktop/topic-modelling/pdf/JC.pdf", method='pdfminer')

Desktop (please complete the following information):

OS: Windows 10
Textract version: 1.6.4
Python version: 3.8.5
Virtual environment: no

Thanks in advance.

jpweytjens commented 3 years ago

Windows paths normally contain backslashes, and those don't get interpreted correctly in a regular string because Python treats the backslash as an escape character. Forward slashes are not affected by this. You can either use a raw string r"C:/Users/house-phase/Desktop/topic-modelling/pdf/JC.pdf" or the pathlib library to handle the separators for you.

import pathlib
import textract

# pathlib handles the path separators for the current platform
file = pathlib.Path("C:/Users/house-phase/Desktop/topic-modelling/pdf/JC.pdf")
text = textract.process(str(file), method='pdfminer')
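
To see how Python treats the different ways of spelling a Windows path, here is a minimal sketch. The path C:\temp\new.pdf is a made-up example (not the file from this issue), chosen because it contains the escape sequences \t and \n. PureWindowsPath is used so the snippet behaves the same on any operating system; it only manipulates the path and never touches the disk.

```python
from pathlib import PureWindowsPath

# Hypothetical path, used only to illustrate escape handling.
# In a regular string literal, "\t" becomes a tab and "\n" a newline.
plain = "C:\temp\new.pdf"
# A raw string keeps every backslash literally.
raw = r"C:\temp\new.pdf"
# Forward slashes need no escaping at all.
forward = "C:/temp/new.pdf"

# PureWindowsPath applies Windows path rules on any platform.
raw_path = PureWindowsPath(raw)
forward_path = PureWindowsPath(forward)

print("\t" in plain)             # did a tab character sneak into the string?
print(raw_path == forward_path)  # do both spellings name the same path?
print(str(forward_path))         # how the path is rendered as a string
```

If the first check reports a tab in the string, the literal no longer names the intended file, which is the situation raw strings and pathlib are meant to avoid.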