deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

Handle ShellError when calling pdf2txt.py #495

Open dhrim opened 10 months ago

dhrim commented 10 months ago

Environment:

When I execute the following code,

textract.process("test.pdf", method='pdfminer')

it fails with this error message:

b"/Users/rim/.pyenv/versions/3.10.6/bin/pdf2txt.py: 
line 1: A command line tool for extracting text and images from PDF and output it to plain text, html, xml or tags.: command not found\n/Users/rim/.pyenv/versions/3.10.6/bin/pdf2txt.py: 
line 2: import: command not found\n/Users/rim/.pyenv/versions/3.10.6/bin/pdf2txt.py: 
line 3: import: command not found\n/Users/rim/.pyenv/versions/3.10.6/bin/pdf2txt.py: 
line 4: import: command not found\n/Users/rim/.pyenv/versions/3.10.6/bin/pdf2txt.py: 
line 5: import: command not found\n/Users/rim/.pyenv/versions/3.10.6/bin/pdf2txt.py: 
line 7: import: command not found\n/Users/rim/.pyenv/versions/3.10.6/bin/pdf2txt.py: 
line 8: import: command not found\n/Users/rim/.pyenv/versions/3.10.6/bin/pdf2txt.py: 
line 9: from: command not found\n/Users/rim/.pyenv/versions/3.10.6/bin/pdf2txt.py: 
line 14: syntax error near unexpected token `def'\n/Users/rim/.pyenv/versions/3.10.6/bin/pdf2txt.py: 
line 14: `def extract_text(files=[], outfile='-','\n"

The self.run(...) call in the following code from textract/parsers/pdf_parser.py raises ShellError, not OSError, so the ShellError is never handled. The fix is simply to add ShellError to the except clause.

class Parser(ShellParser):

    def extract_pdfminer(self, filename, **kwargs):
        ...

        try:
            stdout, _ = self.run(['pdf2txt.py', filename])
        except OSError:
            ...
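
The proposed change could look roughly like the sketch below. It is self-contained rather than a patch against textract: `ShellError` here is a simplified stand-in for textract's exception, `extract_with_fallback` and `fake_run` are hypothetical names, and the fallback of re-running the script through the Python interpreter is an assumption about what the elided except body does.

```python
import sys


class ShellError(Exception):
    """Simplified stand-in for textract's ShellError (assumption for this sketch)."""


def extract_with_fallback(run, filename, script_path):
    """Try pdf2txt.py directly; on OSError or ShellError, retry via Python.

    `run` is any callable that takes an argument list and returns
    a (stdout, stderr) tuple, mirroring how self.run is used above.
    """
    try:
        stdout, _ = run(['pdf2txt.py', filename])
    except (OSError, ShellError):
        # Direct execution failed (for example, the shell tried to interpret
        # the Python script itself), so invoke it through the interpreter.
        stdout, _ = run([sys.executable, script_path, filename])
    return stdout


def fake_run(args):
    """Hypothetical runner: direct execution fails, interpreter execution works."""
    if args[0] == 'pdf2txt.py':
        raise ShellError('pdf2txt.py exited with a non-zero status')
    return b'extracted text', b''


print(extract_with_fallback(fake_run, 'test.pdf', '/path/to/pdf2txt.py'))
```

Catching both exception types in one clause keeps the existing OSError behaviour and also covers the case reported here, where the command starts but exits with an error.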