deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

Pdfminer and Tesseract not found #321

Open ObitoSigma opened 4 years ago

ObitoSigma commented 4 years ago

Using Python 3.7.6, Pip 20.0.2, Conda 4.8.2, Spyder 4.0.1, and Textract 1.6.3.

When calling textract.process('url', method='METHOD'), 'pdftotext' runs without problems (but the PDF is not text, so it prints gibberish). With 'tesseract' or 'pdfminer', I get the following error(s), which I'm hoping to resolve (the example below is for tesseract). I'm not well-versed in programming languages, so let me know if it's anything obvious.

Traceback (most recent call last):

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\utils.py", line 82, in run
    pipe = subprocess.Popen(

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 104, in __init__
    super(SubprocessPopen, self).__init__(*args, **kwargs)

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\subprocess.py", line 1307, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,

FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\untitled0.py", line 9, in <module>
    text = textract.process('C:/Users/hanto/Desktop/Peapod1.pdf', method='tesseract', language='eng')

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\__init__.py", line 77, in process
    return parser.process(filename, encoding, **kwargs)

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\pdf_parser.py", line 33, in extract
    return self.extract_tesseract(filename, **kwargs)

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\pdf_parser.py", line 61, in extract_tesseract
    page_content = TesseractParser().extract(page_path, **kwargs)

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\image.py", line 20, in extract
    stdout, _ = self.run(args)

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\utils.py", line 90, in run
    raise exceptions.ShellError(

ShellError: The command `tesseract C:\Users\hanto\AppData\Local\Temp\tmpqsxyoes8\conv-1.ppm stdout -l eng` failed with exit code 127
------------- stdout -------------
------------- stderr -------------
jpweytjens commented 4 years ago

Tesseract is an external dependency that is not installed automatically along with textract. Since you're using Conda, you should be able to install the package at this link.

Pdfminer however is a Python dependency and should have been installed with textract. Could you show the complete error log when trying the pdfminer method?
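A quick way to see whether the external executables are visible to the Python process that runs textract (which, under Spyder or a Conda environment, may have a different PATH than a terminal) is sketched below. It uses only the standard library. The tool names are taken from the methods mentioned in this thread, and `find_missing_tools` is a hypothetical helper, not part of textract.

```python
import shutil


def find_missing_tools(tools, which=shutil.which):
    """Return the names in `tools` that cannot be located on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    # Executables that textract shells out to for the PDF methods discussed here.
    missing = find_missing_tools(["tesseract", "pdftotext"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All tools found on PATH.")
```

If a tool is reported as missing even though it is installed, the directory containing it is probably not on the PATH seen by this interpreter.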

ObitoSigma commented 4 years ago

I installed tesseract via the link but still got the same error message. Here is the error for pdfminer:

Traceback (most recent call last):

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\utils.py", line 82, in run
    pipe = subprocess.Popen(

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 104, in __init__
    super(SubprocessPopen, self).__init__(*args, **kwargs)

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\subprocess.py", line 1307, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,

FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "C:\Users\hanto\AppData\Local\Temp\untitled0.py", line 9, in <module>
    text = textract.process('C:/Users/hanto/Desktop/Peapod1.pdf', method='pdfminer', language='eng')

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\__init__.py", line 77, in process
    return parser.process(filename, encoding, **kwargs)

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\pdf_parser.py", line 31, in extract
    return self.extract_pdfminer(filename, **kwargs)

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\pdf_parser.py", line 48, in extract_pdfminer
    stdout, _ = self.run(['pdf2txt.py', filename])

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\utils.py", line 90, in run
    raise exceptions.ShellError(

ShellError: The command `pdf2txt.py C:/Users/hanto/Desktop/Peapod1.pdf` failed with exit code 127
------------- stdout -------------
------------- stderr -------------
RaSan147 commented 3 years ago

Do you have the pdf2txt binary installed on your computer?
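Note that on Windows a `.py` script generally cannot be started by name through `subprocess.Popen` without a shell, so `pdf2txt.py` may fail even when pdfminer is installed. A minimal sketch of a workaround, assuming the script was installed into the environment's `Scripts` (or `bin`) directory or somewhere on PATH, is to locate the file and run it through the current interpreter. `find_script` and `pdf2txt_command` are hypothetical helpers, not part of textract or pdfminer.

```python
import os
import sys


def find_script(name, search_dirs):
    """Return the first existing path to `name` inside `search_dirs`, else None."""
    for directory in search_dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


def pdf2txt_command(pdf_path, search_dirs=None):
    """Build an argv list that runs pdf2txt.py with the current interpreter.

    Returns None when the script cannot be found.
    """
    if search_dirs is None:
        # Assumption: console scripts live in <env>/Scripts on Windows or
        # <env>/bin elsewhere; fall back to every directory on PATH.
        search_dirs = [
            os.path.join(sys.prefix, "Scripts"),
            os.path.join(sys.prefix, "bin"),
        ]
        search_dirs += os.environ.get("PATH", "").split(os.pathsep)
    script = find_script("pdf2txt.py", search_dirs)
    if script is None:
        return None
    return [sys.executable, script, pdf_path]
```

The returned list can be passed to `subprocess.run(cmd, capture_output=True)`. A `None` result suggests the script is not installed in this environment at all.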