deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License
3.89k stars 597 forks

textract doesn't work #241

Closed RAZelzner closed 5 years ago

RAZelzner commented 6 years ago

Hey there,

I'm new to Python. I'm using PyCharm 2018.2 and the latest version of Anaconda on Windows 10.

After solving all the problems with installing textract on Windows 10, I got a successful installation using the Anaconda Prompt. In addition, I set the PyCharm Project Interpreter to \continuum\anaconda3\python.exe.

My goal is to extract the text from large PDF files and save it as a .txt file.

I tried the test_pdf.py files from textract, but they don't work. The resulting message was: "textract" is misspelled or cannot be found (my own translation from German :-/).

So I tried my own code, following the textract page, but it doesn't work either:

Code:
import textract
text = textract.process('pfad/large.pdf')

Results:
C:\Users\raz\AppData\Local\Continuum\anaconda3\python.exe "C:/Users/raz/Google Drive/FOM/Master/Master/NurText/Testo.py"
Traceback (most recent call last):
  File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers\utils.py", line 85, in run
    stdout=subprocess.PIPE, stderr=subprocess.PIPE,
  File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] Das System kann die angegebene Datei nicht finden
(English: The system cannot find the file specified)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/raz/Google Drive/FOM/Master/Master/NurText/Testo.py", line 2, in <module>
    text = textract.process('pfad/large.pdf')
  File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers\__init__.py", line 77, in process
    return parser.process(filename, encoding, **kwargs)
  File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers\utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)
  File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers\pdf_parser.py", line 28, in extract
    raise ex
  File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers\pdf_parser.py", line 20, in extract
    return self.extract_pdftotext(filename, **kwargs)
  File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers\pdf_parser.py", line 43, in extract_pdftotext
    stdout, _ = self.run(args)
  File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers\utils.py", line 92, in run
    ' '.join(args), 127, '', '',
textract.exceptions.ShellError: The command `pdftotext pfad/large.pdf -` failed with exit code 127
------------- stdout -------------
------------- stderr -------------

Thanks for your help

ywncmt commented 5 years ago

I encountered exactly the same problem while extracting PDF files on Windows 10.

jpweytjens commented 5 years ago

This is similar to #256 and #261. Please update to textract 1.6.2 and let us know if that doesn't solve the problem.

jpweytjens commented 5 years ago

I'm closing this issue due to inactivity. If you still encounter the issue with the latest version of textract, feel free to leave a comment with additional information and I'll reopen the issue.

selfcs commented 4 years ago

Hi jpweytjens, I get the same error.

First, my environment:

OS: Windows
Textract version: 1.6.3
Python version: 3.7
Virtual environment: no

I get ShellError('pdftotext d:\WorkPlace\NetState\download_files\2018Q4.pdf -', 127, '', '')

when I run the script with python **.py

But I'm puzzled: when I use the same Python interpreter (Miniconda) and the same operating system, but run the code in JupyterLab, it runs without error.

However, it still does not parse the text correctly (I tried different encodings).

When I use the same code on a Linux system, it runs correctly.

Please excuse my poor English.

selfcs commented 4 years ago

It was my fault, and I solved it: I had not installed poppler.

1- Install the Microsoft Visual C++ Build Tools.

2- If you use Miniconda or Anaconda, run:

conda install -c conda-forge poppler

MohamedBehery commented 2 years ago

It was my fault, and I solved it: I had not installed poppler.

1- Install the Microsoft Visual C++ Build Tools.

2- If you use Miniconda or Anaconda, run:

conda install -c conda-forge poppler

Thanks man, awesome job!

corobin commented 2 years ago

Hello, I have the same problem.

Python 3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)] on win32

From pip list:
textract 1.6.4
pdfminer.six 20211012

text = textract.process('sample.pdf')
Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\site-packages\textract\parsers\utils.py", line 87, in run
    pipe = subprocess.Popen(
  File "C:\Program Files\Python310\lib\subprocess.py", line 966, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Program Files\Python310\lib\subprocess.py", line 1435, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    text = textract.process('sample.pdf')
  File "C:\Program Files\Python310\lib\site-packages\textract\parsers\__init__.py", line 79, in process
    return parser.process(filename, input_encoding, output_encoding, **kwargs)
  File "C:\Program Files\Python310\lib\site-packages\textract\parsers\utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)
  File "C:\Program Files\Python310\lib\site-packages\textract\parsers\pdf_parser.py", line 29, in extract
    raise ex
  File "C:\Program Files\Python310\lib\site-packages\textract\parsers\pdf_parser.py", line 21, in extract
    return self.extract_pdftotext(filename, **kwargs)
  File "C:\Program Files\Python310\lib\site-packages\textract\parsers\pdf_parser.py", line 44, in extract_pdftotext
    stdout, _ = self.run(args)
  File "C:\Program Files\Python310\lib\site-packages\textract\parsers\utils.py", line 95, in run
    raise exceptions.ShellError(
textract.exceptions.ShellError: The command `pdftotext sample.pdf -` failed with exit code 127
------------- stdout -------------
------------- stderr -------------
text = textract.process('sample.pdf', method='pdfminer')
Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\site-packages\textract\parsers\pdf_parser.py", line 54, in extract_pdfminer
    stdout, _ = self.run(['pdf2txt.py', filename])
  File "C:\Program Files\Python310\lib\site-packages\textract\parsers\utils.py", line 87, in run
    pipe = subprocess.Popen(
  File "C:\Program Files\Python310\lib\subprocess.py", line 966, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Program Files\Python310\lib\subprocess.py", line 1435, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
OSError: [WinError 193] %1 is not a valid Win32 application

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    text = textract.process('sample.pdf', method='pdfminer')
  File "C:\Program Files\Python310\lib\site-packages\textract\parsers\__init__.py", line 79, in process
    return parser.process(filename, input_encoding, output_encoding, **kwargs)
  File "C:\Program Files\Python310\lib\site-packages\textract\parsers\utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)
  File "C:\Program Files\Python310\lib\site-packages\textract\parsers\pdf_parser.py", line 32, in extract
    return self.extract_pdfminer(filename, **kwargs)
  File "C:\Program Files\Python310\lib\site-packages\textract\parsers\pdf_parser.py", line 57, in extract_pdfminer
    stdout, _ = self.run(['python3',pdf2txt_path, filename])
  File "C:\Program Files\Python310\lib\site-packages\textract\parsers\utils.py", line 87, in run
    pipe = subprocess.Popen(
  File "C:\Program Files\Python310\lib\subprocess.py", line 966, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Program Files\Python310\lib\subprocess.py", line 1375, in _execute_child
    args = list2cmdline(args)
  File "C:\Program Files\Python310\lib\subprocess.py", line 561, in list2cmdline
    for arg in map(os.fsdecode, seq):
  File "C:\Program Files\Python310\lib\os.py", line 822, in fsdecode
    filename = fspath(filename)  # Does type-checking of `filename`.
TypeError: expected str, bytes or os.PathLike object, not NoneType

From searching other reports, it looks like this might be caused by a missing dependency. However, the docs say that pdftotext is optional and that there is a pure-Python fallback if it is not installed. I do have pdfminer installed, but even when I select it as the method it still doesn't work. Any ideas on what the actual error is? Thanks!
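
One way to narrow this down is to check whether the external commands named in the tracebacks above (pdftotext and pdf2txt.py) can be found on PATH at all. The sketch below uses only the standard library's shutil.which. The helper name find_tools is hypothetical and not part of textract, and a missing entry only suggests, rather than proves, that this is what triggers the FileNotFoundError.

```python
import shutil


def find_tools(names=("pdftotext", "pdf2txt.py")):
    """Map each tool name to its resolved path, or None if it is not on PATH.

    The default names are taken from the tracebacks in this thread;
    find_tools itself is a hypothetical helper, not a textract function.
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If pdftotext is reported as not found, installing poppler and putting its bin folder on PATH, as described elsewhere in this thread, is the likely next step.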

mingjun1120 commented 1 year ago

Did you solve it?

corobin commented 1 year ago

Did you solve it?

No, I used pdfminer.six.

mingjun1120 commented 1 year ago

Did you solve it?

No, I used pdfminer.six.

I just solved it but I myself also don't know how it was sorted 😂😂😂

The issue at hand pertains to the installation of poppler on a Windows operating system. A possible solution can be found on a StackOverflow page titled "How to install Poppler on Windows?".

On this website, you will find a solution named Download Poppler Packaged for Windows which directs you to a GitHub page containing various versions of poppler-windows. I downloaded the most recent version and extracted the folder. Afterwards, I relocated the unzipped folder to the Program Files (x86) directory and copied the entire path leading to the bin folder: C:\Program Files (x86)\Release-23.05.0-0\poppler-23.05.0\Library\bin. I then pasted this path into the System Environment Variables.

What if the System Variables are not editable? Follow the steps below:

1st: Click the System protection button on the About page.

2nd: Click the Advanced tab, then click the Environment Variables button.

3rd: Click Path under the System Variables section, and paste the path you copied.

4th: Restart the PC.
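
If editing the system variables is not possible, a rough alternative is to prepend the poppler bin folder to PATH for the current Python process only, before calling textract. This is a minimal sketch under assumptions: prepend_to_path is a hypothetical helper (not part of textract), the folder is the one from the comment above and will differ on other machines, and it relies on child processes inheriting the modified environment.

```python
import os
import shutil


def prepend_to_path(directory, env=None):
    """Put `directory` at the front of PATH in `env` (default: os.environ).

    Hypothetical helper for illustration. It leaves PATH unchanged if the
    directory is already listed, and returns the resulting PATH string.
    """
    env = os.environ if env is None else env
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        env["PATH"] = os.pathsep.join([directory] + parts)
    return env.get("PATH", "")


if __name__ == "__main__":
    # Assumed location, copied from the comment above; adjust to your install.
    poppler_bin = r"C:\Program Files (x86)\Release-23.05.0-0\poppler-23.05.0\Library\bin"
    prepend_to_path(poppler_bin)
    # Prints the resolved path if pdftotext is now visible, otherwise None.
    print(shutil.which("pdftotext"))
```

The change lasts only for the running process, so it has to happen before the first textract.process call in the same script.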