deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License
3.89k stars, 599 forks

PDF extract failed! #248

Open Lswx2017 opened 6 years ago

Lswx2017 commented 6 years ago

When I extract text from a PDF, it outputs:

    Traceback (most recent call last):
      File "/usr/bin/textract", line 32, in <module>
        main()
      File "/usr/bin/textract", line 25, in main
        output = process(**vars(args))
      File "/usr/lib/python2.7/site-packages/textract/parsers/__init__.py", line 77, in process
        return parser.process(filename, encoding, **kwargs)
      File "/usr/lib/python2.7/site-packages/textract/parsers/utils.py", line 47, in process
        unicode_string = self.decode(byte_string)
      File "/usr/lib/python2.7/site-packages/textract/parsers/utils.py", line 65, in decode
        return text.decode(result['encoding'])
    TypeError: decode() argument 1 must be string, not None

1.pdf

matt32106 commented 5 years ago

Could this be the same as https://github.com/deanmalmgren/textract/issues/107? It looks like pip install chardet==2.1.1 can solve the problem for Python 2.

mpena2099 commented 5 years ago

Same error here, but NOT for all my PDF files.

Python 3.6.5, textract==1.6.1, chardet==2.3.0

mpena2099 commented 5 years ago

"chardet.detect(text)" (utils.py, 64) returns {'encoding': None, 'confidence': 0.0}
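The failure can be reproduced without a PDF: when detection yields an encoding of None, passing that value to decode() raises the TypeError from the original report. Below is a minimal sketch of a guarded decode. The helper name decode_with_fallback and the UTF-8 fallback are assumptions for illustration, not textract's actual code; the dict only mimics the shape of the chardet result quoted above.

```python
def decode_with_fallback(byte_string, detection, fallback="utf-8"):
    """Decode bytes using a chardet-style result dict, tolerating a None encoding.

    `detection` mimics the shape of the result quoted in this thread, e.g.
    {'encoding': None, 'confidence': 0.0}. The fallback encoding is an assumption.
    """
    encoding = detection.get("encoding") or fallback
    # errors="replace" substitutes undecodable bytes instead of raising
    return byte_string.decode(encoding, errors="replace")


# The detection result reported for the failing PDF
failed_detection = {"encoding": None, "confidence": 0.0}
sample = "caf\u00e9".encode("utf-8")
print(decode_with_fallback(sample, failed_detection))
```

Falling back silently can hide a genuinely wrong encoding, so logging the fallback may be worthwhile in real use.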

SatyaRamGV commented 5 years ago

text = textract.process(file, method='pdfminer')

Error:

    UnboundLocalError                         Traceback (most recent call last)
    in ()
    ----> 1 text = textract.process(file, method='pdfminer')

    ~/.local/lib/python3.6/site-packages/textract/parsers/__init__.py in process(filename, encoding, extension, **kwargs)
         75
         76     parser = filetype_module.Parser()
    ---> 77     return parser.process(filename, encoding, **kwargs)
         78
         79

    ~/.local/lib/python3.6/site-packages/textract/parsers/utils.py in process(self, filename, encoding, **kwargs)
         44         # output encoding
         45         # http://nedbatchelder.com/text/unipain/unipain.html#35
    ---> 46         byte_string = self.extract(filename, **kwargs)
         47         unicode_string = self.decode(byte_string)
         48         return self.encode(unicode_string, encoding)

    ~/.local/lib/python3.6/site-packages/textract/parsers/pdf_parser.py in extract(self, filename, method, **kwargs)
         29
         30         elif method == 'pdfminer':
    ---> 31             return self.extract_pdfminer(filename, **kwargs)
         32         elif method == 'tesseract':
         33             return self.extract_tesseract(filename, **kwargs)

    ~/.local/lib/python3.6/site-packages/textract/parsers/pdf_parser.py in extract_pdfminer(self, filename, **kwargs)
         46     def extract_pdfminer(self, filename, **kwargs):
         47         """Extract text from pdfs using pdfminer."""
    ---> 48         stdout, _ = self.run(['pdf2txt.py', filename])
         49         return stdout
         50

    ~/.local/lib/python3.6/site-packages/textract/parsers/utils.py in run(self, args)
         94         # pipe.wait() ends up hanging on large files. using
         95         # pipe.communicate appears to avoid this issue
    ---> 96         stdout, stderr = pipe.communicate()
         97
         98         # if pipe is busted, raise an error (unlike Fabric)

    UnboundLocalError: local variable 'pipe' referenced before assignment

jpweytjens commented 5 years ago

Looks similar to #261 and #256. Could you try again with textract 1.6.2? This version updates chardet to 3.0.4.

afs25 commented 5 years ago

Hi, I am hitting this error when I try to textract this PDF: Acta_Acustica_High_frequency_mistuning_2018.pdf

I am using textract 1.6.1 (latest version available via pip install) and chardet 3.0.4.

The output of chardet on the same file is "no result":

    $ chardet Acta_Acustica_High_frequency_mistuning_2018.pdf
    Acta_Acustica_High_frequency_mistuning_2018.pdf: no result

UPDATE: @jpweytjens, I just saw your instructions on how to install a more recent textract in #261, so I tried again after installing textract 1.6.3. The error is exactly the same:

    $ textract Acta_Acustica_High_frequency_mistuning_2018.pdf
    Traceback (most recent call last):
      File "/home/asartori/Dropbox/OSC/manuscript_version_detection/venv/bin/textract", line 33, in <module>
        main()
      File "/home/asartori/Dropbox/OSC/manuscript_version_detection/venv/bin/textract", line 25, in main
        output = process(**vars(args))
      File "/home/asartori/Dropbox/OSC/manuscript_version_detection/venv/lib/python3.6/site-packages/textract/parsers/__init__.py", line 77, in process
        return parser.process(filename, encoding, **kwargs)
      File "/home/asartori/Dropbox/OSC/manuscript_version_detection/venv/lib/python3.6/site-packages/textract/parsers/utils.py", line 47, in process
        unicode_string = self.decode(byte_string)
      File "/home/asartori/Dropbox/OSC/manuscript_version_detection/venv/lib/python3.6/site-packages/textract/parsers/utils.py", line 65, in decode
        return text.decode(result['encoding'])
    TypeError: decode() argument 1 must be str, not None

UPDATE 2: Just for completeness, textract does not raise an error if I use the pdfminer method, but it returns a bytes object rather than a string:

    >>> text = textract.process("Acta_Acustica_High_frequency_mistuning_2018.pdf", method="pdfminer")
    >>> text[0:100]
    b'(cid:1)(cid:3)(cid:14)(cid:1) (cid:1)(cid:3)(cid:15)(cid:13)(cid:14)(cid:9)(cid:3)(cid:1) (cid:15)(c'
    >>> type(text)
    <class 'bytes'>

jpweytjens commented 5 years ago

@afs25 I'm aware that textract returns bytes objects where it should return strings. In the meantime, you can decode the textract output yourself with the appropriate encoding:

text = textract.process("Acta.pdf", method="pdfminer").decode("utf8")
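The workaround above can be wrapped so that calling code works whether it receives bytes or str. This is only a sketch: to_text is a hypothetical helper, UTF-8 is assumed as in the one-liner above, and the sample bytes stand in for the value returned by textract.process.

```python
def to_text(output, encoding="utf-8"):
    """Return `output` as str, decoding it first if it is a bytes object."""
    if isinstance(output, bytes):
        return output.decode(encoding)
    return output


# Stand-in for the bytes that textract.process(...) returned in this thread
raw = b"(cid:1)(cid:3)(cid:14)"
print(type(to_text(raw)))
```

A str input is returned unchanged, so the helper would keep working if the library's return type changed.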

As for the chardet failure, I'm currently far away from any computer. Feel free to ping me again in 2 weeks if I haven't fixed these issues by then.


yeshanliu commented 5 years ago

Hello, sir. Is there any solution right now?

erosennin commented 5 years ago

With pdftotext, there is absolutely no need to guess the encoding with chardet, because pdftotext always outputs UTF-8, unless specified otherwise with the -enc option:

$ man pdftotext|grep -C3 UTF-8
             Generate an XHTML file containing bounding box information for each block, line, and word in the file.

       -enc encoding-name
              Sets the encoding to use for text output. This defaults to "UTF-8".

       -listenc
              Lits the available encodings

Please stop using chardet with pdftotext and just treat the output as valid UTF-8.

Your users would be very thankful. :)
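A sketch of what this suggestion could look like in practice: call pdftotext with the encoding made explicit and decode the result as UTF-8, with no detection step. The function names are hypothetical and this is not textract's implementation; it assumes pdftotext is installed and on the PATH.

```python
import subprocess


def pdftotext_command(pdf_path):
    """Build a pdftotext invocation that writes UTF-8 text to stdout."""
    # "-enc UTF-8" makes the documented default explicit; "-" sends output to stdout
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path):
    """Run pdftotext and decode its output as UTF-8 without guessing."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8")
```

For example, extract_text("1.pdf") would return a str for the PDF attached to the original report, provided pdftotext can read it.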

filipopo commented 4 years ago

What about the other methods, e.g. do pdfminer or tesseract always return UTF-8? Should we rely on the chardet detection inside the textract package, or run chardet ourselves, like this?

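from chardet import detect
from textract import process

# OCR the PDF with tesseract using the Serbian Cyrillic and Latin language data
text = process("file.pdf", method="tesseract", language="srp+srp_latn")
# chardet may report no encoding (None); fall back to UTF-8 in that case
encoding = detect(text)["encoding"] or "utf-8"
print(text.decode(encoding, errors="replace"))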

pdftotext works well only for simple PDFs. pdfminer and tesseract work better for my file, but neither returns correct results. I don't know how to debug tesseract, since it doesn't directly support PDFs; textract uses pdftoppm, right? Complaining here makes no sense if I can't make it work with just the underlying tools.
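One way to debug outside textract is to run the two stages by hand and inspect the intermediate images. The sketch below only builds and prints the commands. It assumes, as the comment above suggests, that the PDF is rasterized with pdftoppm before OCR; the function names, the 300 dpi resolution, and the page-1.png file name are illustrative assumptions (pdftoppm pads page numbers depending on the page count).

```python
import shlex


def pdftoppm_command(pdf_path, image_prefix, dpi=300):
    """Rasterize each PDF page to PNG files named from image_prefix."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, image_prefix]


def tesseract_command(image_path, languages="srp+srp_latn"):
    """OCR a single image and write the recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", languages]


# Print shell-ready commands so each stage can be run and inspected by hand.
# "page-1.png" is an assumed output name; check what pdftoppm actually wrote.
print(" ".join(shlex.quote(arg) for arg in pdftoppm_command("file.pdf", "page")))
print(" ".join(shlex.quote(arg) for arg in tesseract_command("page-1.png")))
```

If the rasterized images already look wrong, the problem lies before OCR; if they look fine but the text is garbled, the tesseract language data or resolution is the more likely culprit.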