Open Lswx2017 opened 6 years ago
Could this be the same as https://github.com/deanmalmgren/textract/issues/107? It looks like pip install chardet==2.1.1 can solve the problem for Python 2.
Same error here, but NOT for all my PDF files.
Python 3.6.5 textract==1.6.1 chardet==2.3.0
"chardet.detect(text)" (utils.py, 64) returns {'encoding': None, 'confidence': 0.0}
text = textract.process(file, method='pdfminer')
Error: UnboundLocalError Traceback (most recent call last)
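For anyone hitting the same thing: a minimal sketch of guarding against a detection result whose encoding is None, like the one reported above. This is not textract's actual code; the helper name decode_detected and the UTF-8 fallback are assumptions made for illustration, and the detection dict is passed in so the snippet runs without chardet installed.

```python
def decode_detected(byte_string, detection, fallback="utf-8"):
    """Decode bytes using a chardet-style result dict.

    `detection` is expected to look like {'encoding': ..., 'confidence': ...}.
    If the detector gave up (encoding is None), use `fallback` instead of
    passing None to bytes.decode(), which raises a TypeError.
    """
    encoding = detection.get("encoding") or fallback
    # errors="replace" keeps the call from raising on undecodable bytes
    return byte_string.decode(encoding, errors="replace")


# The result reported above for the failing PDF
result = {"encoding": None, "confidence": 0.0}
print(decode_detected(b"plain text", result))
```

The fallback is a guess, of course; whether UTF-8 is the right default depends on which backend produced the bytes.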
Looks similar to #261 and #256. Could you try again with textract 1.6.2? This version updates chardet to 3.0.4.
Hi, I am hitting this error when I try to textract this PDF: Acta_Acustica_High_frequency_mistuning_2018.pdf
I am using textract 1.6.1 (latest version available via pip install) and chardet 3.0.4.
The output of chardet on the same file is "no result":
$ chardet Acta_Acustica_High_frequency_mistuning_2018.pdf
Acta_Acustica_High_frequency_mistuning_2018.pdf: no result
UPDATE: @jpweytjens, just saw your instruction on how to install a more recent textract on #261, so I tried again after installing textract 1.6.3. The error is exactly the same:
$ textract Acta_Acustica_High_frequency_mistuning_2018.pdf
Traceback (most recent call last):
File "/home/asartori/Dropbox/OSC/manuscript_version_detection/venv/bin/textract", line 33, in
UPDATE 2: Just for completeness, textract does not run into errors if I use method pdfminer, but it returns a bytes object rather than a string:
>>> text = textract.process("Acta_Acustica_High_frequency_mistuning_2018.pdf", method="pdfminer")
>>> text[0:100]
b'(cid:1)(cid:3)(cid:14)(cid:1) (cid:1)(cid:3)(cid:15)(cid:13)(cid:14)(cid:9)(cid:3)(cid:1) (cid:15)(c'
>>> type(text)
<class 'bytes'>
@afs25 I'm aware that textract returns bytes objects where it should be returning strings instead. In the meantime, you can decode the textract output with the appropriate encoding:
text = textract.process("Acta.pdf", method="pdfminer").decode("utf8")
As for the failure with chardet, I'm currently far away from any computer. Feel free to ping me again in 2 weeks if I haven't fixed these issues by then.
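A small sketch of the decode workaround, written so it keeps working whether the caller receives bytes or a string. The helper name to_text is hypothetical, and UTF-8 is assumed as the default, as in the one-liner suggested in this thread.

```python
def to_text(output, encoding="utf-8"):
    """Return `output` as str, decoding it first if it is a bytes object."""
    if isinstance(output, bytes):
        # errors="replace" avoids a UnicodeDecodeError on stray bytes
        return output.decode(encoding, errors="replace")
    return output


# Sample taken from the pdfminer output quoted earlier in the thread
sample = b"(cid:1)(cid:3)(cid:14)(cid:1)"
print(type(to_text(sample)))
print(to_text(sample))
```

Note that decoding only fixes the type. The (cid:N) tokens in the sample come through unchanged, so the extraction problem itself is still there.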
Hello, Sir. Any solution right now?
With pdftotext, there is absolutely no need to guess the encoding with chardet, because pdftotext always outputs UTF-8, unless specified otherwise with the -enc option:
$ man pdftotext|grep -C3 UTF-8
Generate an XHTML file containing bounding box information for each block, line, and word in the file.
-enc encoding-name
Sets the encoding to use for text output. This defaults to "UTF-8".
-listenc
Lists the available encodings
Please stop using chardet with pdftotext and just treat the output as valid UTF-8.
Your users would be very thankful. :)
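To illustrate the suggestion, here is a rough sketch that calls pdftotext directly and decodes its output as UTF-8 with no detection step. It is not how textract is implemented; the function names are made up for this example, and it assumes the pdftotext binary is on PATH. Passing -enc UTF-8 only makes the documented default explicit, and "-" as the output file sends the text to stdout.

```python
import subprocess


def build_pdftotext_cmd(pdf_path):
    """Build the pdftotext command line (hypothetical helper)."""
    # "-enc UTF-8" states the default explicitly; "-" writes to stdout
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def pdftotext_to_str(pdf_path):
    """Run pdftotext and decode its output as UTF-8, without chardet."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8")


print(build_pdftotext_cmd("1.pdf"))
```

Usage would be something like text = pdftotext_to_str("1.pdf"), which returns a str directly.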
What about other methods, e.g. do pdfminer or tesseract always return UTF-8? Should we rely on the chardet detection inside the textract package, or run chardet on the output ourselves, like this:
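from textract import process
from chardet import detect

text = process("file.pdf", method="tesseract", language="srp+srp_latn")
# detect() can return None as the encoding; fall back to UTF-8 in that case
encoding = detect(text)["encoding"] or "utf-8"
print(text.decode(encoding))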
pdftotext works well only for simple PDFs. pdfminer and tesseract work better for my file, but neither really returns correct results. I don't know how I should debug tesseract, as it doesn't directly support PDFs; textract uses pdftoppm, right? Complaining here makes no sense if I can't make it work with just the underlying tools.
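One way to test the underlying tools in isolation is to run them by hand. The sketch below assumes the pipeline is pdftoppm (PDF pages to PNG images) followed by tesseract on each image; whether textract does exactly this is not confirmed in this thread. The function names, the 300 dpi default, and the "page" prefix are choices made for illustration, both binaries must be on PATH, and the UTF-8 decode reflects my understanding of tesseract's text output rather than anything verified here.

```python
import glob
import os
import subprocess
import tempfile


def pdftoppm_cmd(pdf_path, prefix, dpi=300):
    """Command that renders each PDF page to <prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, prefix]


def tesseract_cmd(image_path, languages):
    """Command that OCRs one image and prints the text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", languages]


def ocr_pdf(pdf_path, languages="srp+srp_latn"):
    """Render the PDF to images, OCR each page, and join the results."""
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, "page")
        subprocess.run(pdftoppm_cmd(pdf_path, prefix), check=True)
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order
        for image in sorted(glob.glob(prefix + "*.png")):
            out = subprocess.run(
                tesseract_cmd(image, languages),
                stdout=subprocess.PIPE,
                check=True,
            )
            pages.append(out.stdout.decode("utf-8"))
    return "\n".join(pages)


print(tesseract_cmd("page-1.png", "srp+srp_latn"))
```

If the text is already wrong at this level, the problem lies in the rendering or in the OCR language data rather than in textract's decoding.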
When I extract text from a PDF, it outputs:
Traceback (most recent call last):
File "/usr/bin/textract", line 32, in
main()
File "/usr/bin/textract", line 25, in main
output = process(**vars(args))
File "/usr/lib/python2.7/site-packages/textract/parsers/__init__.py", line 77, in process
return parser.process(filename, encoding, **kwargs)
File "/usr/lib/python2.7/site-packages/textract/parsers/utils.py", line 47, in process
unicode_string = self.decode(byte_string)
File "/usr/lib/python2.7/site-packages/textract/parsers/utils.py", line 65, in decode
return text.decode(result['encoding'])
TypeError: decode() argument 1 must be string, not None
1.pdf