Closed kanika102 closed 5 years ago
I'm also experiencing this when trying to extract text from a PDF, with Python 3.7.2 on Mac OS X:
Traceback (most recent call last):
  File "/Users/tekumara/.local/share/virtualenvs/textract/bin/textract", line 32, in <module>
    main()
  File "/Users/tekumara/.local/share/virtualenvs/textract/bin/textract", line 25, in main
    output = process(**vars(args))
  File "/Users/tekumara/.local/share/virtualenvs/textract/lib/python3.7/site-packages/textract/parsers/__init__.py", line 77, in process
    return parser.process(filename, encoding, **kwargs)
  File "/Users/tekumara/.local/share/virtualenvs/textract/lib/python3.7/site-packages/textract/parsers/utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)
  File "/Users/tekumara/.local/share/virtualenvs/textract/lib/python3.7/site-packages/textract/parsers/txt_parser.py", line 9, in extract
    return stream.read()
  File "/Users/tekumara/.local/share/virtualenvs/textract/bin/../lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 1095: invalid start byte
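My PDF file had no extension, and the traceback shows it was handled by txt_parser.py, so it was apparently being read as plain text. Giving it a .pdf extension resolved the issue.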
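For anyone hitting the same thing, here is a minimal sketch of how to check whether an extensionless file is really a PDF before passing it to textract. It uses only the standard library and looks for the "%PDF-" signature at the start of the file. The helper names sniff_extension and suggested_path are hypothetical and not part of textract.

```python
from pathlib import Path
from typing import Optional

# PDF files begin with this signature, followed by the version (e.g. "%PDF-1.4").
PDF_MAGIC = b"%PDF-"


def sniff_extension(path: str) -> Optional[str]:
    """Return ".pdf" if the file starts with the PDF signature, else None.

    Hypothetical helper for illustration; it only recognises PDFs.
    """
    with open(path, "rb") as fh:
        head = fh.read(len(PDF_MAGIC))
    return ".pdf" if head == PDF_MAGIC else None


def suggested_path(path: str) -> Path:
    """Return the path with ".pdf" appended if it has no suffix but looks like a PDF.

    Nothing is renamed on disk; the caller decides whether to rename or copy.
    """
    p = Path(path)
    if p.suffix == "" and sniff_extension(path) == ".pdf":
        return p.with_name(p.name + ".pdf")
    return p
```

Renaming or copying the file to the suggested path should have the same effect as adding the extension by hand. The textract documentation also describes an extension argument to textract.process for files without an extension, but I have not verified that against every version, so treat it as something to check.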
@k1995anika Can you provide a test file?
I'm closing this issue due to inactivity. If you still encounter the issue with the latest version of textract, feel free to leave a comment with additional information and I'll reopen the issue.
I'm getting a similar error: 'utf-8' codec can't decode byte 0x90 in position 11: invalid start byte