deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

Textract UnicodeDecodeError #277

Closed kanika102 closed 5 years ago

kanika102 commented 5 years ago

'utf-8' codec can't decode byte 0x90 in position 11: invalid start byte

tekumara commented 5 years ago

Also experiencing this trying to extract a PDF, Python 3.7.2 on Mac OS X:

Traceback (most recent call last):
  File "/Users/tekumara/.local/share/virtualenvs/textract/bin/textract", line 32, in <module>
    main()
  File "/Users/tekumara/.local/share/virtualenvs/textract/bin/textract", line 25, in main
    output = process(**vars(args))
  File "/Users/tekumara/.local/share/virtualenvs/textract/lib/python3.7/site-packages/textract/parsers/__init__.py", line 77, in process
    return parser.process(filename, encoding, **kwargs)
  File "/Users/tekumara/.local/share/virtualenvs/textract/lib/python3.7/site-packages/textract/parsers/utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)
  File "/Users/tekumara/.local/share/virtualenvs/textract/lib/python3.7/site-packages/textract/parsers/txt_parser.py", line 9, in extract
    return stream.read()
  File "/Users/tekumara/.local/share/virtualenvs/textract/bin/../lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 1095: invalid start byte
pip freeze
```
argcomplete==1.8.2
beautifulsoup4==4.5.3
chardet==2.3.0
docx2txt==0.6
EbookLib==0.15
lxml==4.3.3
Pillow==6.0.0
pocketsphinx==0.1.3
python-pptx==0.6.5
six==1.10.0
SpeechRecognition==3.6.3
textract==1.6.1
xlrd==1.0.0
XlsxWriter==1.1.7
```
locale
```
LANG="en_AU.UTF-8"
LC_COLLATE="en_AU.UTF-8"
LC_CTYPE="en_AU.UTF-8"
LC_MESSAGES="en_AU.UTF-8"
LC_MONETARY="en_AU.UTF-8"
LC_NUMERIC="en_AU.UTF-8"
LC_TIME="en_AU.UTF-8"
LC_ALL="en_AU.UTF-8"
```

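The traceback ends in `txt_parser.py`, which suggests the file was being read as plain UTF-8 text. The sketch below, using only the standard library, shows why that fails on binary data: a byte such as 0x9c is a UTF-8 continuation byte and cannot start a character, so strict decoding raises the same kind of error. The sample bytes and the `try_decode` helper are made up for illustration; they are not taken from the reporter's file or from textract.

```python
# Hypothetical sample: a PDF-like header followed by a stray binary byte.
sample = b"%PDF-1.4\n\x9c binary stream data"


def try_decode(data: bytes) -> str:
    """Return the decoded text, or the error message if strict decoding fails."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"UnicodeDecodeError: {exc}"


message = try_decode(sample)
print(message)
```

Decoding with `errors="replace"` would avoid the exception, but for a PDF it would only yield unreadable text, so it is not a real fix here.
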
tekumara commented 5 years ago

My PDF file had no extension. Giving it an extension of .pdf resolved the issue.
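
The fix above fits the traceback: without an extension, the file ended up in `txt_parser`, which reads it as text. When renaming the file is inconvenient, one possible workaround is to sniff the `%PDF-` signature and pick the extension from that. This is a minimal sketch: `guess_extension` and `extract_text` are hypothetical helper names, the fallback to `.txt` is an illustrative choice, and the `extension` keyword of `textract.process` is assumed to exist in the installed version (check its documentation before relying on it).

```python
import os
import tempfile

PDF_MAGIC = b"%PDF-"  # PDF files normally begin with this signature


def guess_extension(path: str) -> str:
    """Return the file's own extension, or sniff one if it has none."""
    _, ext = os.path.splitext(path)
    if ext:
        return ext.lower()
    with open(path, "rb") as stream:
        head = stream.read(len(PDF_MAGIC))
    return ".pdf" if head == PDF_MAGIC else ".txt"


def extract_text(path: str) -> bytes:
    """Hypothetical wrapper; assumes textract.process accepts `extension`."""
    import textract  # imported lazily so the sniffing helper works without it

    return textract.process(path, extension=guess_extension(path))


# Small self-check with a throwaway file that has no extension.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "report")
    with open(sample_path, "wb") as stream:
        stream.write(b"%PDF-1.4\n\x9c")
    sniffed = guess_extension(sample_path)

print(sniffed)
```

Only the sniffing helper is exercised here; `extract_text` is left uncalled because it depends on textract and its system dependencies being installed.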

jpweytjens commented 5 years ago

@k1995anika Can you provide a test file?

jpweytjens commented 5 years ago

I'm closing this issue due to inactivity. If you still encounter the issue with the latest version of textract, feel free to leave a comment with additional information and I'll reopen the issue.