brendonh / pyth

Python text markup and conversion
MIT License

Unicode error when reading RTF #42

Open pombredanne opened 7 years ago

pombredanne commented 7 years ago

When trying to read https://www.gnu.org/licenses/lgpl.rtf I get:

>>> b=Rtf15Reader.read(open('lgpl.rtf', 'rb'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pom/tmp/local/lib/python2.7/site-packages/pyth/plugins/rtf15/reader.py", line 86, in read
    return reader.go()
  File "/home/pom/tmp/local/lib/python2.7/site-packages/pyth/plugins/rtf15/reader.py", line 109, in go
    self.parse()
  File "/home/pom/tmp/local/lib/python2.7/site-packages/pyth/plugins/rtf15/reader.py", line 143, in parse
    self.group.handle(control, digits)
  File "/home/pom/tmp/local/lib/python2.7/site-packages/pyth/plugins/rtf15/reader.py", line 402, in handle
    handler(digits)
  File "/home/pom/tmp/local/lib/python2.7/site-packages/pyth/plugins/rtf15/reader.py", line 521, in handle_ansi_escape
    char = chr(code).decode(self.charset, self.reader.errors)
UnicodeDecodeError: 'cp932' codec can't decode byte 0x81 in position 0: incomplete multibyte sequence

pombredanne commented 7 years ago

@brendonh Are you still maintaining this repo?

hongtaicao commented 7 years ago

I got the same error when parsing the following RTF file: 00938.rtf.docx

I added the .docx extension only so that the file could be uploaded.

Log:

Traceback (most recent call last):
  File "C:\test\rtf.py", line 12, in <module>
    doc = Rtf15Reader.read(open(pathname, 'r'))
  File "C:\Program Files\Python27\lib\site-packages\pyth\plugins\rtf15\reader.py", line 86, in read
    return reader.go()
  File "C:\Program Files\Python27\lib\site-packages\pyth\plugins\rtf15\reader.py", line 109, in go
    self.parse()
  File "C:\Program Files\Python27\lib\site-packages\pyth\plugins\rtf15\reader.py", line 143, in parse
    self.group.handle(control, digits)
  File "C:\Program Files\Python27\lib\site-packages\pyth\plugins\rtf15\reader.py", line 402, in handle
    handler(digits)
  File "C:\Program Files\Python27\lib\site-packages\pyth\plugins\rtf15\reader.py", line 521, in handle_ansi_escape
    char = chr(code).decode(self.charset, self.reader.errors)
UnicodeDecodeError: 'gbk' codec can't decode byte 0xc8 in position 0: incomplete multibyte sequence

alantygel commented 5 years ago

Same here: 'cp950' codec can't decode byte 0xfa in position 0: incomplete multibyte sequence
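
All three reports involve a multibyte code page (cp932, gbk, cp950), and each traceback ends in `chr(code).decode(self.charset, ...)`, which decodes a single byte at a time. A lead byte such as 0x81 is not a complete character in these encodings, so decoding it alone fails. The sketch below illustrates one possible workaround under that assumption: collect consecutive `\'hh` escapes and decode them together. It is written for Python 3, `decode_hex_escapes` is a hypothetical helper and not part of pyth, and it does not handle trail bytes that an RTF writer emitted as literal characters.

```python
import re

# Matches a run of one or more RTF hex escapes such as \'82\'a0
_HEX_RUN = re.compile(r"(?:\\'[0-9a-fA-F]{2})+")
_HEX_ONE = re.compile(r"\\'([0-9a-fA-F]{2})")


def decode_hex_escapes(text, charset, errors="replace"):
    """Decode each run of \\'hh escapes as one byte string.

    Hypothetical helper for illustration only: decoding the whole run at
    once lets a multibyte codec see the lead byte and trail byte together.
    """
    def _decode_run(match):
        raw = bytes(int(h, 16) for h in _HEX_ONE.findall(match.group(0)))
        return raw.decode(charset, errors)

    return _HEX_RUN.sub(_decode_run, text)


# Decoding the lead byte on its own reproduces the reported failure.
try:
    bytes([0x81]).decode("cp932")
except UnicodeDecodeError as exc:
    print("single byte fails:", exc.reason)

# Decoding the two-byte sequence together succeeds.
print(decode_hex_escapes(r"\'82\'a0 abc", "cp932"))
```

Whether this fits inside pyth's reader depends on how `handle_ansi_escape` is called, which the tracebacks above do not show.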