eastein / mediorc

mediocre irc bot library

unicode parsing issues - UnicodeDecodeError inside irc.buffer #6

Open eastein opened 9 years ago

eastein commented 9 years ago

Traceback (most recent call last):
  File "./chronbot", line 304, in <module>
    s.run()
  File "/home/eastein/newer_venv/local/lib/python2.7/site-packages/mediorc/__init__.py", line 101, in run
    self.client.ircobj.process_once(0.2)
  File "/home/eastein/newer_venv/local/lib/python2.7/site-packages/irc/client.py", line 261, in process_once
    self.process_data(i)
  File "/home/eastein/newer_venv/local/lib/python2.7/site-packages/irc/client.py", line 218, in process_data
    c.process_data()
  File "/home/eastein/newer_venv/local/lib/python2.7/site-packages/irc/client.py", line 575, in process_data
    for line in self.buffer:
  File "/home/eastein/newer_venv/local/lib/python2.7/site-packages/irc/buffer.py", line 94, in lines
    self.handle_exception()
  File "/home/eastein/newer_venv/local/lib/python2.7/site-packages/irc/buffer.py", line 92, in lines
    yield line.decode(self.encoding, self.errors)
  File "/home/eastein/newer_venv/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x94 in position 174: invalid start byte

eastein commented 9 years ago

This may be something wrong with the irc module at 10.1

eastein commented 9 years ago

Try setting irc.client.ServerConnection.buffer_class = irc.buffer.LenientDecodingLineBuffer to avoid issues like this.
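
For context, the idea behind a lenient decoding buffer can be sketched with the standard library alone: try strict UTF-8 first, and fall back to an encoding that accepts every byte. This is only an illustration of the approach, not the irc library's actual code; `lenient_decode` is a hypothetical helper name, and the latin-1 fallback is an assumption made for the sketch.

```python
def lenient_decode(line, encoding="utf-8", fallback="latin-1"):
    """Decode bytes strictly first; fall back to latin-1 if that fails.

    Hypothetical helper for illustration only.
    """
    try:
        return line.decode(encoding, "strict")
    except UnicodeDecodeError:
        # latin-1 maps every byte value to a code point, so this cannot fail,
        # although the result may be mojibake rather than the intended text.
        return line.decode(fallback)


valid = b"caf\xc3\xa9"          # well-formed UTF-8
broken = b"bad byte: \x94"      # 0x94 is not a valid UTF-8 start byte

decoded_valid = lenient_decode(valid)
decoded_broken = lenient_decode(broken)
```

The trade-off is that a line containing a single bad byte is decoded entirely as latin-1, so any valid multi-byte UTF-8 characters in the same line come out garbled instead of raising.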

eastein commented 9 years ago

This Unicode text seems to successfully break decoding:

d̵́͢҉̩̟̜̹͔͇ẁ̠̣̞̞̪͘̕͞ĺ̦̼̙̳̬͞o̷̸̻̫̙̼̤͈͟͠c̶̡̬̖̮̯̳̱̟̕k̸̛̛̛̜̠͎̮͍͝s̢͙̖̘̦̼̻̟̕͠,̴͞҉̶̢̞̱̘̩̻ ̸̨͔͍͈̖͕͟͟͝h̴̥͍͕͇͕̜͈́͞e͏̴̮͖̟̖͔̗̲͝ ͏̰͚͙̹͕̬̼͘͞ç̴̵̞͔͎̳͕͟ͅo̡҉̨̥̘͇̱�

eastein commented 4 years ago

  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/eastein/venvs/andreybot/lib/python2.7/site-packages/andrey_bot/run.py", line 191, in <module>
    s.run()
  File "/home/eastein/venvs/andreybot/local/lib/python2.7/site-packages/mediorc/__init__.py", line 104, in run
    self.client.ircobj.process_once(0.2)
  File "/home/eastein/venvs/andreybot/local/lib/python2.7/site-packages/irc/client.py", line 244, in process_once
    self.process_data(i)
  File "/home/eastein/venvs/andreybot/local/lib/python2.7/site-packages/irc/client.py", line 201, in process_data
    c.process_data()
  File "/home/eastein/venvs/andreybot/local/lib/python2.7/site-packages/irc/client.py", line 572, in process_data
    for line in self.buffer:
  File "/home/eastein/venvs/andreybot/local/lib/python2.7/site-packages/irc/buffer.py", line 96, in lines
    self.handle_exception()
  File "/home/eastein/venvs/andreybot/local/lib/python2.7/site-packages/irc/buffer.py", line 94, in lines
    yield line.decode(self.encoding, self.errors)
  File "/home/eastein/venvs/andreybot/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 508-509: unexpected end of data

Still running irc==10.1. This is a similar problem, though not the same one. I don't have the exact (seemingly invalid) Unicode string that triggered it.

eastein commented 4 years ago

Exception message from the most recent failure:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 508-509: unexpected end of data

I had modified /home/eastein/venvs/andreybot/local/lib/python2.7/site-packages/irc/buffer.py on the line before the call to line.decode to print a repr of the line variable. This is that repr:

':bjonnh[m]!bjonnhmatr@gateway/shell/matrix.org/x-ecjopyftwrirlcxi PRIVMSG #pumpingstationone :""The relentless pressure on TikTok ramped up further this week, with U.S. Secretary of State Mike Pompeo again claiming user data is sent to to China. \xe2\x80\x9cIt\xe2\x80\x99s not possible to have your personal information flow across a Chinese server,\xe2\x80\x9d he warned during a British media interview, suggesting that data would \xe2\x80\x9cend up in the hands of the Chinese Cmmunist Party,\xe2\x80\x9d which he characterized as an \xe2\x80\x9cevil empire.\xe2\x80'

IRC messages are truncated bytewise, without regard to character encoding, to enforce the protocol's maximum message size (512 bytes including the trailing CR-LF). A multi-byte UTF-8 character that straddles the limit is therefore cut in half.

Here is an example of how a 3-byte UTF-8 encoded smart quote crashes the decoder in this way when truncation leaves only its first two bytes (had the cut fallen on a character boundary, it would have decoded fine):

>>> print b'\xe2\x80'.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/eastein/venvs/andreybot/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: unexpected end of data
>>> 
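
The dependence on the cut point can be sketched by truncating the same encoded text at different byte offsets. The sample string and the `decodes_cleanly` helper are made up for illustration; only the standard library is used, and the code is written to run on Python 3 as well.

```python
# "an “evil empire.”" with both smart quotes written as escapes.
text = u"an \u201cevil empire.\u201d"
encoded = text.encode("utf-8")  # each smart quote takes 3 bytes


def decodes_cleanly(data):
    """Return True if data is complete, valid UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Drop 0, 1, 2 and 3 bytes from the end, as a bytewise truncation might.
results = {}
for dropped in range(4):
    cut = len(encoded) - dropped
    results[dropped] = decodes_cleanly(encoded[:cut])
```

Only the cuts that land on a character boundary (dropping 0 or 3 bytes here) decode; dropping 1 or 2 bytes leaves a partial closing quote and raises.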

eastein commented 4 years ago

In my IRC client (irssi) showing the message, the end of the line is shown as:

an “evil empire.��

@asl2 suggested that "passing either ignore or replace to the decoder would fix it". I think it would be appropriate to set errors='replace'.

Example of that operating as expected instead of crashing:

>>> print b'an "evil empire.\xe2\x80'.decode("utf-8", errors='replace')
an "evil empire.�
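
For comparison, here is a small sketch of how the two error handlers mentioned above treat the same truncated bytes. It uses only the standard `bytes.decode` call. The tracebacks show the buffer decoding with `self.errors`, so setting that attribute on the buffer class is presumably how this would be wired into the bot; that wiring is an assumption and is not shown here.

```python
truncated = b'an "evil empire.\xe2\x80'  # ends with 2 of the 3 bytes of a smart quote

# 'replace' substitutes U+FFFD for the undecodable tail, so the damage stays visible.
replaced = truncated.decode("utf-8", "replace")

# 'ignore' silently drops the undecodable tail.
ignored = truncated.decode("utf-8", "ignore")

# The default, 'strict', raises, which is the crash reported in this issue.
try:
    truncated.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

`replace` seems the better fit for a bot that logs or echoes messages, since readers can still see that something was cut off, whereas `ignore` hides it.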

bjonnh commented 3 years ago

Please prioritize this; it is really important for my sanity.