Improved handling of multipart messages using 8bit content-transfer-encoding and stray end tag tolerance

truls commented 5 years ago

I have been getting the following stack trace when opening some messages using urlscan

Traceback (most recent call last):
  File "/usr/bin/urlscan", line 169, in <module>
    dedupe=args.dedupe)
  File "/usr/lib/python3.7/site-packages/urlscan/urlchoose.py", line 131, in __init__
    shorten=self.shorten)
  File "/usr/lib/python3.7/site-packages/urlscan/urlchoose.py", line 64, in process_urls
    for group, usedfirst, usedlast in extractedurls:
  File "/usr/lib/python3.7/site-packages/urlscan/urlscan.py", line 432, in msgurls
    for chunk in msgurls(part, urlidx):
  File "/usr/lib/python3.7/site-packages/urlscan/urlscan.py", line 442, in msgurls
    for chunk in extracthtmlurls(msg):
  File "/usr/lib/python3.7/site-packages/urlscan/urlscan.py", line 410, in extracthtmlurls
    c.feed(s)
  File "/usr/lib64/python3.7/html/parser.py", line 111, in feed
    self.goahead(0)
  File "/usr/lib64/python3.7/html/parser.py", line 163, in goahead
    self.handle_data(unescape(rawdata[i:j]))
  File "/usr/lib/python3.7/site-packages/urlscan/urlscan.py", line 200, in handle_data
    if self.anchor_stack[-1] is None:
IndexError: list index out of range

While the crashes that I experience always seem to originate from that line of code, there seem to be two different root causes both of which are addressed by this PR.

Some HTML message bodies simply contain too many end tags which causes the stacks in HTMLChunker to empty. To avoid this, we guard the stack popping in handle_endtag to prevent the stacks from emptying. This is not a perfect workaround, but it is preferable to crashing.
For message parts using a raw content-transfer-encoding (8bit or binary) we avoid the decoding machinery of the msg_payload function. This is because the msg_payload function will encode raw message bodies containing non-ascii characters using the raw-unicode-escape encoding which cannot be subsequently decoded as utf-8. This causes a UnicodeDecodeError to be thrown in the decode_bytes function which means that the "Unable to deocde message"... is passed on to extracthtmlurls.

firecat53 commented 5 years ago

Great, thanks so much for the work on this!

Do you have or can you create emails that trigger those conditions and send them to me in a tar file so I have something to test against?
Did you also test the PR with Python 2.7? I'm attempting to continue support for 2.7, but that may go away soon...

truls commented 5 years ago

So I did some more testing and the PR works with 2.7 as well. Also, it seems like the encoding issues that I'm addressing only occurs with Python 3. The reason for this seems to be that the Python 3 version of the get_payload function was rewritten to use the raw-unicode-escape encoding as a fallback. See Python 3 and Python 2. So the Python 2 version essentially behaves like this PR, so maybe this is actually a Python 3 bug.

I've attached some emails (urlscan-test-body.tar.gz) that I'm comfortable sharing. The first causes the anchor_stack to empty and the rest triggers the encoding issues. Generating some email bodies that triggers this issue would also be fairly easy.

firecat53 commented 5 years ago

Thanks so much for finding and fixing this!! Worked perfectly.

truls commented 5 years ago

You're welcome and thanks for merging!

firecat53 / urlscan

Improved handling of multipart messages using 8bit content-transfer-encoding and stray end tag tolerance #74