Closed truls closed 5 years ago
Great, thanks so much for the work on this!
Do you have or can you create emails that trigger those conditions and send them to me in a tar file so I have something to test against?
Did you also test the PR with Python 2.7? I'm attempting to continue support for 2.7, but that may go away soon...
So I did some more testing and the PR works with 2.7 as well. Also, it seems like the encoding issues that I'm addressing only occurs with Python 3. The reason for this seems to be that the Python 3 version of the get_payload
function was rewritten to use the raw-unicode-escape
encoding as a fallback. See Python 3 and Python 2. So the Python 2 version essentially behaves like this PR, so maybe this is actually a Python 3 bug.
I've attached some emails (urlscan-test-body.tar.gz)
that I'm comfortable sharing. The first causes the anchor_stack
to empty and the rest triggers the encoding issues. Generating some email bodies that triggers this issue would also be fairly easy.
Thanks so much for finding and fixing this!! Worked perfectly.
You're welcome and thanks for merging!
I have been getting the following stack trace when opening some messages using urlscan
While the crashes that I experience always seem to originate from that line of code, there seem to be two different root causes both of which are addressed by this PR.
HTMLChunker
to empty. To avoid this, we guard the stack popping inhandle_endtag
to prevent the stacks from emptying. This is not a perfect workaround, but it is preferable to crashing.content-transfer-encoding
(8bit
orbinary
) we avoid the decoding machinery of themsg_payload
function. This is because themsg_payload
function will encode raw message bodies containing non-ascii characters using theraw-unicode-escape
encoding which cannot be subsequently decoded asutf-8
. This causes aUnicodeDecodeError
to be thrown in thedecode_bytes
function which means that the"Unable to deocde message"...
is passed on toextracthtmlurls
.