firecat53 / urlscan

Mutt and terminal url selector (similar to urlview)
GNU General Public License v2.0

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 2470: invalid start byte #14

Closed svenXY closed 9 years ago

svenXY commented 9 years ago

Hi,

Sometimes I get the following exception. I'm using Python 3.4 and urlscan 0.7.2 (f97b90e) from mutt:

Traceback (most recent call last):
  File "/usr/bin/urlscan", line 100, in <module>
    msg = parser().parse(sys.stdin)
  File "/usr/lib/python3.4/email/parser.py", line 54, in parse
    data = fp.read(8192)
  File "/usr/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 2470: invalid start byte

Anything I can do here to provide more information?

firecat53 commented 9 years ago

Can you please post a message or message snippet that causes it to fail?

Thanks, Scott

svenXY commented 9 years ago

Date: Wed, 26 Nov 2014 18:46:02 +0100
From: Intranet Team <intranet@example.com>
To: me <me@example.com>
Subject: some nice subject

Hallo,

Schöne Grüße,
dein Orgateam

svenXY commented 9 years ago

spits out (for me here):

> cat message.stripped | urlscan
Traceback (most recent call last):
  File "/usr/bin/urlscan", line 100, in <module>
    msg = parser().parse(sys.stdin)
  File "/usr/lib/python3.4/email/parser.py", line 54, in parse
    data = fp.read(8192)
  File "/usr/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 144: invalid start byte

firecat53 commented 9 years ago

Well, for some reason, the version you posted works just fine for me, in both Python 2.7 and 3.4 (both on Arch). I know I still have some work to do on handling actual UTF-8 characters in URLs, but this seems to be something different. Can you investigate further and see if there's perhaps another hidden character in the message that was lost when you pasted it into GitHub?

svenXY commented 9 years ago

Hi, it seems that copy/pasting the message fixes the problem, which is why you cannot reproduce it.

Running

cat file | urlscan

on the original file, however, still shows the problem.

Can I send you the file by mail or similar?
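
A note for anyone trying to reproduce this: the bytes named in the two tracebacks (0xfc and 0xf6) are "ü" and "ö" in Latin-1 (ISO-8859-1), and neither is a valid UTF-8 start byte. That would explain why the pasted copy works, since pasted text ends up as UTF-8. The sketch below only illustrates this; the Latin-1 encoding of the original file is an assumption, and `decodes_as_utf8` is a hypothetical helper, not part of urlscan.

```python
# Assumption: the original mail is stored as Latin-1 (ISO-8859-1). The thread
# does not confirm this, but 0xf6 and 0xfc are "ö" and "ü" in that encoding.
text = "Schöne Grüße,\n"

latin1_bytes = text.encode("latin-1")  # the file as (presumably) stored on disk
utf8_bytes = text.encode("utf-8")      # what a copy/paste typically produces


def decodes_as_utf8(data):
    """Return True if data is valid UTF-8, False otherwise (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_utf8(utf8_bytes))    # the pasted version
print(decodes_as_utf8(latin1_bytes))  # the original bytes
```

Writing `latin1_bytes` to a file in binary mode and piping that file into urlscan should give a test case that survives being passed around, unlike pasted text.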

firecat53 commented 9 years ago

Absolutely. My email is in the README.

On Dec 5, 2014 12:13 AM, "Sven Hergenhahn" <notifications@github.com> wrote:

> Hi, it seems that copy/pasting it fixes all problems.
>
> Can I send you the file by mail or similar?

svenXY commented 9 years ago

You should have received an email.

firecat53 commented 9 years ago

I received the email and am seeing the same error. I'll keep looking into this.
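
For reference, the traceback shows the failure happening while the text-mode `sys.stdin` decodes its input as UTF-8, before the email parser sees any of it. One common way around that, sketched here under assumptions and not necessarily the fix that went into urlscan, is to hand the parser raw bytes instead. `read_message` is a hypothetical name, and the sample mail is a stand-in for the real one.

```python
import email
import io
import sys


def read_message(stream=None):
    """Parse a mail from a binary stream without decoding it first."""
    if stream is None:
        stream = sys.stdin.buffer  # raw bytes, not the text-mode sys.stdin
    return email.message_from_binary_file(stream)


# Stand-in for the problematic mail: assumed Latin-1, no charset header.
raw = (
    "From: Intranet Team <intranet@example.com>\n"
    "Subject: some nice subject\n"
    "\n"
    "Schöne Grüße,\n"
).encode("latin-1")

msg = read_message(io.BytesIO(raw))
body = msg.get_payload(decode=True)  # bytes; how to decode them is now our choice
print(msg["Subject"])
print(body)
```

With bytes in hand, the caller can try the declared charset first and fall back to something lenient, rather than crashing on the first non-UTF-8 byte.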

firecat53 commented 9 years ago

Please test some more and make sure I actually fixed this for you!

svenXY commented 9 years ago

Hi Scott,

Thanks for fixing it. I did a quick test with the message I sent you and it does indeed work. If I encounter further problems in real life usage, I'll let you know.

Good stuff! Sven

svenXY commented 9 years ago

Hi,

I have a different problem with a multipart mail. It only occurs within mutt, not after saving the whole multipart message, and not after saving only the html and txt parts.

I'm afraid I'm not sure how to give you a test case for this one.

The error was:

Traceback (most recent call last):
  File "/usr/lib/python3.4/email/message.py", line 357, in set_charset
    cte(self)
TypeError: 'str' object is not callable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/urlscan", line 118, in process_input
    msg.set_charset(c)
  File "/usr/lib/python3.4/email/message.py", line 365, in set_charset
    payload = payload.encode('ascii', 'surrogateescape')
AttributeError: 'list' object has no attribute 'encode'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.4/email/message.py", line 357, in set_charset
    cte(self)
TypeError: 'str' object is not callable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/urlscan", line 127, in <module>
    msg = process_input(args.message)
  File "/usr/bin/urlscan", line 121, in process_input
    i.set_charset(c)
  File "/usr/lib/python3.4/email/message.py", line 365, in set_charset
    payload = payload.encode('ascii', 'surrogateescape')
AttributeError: 'list' object has no attribute 'encode'

I'll try to bounce you the message, but please keep it confidential.
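
A note on the traceback above: `AttributeError: 'list' object has no attribute 'encode'` is what `set_charset` raises when the message it is called on is a multipart container, because its payload is a list of sub-messages rather than text. The sketch below shows one way to avoid that, by skipping containers and decoding only the text leaf parts. It is an illustration under assumptions, not urlscan's actual `process_input`; `text_parts` and the sample message are hypothetical.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def text_parts(msg, fallback="utf-8"):
    """Yield the decoded text of every text/* leaf part of msg."""
    for part in msg.walk():
        if part.is_multipart():
            continue  # payload is a list of sub-messages, nothing to decode
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # undoes base64/quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or fallback
        try:
            yield payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset name in the mail
            yield payload.decode(fallback, errors="replace")


# Hypothetical multipart/alternative mail with two differently encoded parts.
msg = MIMEMultipart("alternative")
msg.attach(MIMEText("Schöne Grüße", "plain", "iso-8859-1"))
msg.attach(MIMEText("<p>https://example.com</p>", "html", "utf-8"))

parts = list(text_parts(msg))
print(len(parts))
```

Decoding each part on its own also means one badly labelled part cannot take the whole message down with it.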