Message Parsing Fails with Japanese Characters

blackberryoctopus commented 10 years ago

@caio1982 @cmatsuoka

I just ran the script and did a mini investigation of the issue, but since I'm not a Python expert, someone with more experience can probably solve it more easily.

I believe the issue is related to the Japanese characters and the odd string encoding "\xe3?==?utf-8?Q?" (present in the subject of the email).

I found the email in question by adding a print statement to the function _charset_decoder() in utils.py at line 31.

Here is the subject string: "Fwd: 黒田征太郎氏、from Nina"

Here is the error output from the script:

[('Fwd: \xe9\xbb\x92\xe7\x94\xb0\xe5\xbe\x81\xe5\xa4\xaa\xe9\x83\x8e\xe6\xb0\x8f\xe3?==?utf-8?Q?\x80\x81from Nina', 'utf-8')]
Traceback (most recent call last):
  File "./lpf.py", line 26, in <module>
    imap.lostphotosfound()
  File "/Users/gsogorka/Documents/Code/Lost-Photos-Found/lostphotosfound/server.py", line 150, in lostphotosfound
    header_subject = _charset_decoder(mail['Subject'])
  File "/Users/gsogorka/Documents/Code/Lost-Photos-Found/lostphotosfound/utils.py", line 46, in _charset_decoder
    header = header[0][0].decode(header[0][1]).encode('utf-8')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe3 in position 23: invalid continuation byte
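
The byte string in that output suggests the subject arrived as two adjacent encoded-words that split the three-byte character 、 (E3 80 81) across their boundary, with the boundary marker `?==?utf-8?Q?` left inside the payload. Below is a minimal sketch that reproduces the decode error and shows one possible repair. It is written for Python 3 (the traceback comes from Python 2.7), and `repair_split_encoded_words` and `failing_offset` are hypothetical helpers, not part of Lost-Photos-Found:

```python
import re

# Payload copied from the error output: the bytes E3 | 80 81 of a single
# character are separated by a leftover encoded-word boundary.
RAW = (b'Fwd: \xe9\xbb\x92\xe7\x94\xb0\xe5\xbe\x81\xe5\xa4\xaa'
       b'\xe9\x83\x8e\xe6\xb0\x8f\xe3?==?utf-8?Q?\x80\x81from Nina')

# Matches "?=" immediately followed by "=?charset?Q?" or "=?charset?B?".
BOUNDARY = re.compile(rb'\?==\?[A-Za-z0-9_-]+\?[QqBb]\?')


def repair_split_encoded_words(payload):
    """Drop leftover encoded-word boundaries (hypothetical helper).

    Assumes both halves are already decoded and share one charset.
    """
    return BOUNDARY.sub(b'', payload)


def failing_offset(payload, charset='utf-8'):
    """Return the offset of the first undecodable byte, or None."""
    try:
        payload.decode(charset)
    except UnicodeDecodeError as exc:
        return exc.start
    return None


print(failing_offset(RAW))  # 23, the position reported in the traceback
```

After the repair, `repair_split_encoded_words(RAW).decode('utf-8')` yields the subject as it appears in the mail client. The regular expression is deliberately narrow, so it would not help with headers that mix charsets or encodings between the two halves.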

blackberryoctopus commented 10 years ago

@caio1982 @cmatsuoka any ideas on the above?

caiobegotti commented 10 years ago

Hey there, thanks for testing it with such a scenario. Would you mind forwarding this particular mail to me (you can redact it to remove sensitive info, as long as you keep one image attachment and the subject string!)? My e-mail is my username here AT gmail.com :-)

blackberryoctopus commented 10 years ago

@caio1982 ok, sending it to you now. thanks!

caiobegotti commented 10 years ago

Hey, back with some info. I just checked out the repository in a clean Mac OSX user's dir. This is the end of the whole process (it took a long while):

[...]
Skipping X-GM-MSDID 1463232482041249881
Skipping X-GM-MSDID 1463333516480657883
[Capital Adm Condomínios]: Proposta Administração Condominial
    ...2014-3-24_13-46-17_image001.png
[Caio Begotti <caio1982@gmail.com>]: Fwd: Proposta Administração Condominial
    ...2014-3-24_13-48-52_image001.png
Duplicated attachment /Users/caio1982/LostPhotosFound/caio1982@gmail.com/2014-3-24_13-48-52_image001.png (0eb170710b00dde81fbcbb0771e24b3e91bed1f4)
["Apoio Olhares" <apoio@olhares.com>]: RE: remover conta "caio1982"
    ...2014-3-26_15-28-12_image001.jpg
    ...2014-3-26_15-28-12_image002.jpg
    ...2014-3-26_15-28-12_image003.jpg
[G <blackberryoctopus@gmail.com>]: Fwd: 黒田征太郎氏、from Nina
    ...2014-3-26_12-36-6_sensoudouwa.gif
All done, see directory ~/LostPhotosFound for all the treasure we found for you :-)
~/Lost-Photos-Found ▶

It means something else is wrong on your computer, as your message shows the Japanese characters just fine here :-( Is it Linux? What distribution and version? OS X? Which release? What's your Python version? What are your locale variables? Mine are as follows (the command "env" will tell you that):

LC_ALL=en_US.UTF-8
LANG=en_US.UTF-8
LC_CTYPE=UTF-8
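
To compare two machines, the same settings can also be read from inside Python. This is a minimal sketch for Python 3; `locale_report` is a hypothetical name, and the three variables it reads are just the ones listed above:

```python
import locale
import os
import sys


def locale_report(environ=None):
    """Collect the locale settings that influence text encoding."""
    environ = os.environ if environ is None else environ
    report = {name: environ.get(name, '(unset)')
              for name in ('LC_ALL', 'LC_CTYPE', 'LANG')}
    # What Python itself derives from the environment.
    report['preferred encoding'] = locale.getpreferredencoding(False)
    report['stdout encoding'] = getattr(sys.stdout, 'encoding', None)
    return report


for key, value in locale_report().items():
    print('{}: {}'.format(key, value))
```

Running it on both boxes and diffing the output shows at a glance whether a variable such as LC_ALL is unset on one of them.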

caiobegotti commented 10 years ago

It's me again. I just tested the current code on my Debian Jessie (Testing) box and it worked just like on OS X. Let me know the details of your system as I mentioned earlier, so I can take a better look at it :-)

blackberryoctopus commented 10 years ago

Hi @caio1982 ,

thanks for investigating.

Here are my system environment variables:

Mac OS X 10.9.2 Python 2.7.5

rvm_bin_path=/Users/gsogorka/.rvm/bin
TERM_PROGRAM=Apple_Terminal
LESS_TERMCAP_md=
GEM_HOME=/Users/gsogorka/.rvm/gems/ruby-1.9.3-p429
LESS_TERMCAP_me=
TERM=xterm-color
SHELL=/bin/bash
IRBRC=/Users/gsogorka/.rvm/rubies/ruby-1.9.3-p429/.irbrc
TMPDIR=/var/folders/nl/4ddd8qn5259b3jgwwzph5d88pcjnnn/T/
Apple_PubSub_Socket_Render=/tmp/launch-kPJavK/Render
TERM_PROGRAM_VERSION=326
MY_RUBY_HOME=/Users/gsogorka/.rvm/rubies/ruby-1.9.3-p429
LESS_TERMCAP_ue=
TERM_SESSION_ID=FB9B7C83-525F-4B41-BC30-9C3E1931458E
USER=gsogorka
rvm_path=/Users/gsogorka/.rvm
SSH_AUTH_SOCK=/tmp/launch-BTqzU3/Listeners
__CF_USER_TEXT_ENCODING=0x2CC8D6B5:0:0
LESS_TERMCAP_us=
rvm_prefix=/Users/gsogorka
__CHECKFIX1436934=1
PATH=/Users/gsogorka/.rvm/gems/ruby-1.9.3-p429/bin:/Users/gsogorka/.rvm/gems/ruby-1.9.3-p429@global/bin:/Users/gsogorka/.rvm/rubies/ruby-1.9.3-p429/bin:/Users/gsogorka/.rvm/bin:/usr/local/share/npm/lib/node_modules/coffee-script/bin:/usr/local/sbin:/Users/gsogorka/bin:/usr/local/bin:/Developer/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/opt/X11/bin:/usr/local/MacGPG2/bin
LESSEDIT=mate -l %lm %f
IRCSERVER=irc.freenode.net
PWD=/Users/gsogorka/Box Sync/Product Management
LANG=en_US.UTF-8
IRCNAME=BISSAINTHE
IRCUSER=BISSAINTHE
COM_GOOGLE_CHROME_FRAMEWORK_SERVICE_PROCESS/USERS/GSOGORKA/LIBRARY/APPLICATION_SUPPORT/GOOGLE/CHROME_SOCKET=/tmp/launch-iTHQ8A/ServiceProcessSocket
rvm_env_string=ruby-1.9.3-p429
rvm_version=1.20.10 (stable)
IRCNICK=octoberry
HOME=/Users/gsogorka
SHLVL=1
rvm_ruby_string=ruby-1.9.3-p429
LOGNAME=gsogorka
GEM_PATH=/Users/gsogorka/.rvm/gems/ruby-1.9.3-p429:/Users/gsogorka/.rvm/gems/ruby-1.9.3-p429@global
LESS_TERMCAP_so=
DISPLAY=/tmp/launch-Y5CIIF/org.macosforge.xquartz:0
RUBY_VERSION=ruby-1.9.3-p429
LESS_TERMCAP_se=
_=/usr/bin/env

jhurliman commented 10 years ago

I just tried running this app for the first time and while it successfully retrieved a lot of images, it eventually failed with a similar error.

Traceback (most recent call last):
  File "./lpf.py", line 26, in <module>
    imap.lostphotosfound()
  File "/Users/johnh/Code/Lost-Photos-Found/lostphotosfound/server.py", line 154, in lostphotosfound
    header_subject = _charset_decoder(mail['Subject'])
  File "/Users/johnh/Code/Lost-Photos-Found/lostphotosfound/utils.py", line 42, in _charset_decoder
    header = header[0][0].decode(header[0][1]).encode('utf-8')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd3 in position 76: invalid continuation byte

caiobegotti commented 10 years ago

Uh oh, would you forward me this very same mail, like @blackberryoctopus did with the other one?

blackberryoctopus commented 10 years ago

Hi @caio1982 ,

Were you able to find any similarities between the message I sent and the one @jhurliman identified ?

caiobegotti commented 10 years ago

Not really, I haven't received his forwarded message to debug it yet. I have an idea for a fix/workaround but I'd need the messages first.

jhurliman commented 10 years ago

Apologies for the delayed response. How can I find out which email it is choking on?

caiobegotti commented 10 years ago

Hello @jhurliman, try updating your checkout with git pull and re-running ./lpf.py. I've added a temporary log call that prints the message details to the screen, so it's easier for you to identify which message it is. Once it fails, please look for the original message and forward it to me if that's OK :-)
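
For anyone who wants to find the offending mail without waiting for a new release, the idea can be sketched as follows. This is not the log call that was actually added to the project; `describe_message` and the sample message are hypothetical, and the code targets Python 3:

```python
import email


def describe_message(raw_message):
    """Return the raw identifying headers of a message given as bytes."""
    msg = email.message_from_bytes(raw_message)
    # Headers are returned undecoded, so this step cannot raise a
    # UnicodeDecodeError the way the subject decoding does.
    return {name: str(msg.get(name, ''))
            for name in ('Message-ID', 'Date', 'From', 'Subject')}


SAMPLE = (b'Message-ID: <1@example.com>\n'
          b'From: a@b.com\n'
          b'Subject: =?utf-8?Q?caf=C3=A9?=\n'
          b'\n'
          b'body\n')

for name, value in describe_message(SAMPLE).items():
    print('LOG: [{}] {!r}'.format(name, value))
```

Printing these fields just before the subject is decoded means the last lines on screen identify the message that triggered the traceback.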

blackberryoctopus commented 10 years ago

@caio1982 I just pulled the latest 1.2 repo to my machine and re-ran the lpf.py from scratch. The script still chokes on the message I sent you originally in this thread.

The release note for 1.2 stated that additional log data would be printed for utf related items, but I don't see anything new except for the "LOG: [decoded header] 'a@b.com'" string.

Is there a flag I need to set for more verbose/helpful logging ?

caiobegotti commented 10 years ago

Sorry @blackberryoctopus, but the only difference between our boxes is the two variables missing from your terminal: LC_CTYPE and LC_ALL (which could affect Python somehow, especially LC_ALL, though you got LANG right). My terminal is set to xterm-256color and Unicode UTF-8 in its settings, by the way.

I'm still waiting for the failing message by @jhurliman so I can debug it further!

cmatsuoka commented 7 years ago

Possibly a generalization of 6b07fe6 could help here. We could add a function to utils to deal with encoding, add the workaround there and use it instead of .encode('utf-8').
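
One possible shape for such a helper is sketched below. It assumes nothing about what 6b07fe6 actually contains; `decode_header_bytes` is a hypothetical name, and the fallback order (declared charset, then UTF-8, then UTF-8 with replacement characters) is an assumption. The sketch is written for Python 3:

```python
def decode_header_bytes(raw, charset=None):
    """Decode header bytes to text without raising (hypothetical helper)."""
    for candidate in (charset, 'utf-8'):
        if not candidate:
            continue
        try:
            return raw.decode(candidate)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue
    # Last resort: keep what is readable and mark the rest with U+FFFD.
    return raw.decode('utf-8', errors='replace')


print(ascii(decode_header_bytes(b'\xe3?x', 'utf-8')))
```

A header like the one in this issue would then come out with a replacement character instead of aborting the whole run, at the cost of a slightly mangled subject in the log.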

caiobegotti commented 7 years ago

Can you @blackberryoctopus and @jhurliman try the latest code please? I am not getting errors myself anymore, thanks to changes @cmatsuoka landed.

jhurliman commented 7 years ago

Traceback (most recent call last):
  File "./lpf.py", line 40, in <module>
    imap.lostphotosfound()
  File "/Volumes/Storage/jhurliman/Code/Lost-Photos-Found/lostphotosfound/server.py", line 264, in lostphotosfound
    header_subject = _charset_decoder(mail['Subject'])
  File "/Volumes/Storage/jhurliman/Code/Lost-Photos-Found/lostphotosfound/utils.py", line 42, in _charset_decoder
    header = header[0][0].decode(header[0][1]).encode('utf-8')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd3 in position 76: invalid continuation byte

caiobegotti / Lost-Photos-Found

Message Parsing Fails with Japanese Characters #5