GAM-team / got-your-back

Got Your Back (GYB) is a command line tool for backing up your Gmail messages to your computer using Gmail's API over HTTPS.
https://github.com/GAM-team/got-your-back/wiki
Apache License 2.0
2.56k stars, 203 forks

Unhandled exception when cleaning message with unicode/emoji in (From:) headers. #433

Open Leftium opened 11 months ago

Leftium commented 11 months ago

Full steps to reproduce the issue:

  1. Back up email that includes a message that ~is not saved in UTF8 format~ has unicode/emoji in the From: header.
  2. Restore email using --cleanup.

Expected outcome: GYB gracefully handles unicode/emoji in headers, either:

Actual outcome: GYB exits with unhandled exception:

Traceback (most recent call last):
  File "gyb.py", line 2767, in <module>
  File "gyb.py", line 2239, in main
  File "gyb.py", line 1947, in message_hygiene
  File "gyb.py", line 1891, in cleanup_from
  File "email\utils.py", line 215, in parseaddr
  File "email\_parseaddr.py", line 517, in __init__
  File "email\_parseaddr.py", line 260, in getaddrlist
TypeError: object of type 'Header' has no len()
[31420] Failed to execute script 'gyb' due to unhandled exception!

Work-around:

Suggested alternative fix: always convert non UTF8 files to UTF8 when saving backup.

Notes:

Python 3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> f = open('2021/8/14/17b437668e8b5c17.eml', 'rb')
>>> bytes = f.read()
>>> m = email.message_from_bytes(bytes)
>>> m['to']
'J***********y<j***@l*****m.com>'
>>> m['from']
<email.header.Header object at 0x000002B33DED8410>
>>> len(m['from'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'Header' has no len()
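The transcript above can be reproduced without the original file. This is a minimal sketch, assuming a hand-built message (the `raw` bytes are illustrative, not the actual backup). When a header contains raw non-ASCII bytes, the default `compat32` policy returns an `email.header.Header` object instead of a `str`, and `Header` has no `__len__`:

```python
import email
import email.header

# Illustrative message with raw (unencoded) UTF-8 bytes in the From header.
# This is an assumption for demonstration, not the message from the backup.
raw = (
    'From: "(주)한웰이쇼핑" <help@daisomall.co.kr>\r\n'
    'Subject: test\r\n'
    '\r\n'
    'body\r\n'
).encode('utf-8')

msg = email.message_from_bytes(raw)  # default policy is email.policy.compat32
from_header = msg['from']

# Non-ASCII bytes make compat32 hand back a Header object, not a str.
print(type(from_header).__name__)

try:
    len(from_header)
except TypeError as error:
    print(f'TypeError: {error}')
```

An ASCII-only header such as `Subject: test` still comes back as a plain `str`, which is why only some messages trigger the failure.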

Mangled text:

From: "(주)한웰이쇼핑"<help@daisomall.co.kr>
To: J***********y<j***@l*****m.com>
Subject: [´ÙÀ̼Ҹô] °³ÀÎÁ¤º¸ À¯È¿±â°£Á¦¿¡ µû¸¥ ÈÞ¸é°èÁ¤ Àüȯ ¾È³»µå¸³´Ï´Ù.

Proper text:

From: "(주)한웰이쇼핑" <help@daisomall.co.kr>
To: "J***********y" <j***@l*****m.com>
Subject: [다이소몰] 개인정보 유효기간제에 따른 휴면계정 전환 안내드립니다.

Leftium commented 11 months ago

Update: This issue isn't limited to non-UTF8 files.

Some UTF8 encoded files also throw this exception. For example, if the From header has emoji:

From:🔥Keto_Rapid_Diet🔥 <xafnsbqsmgniwdztev@twhzbt.drivefact.org>

There were also more emails from the Korean address (From: "(주)한웰이쇼핑" <help@daisomall.co.kr>) that failed to restore even after converting the .eml file to UTF8 and ensuring there were no mangled characters.

The best work-around seems to be to rename these .eml files so gyb skips them.
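To find which files to rename, something like the following could scan a backup folder ahead of a restore. It is only a sketch: `find_problem_messages` is a hypothetical helper, not part of GYB, and it checks only the From header (the one `cleanup_from` parses in the traceback), so other headers may still cause trouble.

```python
import email
import email.header
from pathlib import Path


def find_problem_messages(backup_dir):
    """Return .eml paths whose From header parses to a Header object.

    Hypothetical helper, not part of GYB. With the default compat32
    policy, a From header holding raw non-ASCII bytes comes back as
    email.header.Header, which is the type behind the TypeError above.
    """
    problems = []
    for path in sorted(Path(backup_dir).rglob('*.eml')):
        msg = email.message_from_bytes(path.read_bytes())
        if isinstance(msg['from'], email.header.Header):
            problems.append(path)
    return problems
```

Calling `find_problem_messages('path/to/backup')` returns a list of paths that can be printed or renamed by hand; the folder name here is a placeholder.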

Leftium commented 11 months ago

I modified my gyb.py to catch these exceptions, printing the problem message info and continuing with the remaining messages:

  if options.cleanup:
      try:
          full_message = message_hygiene(full_message)
      except TypeError as error:
          print(
              f'WARNING! error cleaning message {message_num} ({message_filename})')
          print(f'  {error}')
          print(f'  this message will be skipped.')
          continue

Compare to original code.

Leftium commented 11 months ago

Got the fix on Stack Overflow: pass policy=email.policy.SMTPUTF8 when parsing the message.

I confirmed the Korean was restored without mangling, but the emoji ended up being mangled. Perhaps because the emoji From name was not wrapped in quotes? Not a big deal since the emoji was from a spam email.

def message_hygiene(msg):
    '''Ensure Message-Id, Date and From headers are valid. Replace if not.'''
    # Needs `import email.policy` at module level if gyb.py does not already import it.
    omsg = email.message_from_bytes(msg, policy=email.policy.SMTPUTF8)
    orig_id = omsg['message-id']
    orig_date = omsg['date']
    orig_from = omsg['from']
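The effect of the policy change can be checked in isolation. This is a rough sketch, assuming an illustrative message rather than the real backup: with `email.policy.SMTPUTF8` the From header comes back as a `str` subclass, so `len()` and string-based parsing no longer fail on the type.

```python
import email
import email.header
import email.policy

# Illustrative message with raw UTF-8 bytes in the From header (an assumption,
# not the message from the backup).
raw = (
    'From: "(주)한웰이쇼핑" <help@daisomall.co.kr>\r\n'
    'Subject: test\r\n'
    '\r\n'
    'body\r\n'
).encode('utf-8')

# Default (compat32) policy: the header is an email.header.Header object.
legacy_from = email.message_from_bytes(raw)['from']

# SMTPUTF8 policy: the header is a str subclass.
fixed_from = email.message_from_bytes(raw, policy=email.policy.SMTPUTF8)['from']

print(type(legacy_from).__name__, isinstance(legacy_from, str))
print(type(fixed_from).__name__, isinstance(fixed_from, str))
print(len(fixed_from))  # works because fixed_from is a str subclass
```

This only shows that the type problem goes away. It does not explain the emoji mangling mentioned above, which would need a separate look at how the unquoted display name is re-serialized.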