GAM-team / got-your-back

Got Your Back (GYB) is a command line tool for backing up your Gmail messages to your computer using Gmail's API over HTTPS.
Apache License 2.0

Unhandled exception when cleaning message with unicode/emoji in (From:) headers. #433

Open Leftium opened 11 months ago

Leftium commented 11 months ago

Full steps to reproduce the issue:

  1. Back up a message that ~is not saved in UTF8 format~ has unicode/emoji in its From: header.
  2. Restore email using --cleanup.

Expected outcome: GYB gracefully handles unicode/emoji in headers, either:

Actual outcome: GYB exits with unhandled exception:

Traceback (most recent call last):166783)
  File "", line 2767, in <module>
  File "", line 2239, in main
  File "", line 1947, in message_hygiene
  File "", line 1891, in cleanup_from
  File "email\", line 215, in parseaddr
  File "email\", line 517, in __init__
  File "email\", line 260, in getaddrlist
TypeError: object of type 'Header' has no len()
[31420] Failed to execute script 'gyb' due to unhandled exception!


Suggested alternative fix: always convert non UTF8 files to UTF8 when saving backup.


Python 3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> f = open('2021/8/14/17b437668e8b5c17.eml', 'rb')
>>> bytes = f.read()
>>> m = email.message_from_bytes(bytes)
>>> m['to']
>>> m['from']
<email.header.Header object at 0x000002B33DED8410>
>>> len(m['from'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'Header' has no len()
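
The session above can be reproduced without the original .eml file. The sketch below assumes a made-up message (the addresses and the `raw` bytes are hypothetical) whose From: display name is raw UTF-8 rather than RFC 2047 encoded. Under the default `compat32` policy, a header containing undecodable 8-bit bytes comes back as an `email.header.Header` object instead of a `str`, which is why `len()` fails on it:

```python
import email
import email.header

# Hypothetical message: the From display name is raw UTF-8 bytes,
# not an RFC 2047 encoded word.
raw = (
    'From: "한웰이쇼핑" <sender@example.com>\r\n'
    'To: rcpt@example.com\r\n'
    'Subject: test\r\n'
    '\r\n'
    'body\r\n'
).encode('utf-8')

msg = email.message_from_bytes(raw)  # default policy is compat32

to_value = msg['to']      # pure ASCII header, returned as a plain str
from_value = msg['from']  # raw 8-bit header, returned as a Header object

print(type(to_value).__name__, type(from_value).__name__)

# Header defines no __len__, so code that expects a str breaks here.
try:
    len(from_value)
    len_error = None
except TypeError as error:
    len_error = str(error)

print(len_error)
```

Any code that treats the header value as a string (as the address parsing in the traceback does) hits the same problem.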

Mangled text:

From: "(주)한웰이쇼핑"<>
To: J***********y<j***@l*****>
Subject: [´ÙÀ̼Ҹô] °³ÀÎÁ¤º¸ À¯È¿±â°£Á¦¿¡ µû¸¥ ÈÞ¸é°èÁ¤ Àüȯ ¾È³»µå¸³´Ï´Ù.

Proper text:

From: "(주)한웰이쇼핑" <>
To: "J***********y" <j***@l*****>
Subject: [다이소몰] 개인정보 유효기간제에 따른 휴면계정 전환 안내드립니다.
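
The mangled subject looks consistent with EUC-KR bytes being displayed as Latin-1, although that is an assumption and not something confirmed in this thread. A minimal sketch of how that kind of mangling arises, and of the "convert to UTF8 when saving" idea, using only a short sample string:

```python
# Assumption: the original text was stored as EUC-KR.
original = '다이소몰'
stored_bytes = original.encode('euc-kr')

# Reading those bytes with the wrong charset produces mojibake.
mangled = stored_bytes.decode('latin-1')

# Latin-1 maps every byte to one code point, so the mistake is reversible
# as long as the correct source charset is known.
recovered = mangled.encode('latin-1').decode('euc-kr')

# Re-encoding as UTF-8 is the conversion the suggested fix would apply.
utf8_bytes = recovered.encode('utf-8')

print(mangled != original, recovered == original)
```

The hard part in practice is knowing the source charset. A message without a charset declaration gives the tool nothing reliable to convert from.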

Leftium commented 11 months ago

Update: This issue isn't limited to non-UTF8 files.

Some UTF8 encoded files also throw this exception. For example, if the From header has emoji:

From:🔥Keto_Rapid_Diet🔥 <>

There were also more emails from the Korean address (From: "(주)한웰이쇼핑" <>) that failed to restore even after converting the .eml file to UTF8 and ensuring there were no mangled characters.

The best work-around seems to be to rename these .eml files so gyb skips them.

Leftium commented 11 months ago

I modified my local copy to catch these exceptions, printing the problem message's info and continuing with the remaining messages:

  if options.cleanup:
      try:
          full_message = message_hygiene(full_message)
      except TypeError as error:
          print(
              f'WARNING! error cleaning message {message_num} ({message_filename})')
          print(f'  {error}')
          print(f'  this message will be skipped.')
          continue  # assumes this runs inside the per-message restore loop

Compare to original code.

Leftium commented 11 months ago

Got the fix on StackOverflow: policy=email.policy.SMTPUTF8

I confirmed the Korean text was restored without mangling, but the emoji ended up mangled. Perhaps because the emoji From name is not wrapped in quotes? Not a big deal, since the emoji was from a spam email.

def message_hygiene(msg):
    '''Ensure Message-Id, Date and From headers are valid. Replace if not.'''
    omsg = email.message_from_bytes(msg, policy=email.policy.SMTPUTF8)
    orig_id = omsg['message-id']
    orig_date = omsg['date']
    orig_from = omsg['from']
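
To illustrate what the policy change does, here is a minimal sketch with a made-up message (the addresses and `raw` bytes are hypothetical, and this is not GYB's actual code path). With `policy=email.policy.SMTPUTF8`, the From: header comes back as a `str` subclass, so `len()` and string-based address handling work, and raw UTF-8 bytes in a quoted display name are decoded instead of being wrapped in a `Header` object:

```python
import email
import email.policy  # needed so that email.policy.SMTPUTF8 resolves

# Hypothetical message with a raw UTF-8 display name in From:.
raw = (
    'From: "한웰이쇼핑" <sender@example.com>\r\n'
    'To: rcpt@example.com\r\n'
    'Subject: test\r\n'
    '\r\n'
    'body\r\n'
).encode('utf-8')

msg = email.message_from_bytes(raw, policy=email.policy.SMTPUTF8)
from_value = msg['from']

# The header object is a str subclass, so len() no longer raises.
print(isinstance(from_value, str), len(from_value) > 0)

# Address headers under this policy also expose parsed addresses.
addr_spec = from_value.addresses[0].addr_spec
print(addr_spec)
```

This only covers the quoted display name case. The unquoted emoji From: header mentioned above may parse differently, and it is not exercised here.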