awdeorio / mailmerge

A simple, command line mail merge tool.
MIT License

Chardet specifies wrong encoding with emoji #61

Closed seshrs closed 4 years ago

seshrs commented 4 years ago

## TL;DR

It looks like chardet doesn't always get the encoding right. For a message containing emoji, chardet reported (with low confidence, about 0.53) that the message was encoded in Windows-1252. Python then raises a `UnicodeEncodeError` when trying to encode the emoji string to that charset, because Windows-1252 has no mapping for emoji.

I don't know if there's a good fix. Can we assume that users are responsible for specifying an encoding if it's not UTF-8? (That would remove our reliance on chardet.)

## The bug

Steps I followed on my Mac:

  1. Use the following `mailmerge_template.txt`:

         TO: {{email}}
         SUBJECT: Testing mailmerge
         FROM: My Self <myself@mydomain.com>

         Hi 😀

  2. Run `mailmerge --dry-run --limit 1`

I received this error:

    ...
      File "/Users/seshrs/Documents/Git/mailmerge/mailmerge/template_message.py", line 76, in _transform_encoding
        part.set_charset(encoding)
      File "/Users/seshrs/Documents/Git/mailmerge/env/lib/python3.7/site-packages/future/backports/email/message.py", line 322, in set_charset
        self._payload = charset.body_encode(self._payload)
      File "/Users/seshrs/Documents/Git/mailmerge/env/lib/python3.7/site-packages/future/backports/email/charset.py", line 403, in body_encode
        string = string.encode(self.output_charset)
      File "/usr/local/bin/../Cellar/python/3.7.4_1/bin/../Frameworks/Python.framework/Versions/3.7/lib/python3.7/encodings/cp1252.py", line 12, in encode
        return codecs.charmap_encode(input,errors,encoding_table)
    UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f600' in position 3: character maps to <undefined>


## The cause
I set a breakpoint in the `_transform_encoding` function:
https://github.com/awdeorio/mailmerge/blob/3a78fbd4372cd67e85cf8930288fbd0e2f2adfc1/mailmerge/template_message.py#L66-L74

And printed the contents of `detected`:

    (Pdb++) detected
    {'encoding': 'Windows-1252', 'confidence': 0.5334615384615384, 'language': ''}


I had expected 'utf-8', not 'Windows-1252'. To confirm this was the issue, I tried executing the following in Python:

    >>> "hi 😀".encode('Windows-1252')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/encodings/cp1252.py", line 12, in encode
        return codecs.charmap_encode(input,errors,encoding_table)
    UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f600' in position 3: character maps to <undefined>

(The call to `part.set_charset(encoding)` eventually reaches `string.encode(self.output_charset)` in python-future's `body_encode`, which does essentially the same thing as the snippet above.)
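One possible guard, sketched here under assumptions, is to check that the text can actually be encoded in whatever charset was detected before passing it on, and to fall back to UTF-8 otherwise. The names `can_encode` and `safe_encoding` are hypothetical helpers for illustration, not part of mailmerge or chardet.

```python
def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
    except (UnicodeEncodeError, LookupError):
        # UnicodeEncodeError: a character has no mapping in this charset.
        # LookupError: Python does not know a codec by this name.
        return False
    return True


def safe_encoding(text, detected_encoding, fallback="utf-8"):
    """Use the detected encoding only if it can represent the text."""
    if detected_encoding and can_encode(text, detected_encoding):
        return detected_encoding
    return fallback


# The case from this issue: the detector guessed Windows-1252 for emoji text.
print(safe_encoding("hi 😀", "Windows-1252"))
# Text that Windows-1252 can represent keeps the detected encoding.
print(safe_encoding("café", "Windows-1252"))
```

This keeps the detector in the loop but stops a wrong guess from turning into a crash; it does not make the guess itself any more reliable.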

awdeorio commented 4 years ago

I added two tests in bugfix/61-emoji: one with plain emoji, the other using Markdown rendering with emoji.

It would be nice to avoid using the chardet library altogether. This library is the root cause of Issue #46, too.

One idea would be to check whether any characters fall outside the ASCII range [0, 127], then set the encoding to ascii if none do and utf8 otherwise.
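A minimal sketch of that idea, assuming the message body is already available as a Python string. The function name `choose_encoding` is hypothetical and not taken from the mailmerge code.

```python
def choose_encoding(text):
    """Pick 'ascii' if all code points are in [0, 127], else 'utf-8'."""
    if all(ord(char) < 128 for char in text):
        return "ascii"
    return "utf-8"


for body in ["Hi", "Hi 😀"]:
    encoding = choose_encoding(body)
    # Encoding with the chosen charset never raises, since UTF-8 can
    # represent every character that ASCII cannot.
    payload = body.encode(encoding)
    print(encoding, len(payload))
```

The trade-off is that messages in other charsets are always sent as UTF-8, which matches the suggestion earlier in this thread that users specify an encoding themselves if they need something else.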