dunnkers / eml-to-html

Tiny CLI tool that converts .eml email files to .html files
MIT License

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 3444: invalid continuation byte #2

Open zaghadon opened 6 months ago

zaghadon commented 6 months ago

I ran this package on a folder containing exported eml files and I kept getting this error.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 3444: invalid continuation byte

Traceback (most recent call last):
  File "/Users/user/projects/storytella/.venv/bin/eml-to-html", line 8, in <module>
    sys.exit(main())
  File "/Users/user/projects/storytella/.venv/bin/eml_to_html.py", line 52, in main
    eml_to_html(file_path)
  File "/Users/user/projects/storytella/.venv/bin/eml_to_html.py", line 39, in eml_to_html
    message: Message = message_from_file(eml_file)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/email/__init__.py", line 54, in message_from_file
    return Parser(*args, **kws).parse(fp)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/email/parser.py", line 53, in parse
    data = fp.read(8192)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 3444: invalid continuation byte

zaghadon commented 6 months ago

I cloned the repo and tried to trace the error from the stack trace.

Then I ran grep -axv '.*' ./* to list the lines containing invalid UTF-8 byte sequences, and I got:

./I just fired myself.eml:    ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ � [...] 
./I'm a liar. And an A-hole.eml:    ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ � [...] 
./One question.eml:    ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ � [...] 
./piching through email.eml:    ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ � [...] 
./TL DR is a disease.eml:    ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ � [...] 
./The 3rd guest.eml:    ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ � [...] 

This indicates that the eml files contain some bytes that are not valid UTF-8, which breaks the program's execution.
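For anyone who wants to repeat this check without grep, a rough Python equivalent might look like the sketch below. It is only an illustration: the helper names `find_invalid_utf8` and `scan_folder` are made up here and are not part of eml-to-html, and it reports the byte offset of the first bad sequence rather than the offending line.

```python
from pathlib import Path
from typing import Dict, Optional


def find_invalid_utf8(path: Path) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    data = path.read_bytes()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that could not be decoded
        return exc.start
    return None


def scan_folder(folder: Path) -> Dict[str, int]:
    """Map each .eml file name in `folder` to the offset of its first bad byte."""
    bad: Dict[str, int] = {}
    for eml_path in sorted(folder.glob("*.eml")):
        offset = find_invalid_utf8(eml_path)
        if offset is not None:
            bad[eml_path.name] = offset
    return bad
```

Calling `scan_folder(Path("."))` in the folder of exported files should list the same files that grep flagged, assuming they all carry the `.eml` extension.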

I followed the advice in this answer to a Stack Overflow question - python file open() throws exception for non utf-8 character:

It seems that anyone opening text files they get from other people and have no way to control or know in advance what is inside might be advised to use "latin-1" because there are no invalid byte values in Latin-1.

So I switched the file encoding to "latin-1" in the codebase and it worked.
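A minimal sketch of what such a change could look like is below. It assumes the file is opened in text mode and handed to `message_from_file`, as the traceback suggests; the function name `read_eml_latin1` is hypothetical and this is not necessarily the exact change made in the repo.

```python
from email import message_from_file
from email.message import Message


def read_eml_latin1(file_path: str) -> Message:
    """Parse an .eml file, decoding its bytes as Latin-1 instead of UTF-8."""
    # Latin-1 maps every byte value 0x00-0xFF to a character, so reading
    # never raises UnicodeDecodeError. The trade-off: bytes that were
    # really UTF-8 multi-byte sequences come out as mojibake.
    with open(file_path, encoding="latin-1") as eml_file:
        return message_from_file(eml_file)
```

The standard library also offers `email.message_from_binary_file`, which takes a file opened in `"rb"` mode; that may be worth comparing in the pull request, since it avoids picking a text encoding up front.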

I'll push this and open a pull request for comments and advice.