Garbled text in emails with Japanese script

elylv commented 1 year ago

When grabbing the HtmlBody of Outlook emails with Japanese text, we often get a mixture of correct text and Unicode placeholder characters: ��

This has been an issue for a number of versions (it still appears on the latest, 4.5.2) and occurs on .NET 6/7. Some of our other developers report that the issue does NOT appear on .NET Framework 4.7.2.

Example message, how it appears in Outlook:

HTML Body output of message:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta name=Generator content="Microsoft Word 15 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:DengXian;
    panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
    {font-family:"\@DengXian";
    panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
    {font-family:"MS PGothic";
    panose-1:2 11 6 0 7 2 5 8 2 4;}
@font-face
    {font-family:"\@MS PGothic";}
@font-face
    {font-family:Meiryo;}
@font-face
    {font-family:"Meiryo UI";}
@font-face
    {font-family:"\@Meiryo UI";}
@font-face
    {font-family:"\@Meiryo";}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0in;
    font-size:11.0pt;
    font-family:"Calibri",sans-serif;
    mso-fareast-language:JA;}
p.MsoPlainText, li.MsoPlainText, div.MsoPlainText
    {mso-style-priority:99;
    mso-style-link:"Plain Text Char";
    margin:0in;
    font-size:10.0pt;
    font-family:"Meiryo UI",sans-serif;
    mso-fareast-language:JA;}
span.EmailStyle17
    {mso-style-type:personal-compose;
    font-family:"Calibri",sans-serif;
    color:windowtext;}
span.PlainTextChar
    {mso-style-name:"Plain Text Char";
    mso-style-priority:99;
    mso-style-link:"Plain Text";
    font-family:"Meiryo UI",sans-serif;
    mso-fareast-language:JA;}
.MsoChpDefault
    {mso-style-type:export-only;
    font-family:"Calibri",sans-serif;}
@page WordSection1
    {size:8.5in 11.0in;
    margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
    {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=EN-US link="#0563C1" vlink="#954F72" style='word-wrap:break-word'><div class=WordSection1><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Meiryo UI",sans-serif'>ABC<span lang=JA>ご担当者様</span></span><span style='font-size:10.5pt'><o:p></o:p></span>

</p><p class=MsoNormal><span style='font-size:10.5pt'><o:p>&nbsp;</o:p></span>

</p><p class=MsoNormal><span lang=JA style='font-size:10.5pt;font-family:"Meiryo UI",sans-serif'>お疲れ様です。</span><span style='font-size:10.5pt'><o:p></o:p></span>

</p><p class=MsoNormal><span style='font-size:10.5pt'>blahblahblah)</span><span lang=JA style='font-size:10.5pt;font-family:"Meiryo UI",sans-serif'>へ</span><span style='font-size:10.5pt'><o:p></o:p></span>

</p><p class=MsoPlainText>Q#<span style='font-family:"Meiryo",sans-serif'>123450</span><span lang=JA style='font-size:10.5pt'>��</span><o:p></o:p>

</p><p class=MsoNormal><span lang=JA style='font-size:10.5pt;font-family:"Meiryo UI",sans-serif'>標準構成掲載をお願いいたします。</span><span style='font-size:10.5pt;font-family:"Meiryo UI",sans-serif'><o:p></o:p></span>

</p><p class=MsoNormal><span style='font-size:10.5pt'><o:p>&nbsp;</o:p></span>

</p><p class=MsoNormal><span lang=JA style='font-size:10.5pt;font-family:"Meiryo UI",sans-serif'>何卒よろしくお願いいたします。</span><span style='font-size:10.5pt'><o:p></o:p></span>

</p><p class=MsoNormal><o:p>&nbsp;</o:p>

</p><br />
<p class=msipfooter90245289 align="Left" style="margin:0"><span style='font-size:7.0pt;font-family:Calibri;color:#737373'>Internal Use - Confidential</span>

</p></div></body></html>

How this appears (with Unicode placeholders hardcoded into the text):

Here is the body text of the email as typed in Outlook:

ABCご担当者様 

お疲れ様です。 
blahblahblah)へ 
Q#123450を 
標準構成掲載をお願いいたします。 

何卒よろしくお願いいたします。
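
For illustration only, here is one general way a single character such as を can end up as two placeholder characters: its two Shift-JIS bytes are decoded with a codec that rejects them. This is a minimal Python sketch of that mechanism, not a diagnosis of what happens in this library; the choice of `cp932` and `utf-8` and the helper name `decode_lossy` are assumptions made for the example.

```python
# Illustration only: decode the Shift-JIS (cp932) bytes of one character
# with the wrong codec and compare against the correct codec.
original = "を"
raw = original.encode("cp932")  # two bytes in Shift-JIS


def decode_lossy(data: bytes, codec: str) -> str:
    """Decode bytes, substituting U+FFFD for anything the codec rejects."""
    return data.decode(codec, errors="replace")


garbled = decode_lossy(raw, "utf-8")   # wrong codec: the bytes are not valid UTF-8
restored = decode_lossy(raw, "cp932")  # right codec: the character round-trips

print(repr(garbled))
print(repr(restored))
```

Once the replacement characters have been written into the output string, the original bytes are gone, so the fix has to happen at decode time rather than afterwards.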

Sicos1977 commented 1 year ago

It probably has something to do with missing encoding providers in .NET 6. I have to make some test projects to see why it is going wrong.

BeerendraMC commented 1 year ago

Nope, even after installing the System.Text.Encoding.CodePages package, we are still seeing this issue.

elylv commented 1 year ago

Hi @Sicos1977, testing with the latest version is still giving us the same garbled text in specific circumstances. It seems emails originating from our Japanese-native speakers' systems exhibit the problem, but when edited or reforwarded by our (English-language system) devs, it seems to 'correct' the issue. Are we able to directly email you some .msg files with examples of the problem?

Sicos1977 commented 1 year ago

That would be nice. Please ZIP the files before sending them to sicos2002@hotmail.com. It is very important that you ZIP them, otherwise Hotmail will convert them to EML format and make them useless for me.

Sicos1977 commented 1 year ago

I have received the e-mail. I'll try to look into it this evening.

Sicos1977 commented 1 year ago

At the moment I'm a little bit busy with some other project, I'll try to look into your mails this week or next week.

Sicos1977 commented 1 year ago

I have rewritten the code that extracts HTML from RTF (de-encapsulation). Please try this version (version 5.0.0 on NuGet) and see if it fixes your issue. The previous extractor was somewhat of a mess because of patch on patch on patch, etc. I also finally found some good Microsoft documentation about how to extract HTML from RTF.

Sicos1977 commented 1 year ago

Should all be fixed now. It was god damn hard to figure out how Outlook de-encapsulates encoded chars from an RTF file, but I think I figured it out. Please try version 5.1.0, and if it works correctly then buy me a nice cup of coffee.
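
As a rough illustration of why encoded characters are tricky during de-encapsulation: RTF can write non-ASCII bytes as `\'hh` hex escapes, and a double-byte character then spans two consecutive escapes. The Python sketch below groups each run of escapes into one byte buffer before decoding. It is a simplified example under stated assumptions, not the code used in MSGReader: the function name `decode_hex_escapes` is hypothetical, the code page is hard-coded to `cp932` rather than read from the RTF header, and all other RTF control words are ignored.

```python
import re

# One or more consecutive \'hh escapes, e.g. \'82\'f0
HEX_RUN = re.compile(r"(?:\\'[0-9a-fA-F]{2})+")


def decode_hex_escapes(rtf_fragment: str, codec: str = "cp932") -> str:
    """Replace each run of \\'hh escapes with its decoded text.

    The whole run is decoded at once so that multi-byte characters,
    which span two escapes, are not split apart.
    """

    def replace_run(match: re.Match) -> str:
        hex_digits = match.group(0).replace("\\'", "")
        raw = bytes.fromhex(hex_digits)
        return raw.decode(codec, errors="replace")

    return HEX_RUN.sub(replace_run, rtf_fragment)


print(decode_hex_escapes(r"Q#123450\'82\'f0"))
```

Decoding each escape on its own would hand the decoder half of a double-byte character at a time, which is one way placeholder characters can appear.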

elylv commented 1 year ago

Thanks for your help. This has resolved a lot of the issues; however, we are still seeing some problems with Japanese emails. I will send you some further examples.

Sicos1977 commented 1 year ago

I made some more enhancements for the encoding issues. I added an encoding detector to the project that tries to detect the encoding of a byte array when mixed encodings are used. It is the only solution I can think of, and I assume that, for example, Aspose uses something similar to get the encoding correct. I searched through the RTF specifications to see if I did something wrong when trying to determine the correct encoding, but I really can't find anything that I could possibly be doing wrong.

Just get the latest version from GitHub and try this on some of your problem e-mails. I still have to do some cleanup in the current codebase before I release a new version.
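
To give a feel for what detecting the encoding of a byte array can mean, here is a minimal Python sketch of one simple approach, trial decoding: try each candidate codec strictly and keep the first that accepts the bytes. This is not the detector that was added to the project, whose algorithm is not described in this thread; the function name `guess_encoding` and the candidate list are assumptions chosen for the example.

```python
from typing import Optional, Sequence


def guess_encoding(
    data: bytes,
    candidates: Sequence[str] = ("utf-8", "cp932", "cp1252"),
) -> Optional[str]:
    """Return the first candidate codec that decodes data without errors."""
    for codec in candidates:
        try:
            data.decode(codec)  # strict mode raises on invalid byte sequences
            return codec
        except UnicodeDecodeError:
            continue
    return None  # none of the candidates accepted the bytes


print(guess_encoding("を".encode("cp932")))
print(guess_encoding("を".encode("utf-8")))
```

The order of the candidates matters, because permissive single-byte code pages accept almost any input. Detectors used in practice typically also score how plausible the decoded text looks instead of stopping at the first success.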

Sicos1977 / MSGReader

Garbled text in emails with Japanese script #333