elylv closed 1 year ago
It probably has something to do with missing encoding providers in .NET 6. I have to make some test projects to see why it is going wrong.
Nope, even after installing the System.Text.Encoding.CodePages package, we are still seeing this issue.
Hi @Sicos1977, testing with the latest version is still giving us the same garbled text in specific circumstances. It seems emails originating from our Japanese-native speakers' systems exhibit the problem, but when edited or reforwarded by our (English-language system) devs, it seems to 'correct' the issue. Are we able to directly email you some .msg files with examples of the problem?
That would be nice. Please ZIP the files before sending them to sicos2002@hotmail.com. It is very important that you ZIP them, otherwise Hotmail will convert them to EML format and make them useless for me.
I have received the e-mail; I'll try to look into it this evening.
At the moment I'm a little bit busy with some other project, I'll try to look into your mails this week or next week.
I have rewritten the code that extracts HTML from RTF (de-encapsulation). Please try this version (version 5.0.0 on NuGet) and see if it fixes your issue. The previous extractor was somewhat of a mess because of patch on patch on patch... etc. I also finally found some good Microsoft documentation about how to extract HTML from RTF.
Should all be fixed now. It was god damn hard to figure out how Outlook de-encapsulates encoded chars from an RTF file, but I think I figured it out. Please try version 5.1.0, and if it works correctly then buy me a nice cup of coffee.
Thanks for your help. This has resolved a lot of the issues; however, we are still seeing some problems with Japanese emails. I will send you some further examples.
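For readers wondering why encoded characters are so awkward here: in RTF, non-ASCII text is often written as `\'hh` hex escapes, one escape per byte, and a Japanese character in a code page such as Shift-JIS spans two of those bytes. The following is a minimal Python sketch of that idea only. It is not MSGReader's implementation; the function names and the default `cp932` code page are assumptions made for illustration, and real RTF also involves `\ansicpg`, font charsets and `\uN` escapes that this sketch ignores.

```python
import re

# One or more consecutive RTF hex escapes, e.g. \'93\'fa
HEX_RUN = re.compile(r"(?:\\'[0-9a-fA-F]{2})+")


def decode_hex_escapes(rtf_text, codepage="cp932"):
    """Decode each run of \\'hh escapes as ONE byte sequence (hypothetical helper)."""
    def replace_run(match):
        # Strip the \' prefixes so only the hex digits remain, then decode
        # the whole run at once so multi-byte characters stay intact.
        raw = bytes.fromhex(match.group(0).replace("\\'", ""))
        return raw.decode(codepage, errors="replace")

    return HEX_RUN.sub(replace_run, rtf_text)


def decode_hex_escapes_bytewise(rtf_text, codepage="cp932"):
    """Naive variant: decodes every escape on its own, splitting multi-byte chars."""
    return re.sub(
        r"\\'([0-9a-fA-F]{2})",
        lambda m: bytes.fromhex(m.group(1)).decode(codepage, errors="replace"),
        rtf_text,
    )


# The bytes 93 FA 96 7B 8C EA are "日本語" in Shift-JIS (cp932).
sample = r"\'93\'fa\'96\'7b\'8c\'ea"

print(decode_hex_escapes(sample))           # run decoded as a whole
print(decode_hex_escapes_bytewise(sample))  # contains U+FFFD placeholders
```

Decoding the run as a whole yields the intended text, while decoding byte by byte turns the lead bytes into U+FFFD placeholders, which is the kind of output reported in this issue.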
I made some more enhancements for the encoding issues. I added an encoding detector to the project that tries to detect the encoding of a byte array when mixed encodings are used. It is the only possible solution I can think of that, for example, Aspose is using to get the encoding correct. I searched through the RTF specifications to see if I did something wrong when trying to determine the correct encoding, but I really can't find anything that I could possibly be doing wrong.
Just get the latest version from GitHub and try this on some of your problem e-mails. I still have to do some cleanup in the current codebase before I release a new version.
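As a rough illustration of what detecting the encoding of a byte array can mean, here is a deliberately naive Python sketch: it tries a list of candidate encodings in order and keeps the first one that decodes without errors. This is not the detector added to the project; `guess_encoding` and its candidate list are assumptions for illustration, and real detectors generally rely on statistical analysis rather than strict trial decoding.

```python
def guess_encoding(data, candidates=("utf-8", "cp932", "euc_jp", "cp1252")):
    """Return (encoding_name, text) for the first candidate that decodes strictly.

    Hypothetical helper: the candidate order matters, because some byte
    sequences are valid in more than one encoding.
    """
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: fall back to the first candidate and let
    # undecodable bytes become U+FFFD placeholders.
    return candidates[0], data.decode(candidates[0], errors="replace")


shift_jis_bytes = "日本語".encode("cp932")
utf8_bytes = "日本語".encode("utf-8")

print(guess_encoding(shift_jis_bytes))  # UTF-8 fails strictly, cp932 succeeds
print(guess_encoding(utf8_bytes))       # UTF-8 succeeds first
```

Trial decoding works for the clear-cut cases above, but it cannot tell apart encodings that both accept the same bytes, which is one reason mixed-encoding e-mails are hard to get right.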
When grabbing the HtmlBody of Outlook emails with Japanese text, we often get a mixture of correct text and Unicode placeholder characters: ��
This has been an issue for a number of versions (it still appears on the latest, 4.5.2) and occurs on .NET 6/7. Some of our other developers tell me that the issue does NOT appear on .NET Framework 4.7.2.
Example message, how it appears in Outlook:
HTML Body output of message:
How this appears (with Unicode placeholders hardcoded into the text):
Here is the body text of the email as typed in Outlook: