Sicos1977 / MSGReader

C# Outlook MSG file reader without the need for Outlook
http://sicos1977.github.io/MSGReader
MIT License

Subject special characters not properly displayed in .NET 8 (worked in .NET Framework 4.7.2) - same MSGReader version #398

Closed summerkitsune closed 5 months ago

summerkitsune commented 7 months ago

Describe the bug When you read the Subject of an .msg file with MSGReader in a .NET 8 project, special characters are not displayed properly. In a .NET Framework 4.7.2 project using the same MSGReader version, they are displayed correctly.

To Reproduce I created 2 repositories that contain essentially the same code (and the same dependencies); the only difference between them is the .NET version. The MSGReader version is the same.

Links to the repositories:

The 2 repositories use the same .msg file (same md5 checksums). You can find the .msg file in the repositories.

The .msg file has been created like this:

Expected behavior I expected the special characters to be displayed properly, like in the .NET Framework 4.7.2 project.

Screenshots image

Desktop

I have this problem on my machine but I've also tried on another machine (still Windows) with the same results.

Additional context I encountered this problem while migrating a project from .NET Framework 4.7.2 to .NET 8.

Sicos1977 commented 7 months ago

Did you register the encoding providers?

Sicos1977 commented 7 months ago

https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding.registerprovider?view=net-8.0

Sicos1977 commented 7 months ago

https://www.nuget.org/packages/System.Text.Encoding.CodePages/

Sicos1977 commented 7 months ago

This works by default on .NET Framework, but on the other frameworks you have to do it yourself because they are cross-platform.
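For readers following along, registering the code page provider could look roughly like the sketch below. `CodePageSetup` and its method names are made up for illustration and are not part of MSGReader; only `Encoding.RegisterProvider`, `CodePagesEncodingProvider.Instance` and `Encoding.GetEncoding` are real .NET APIs. On .NET Framework the provider type comes from the System.Text.Encoding.CodePages package linked above.

```csharp
using System.Text;

// Hypothetical helper, not part of MSGReader.
public static class CodePageSetup
{
    // Call once at application startup, before any .msg file is read.
    public static void RegisterCodePages()
    {
        // .NET Core / .NET 5+ ship only a few built-in encodings (UTF-8, UTF-16, ASCII, ...).
        // Registering this provider makes the Windows code pages (1252, 1253, ...) available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Decodes bytes as Windows-1252; without the registration above,
    // GetEncoding(1252) throws NotSupportedException on .NET Core / .NET 5+.
    public static string DecodeWindows1252(byte[] bytes)
    {
        return Encoding.GetEncoding(1252).GetString(bytes);
    }
}
```

Registration only makes the code pages available; it does not change what `Encoding.Default` returns.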

summerkitsune commented 7 months ago

Thanks for your prompt answer :) I didn't register the encoding providers, but I think I tested that earlier and it didn't work.

I tried it again just now:

image image

I then cleaned my solution and rebuilt, and then ran it again:

image

Sicos1977 commented 7 months ago

I have never used MSGReader in .NET 8, so I'm a bit in the dark about how to solve problems on this version of .NET.

You could try looking into this file --> https://github.com/Sicos1977/MSGReader/blob/master/MsgReaderCore/Rtf/Document.cs

And set a break-point on this line --> if (byteBuffer.Count > 0 && reader.TokenType != TokenType.EncodedChar)

And then check whether the chars get decoded correctly.

summerkitsune commented 7 months ago

Thanks, I will do that ^^

Sicos1977 commented 7 months ago

And did you solve it?

summerkitsune commented 7 months ago

Hi again, yes I found out why I am having this issue

In .NET Core and .NET 5/6/7/8, Encoding.Default always returns UTF-8. In .NET Framework, Encoding.Default is the system's active (ANSI) code page. More info

In .NET 8, the subject of ANSI messages was decoded incorrectly because the code no longer used the system's active code page by default and instead decoded the bytes directly with the new Unicode default (UTF-8). Decoding ANSI bytes as Unicode produces these unrecognized characters in the string.
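The mismatch described above can be reproduced without MSGReader. The sketch below is an illustration only: `AnsiVersusUtf8` is a made-up name, and Windows-1252 is assumed as the "ANSI" code page. It decodes the same bytes once as UTF-8 (what `Encoding.Default` is on .NET 8) and once with the code page they were written in.

```csharp
using System.Text;

// Hypothetical demo class, not part of MSGReader.
public static class AnsiVersusUtf8
{
    // "Café" encoded in Windows-1252: the final byte 0xE9 is 'é'.
    public static readonly byte[] CafeInWindows1252 = { 0x43, 0x61, 0x66, 0xE9 };

    // What effectively happens when ANSI bytes are decoded with a UTF-8 default:
    // 0xE9 is not a complete UTF-8 sequence, so it becomes U+FFFD.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Decoding with the code page the bytes were actually written in.
    public static string DecodeAsCodePage(byte[] bytes, int codePage)
    {
        // Needed on .NET Core / .NET 5+; repeated registration is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(codePage).GetString(bytes);
    }
}
```

With these bytes, `DecodeAsUtf8` returns a string containing the replacement character, while `DecodeAsCodePage(bytes, 1252)` returns the intended text.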

I have forked your project and added a fix on a branch created from the 5.5.5 tag (the latest release). You can find the proposed commit here

I have added 2 unit tests that cover this, and all other tests ran successfully (on net462 and net8).

image

Sicos1977 commented 7 months ago

Nice that you found it... I also did not know that Encoding.Default in .NET Core defaults to UTF-8

summerkitsune commented 7 months ago

I'm not 100% sure that the way I fixed this is the best way to do it. .NET Framework 4.x users probably don't need this fix. The fix also depends on UTF.Unknown to detect the encoding, so we could break the experience for .NET Framework users by decoding with that library by default.

I thought that we could maybe wrap the "detection" part with a preprocessor directive (only doing it for 'netstandard' users)
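A rough sketch of what such a split could look like is shown below. It is not the code from the proposed commit: `LegacyTextDecoder` and `fallbackCodePage` are hypothetical, and an explicitly supplied code page stands in for the UTF.Unknown detection step. `NETFRAMEWORK` is the symbol the .NET SDK defines for .NET Framework targets.

```csharp
using System.Text;

// Hypothetical helper, not the code from the proposed commit.
public static class LegacyTextDecoder
{
    public static string Decode(byte[] bytes, int fallbackCodePage)
    {
#if NETFRAMEWORK
        // .NET Framework: Encoding.Default is already the system's active ANSI
        // code page, so the existing behaviour is left untouched.
        return Encoding.Default.GetString(bytes);
#else
        // .NET Core / .NET 5+: Encoding.Default is UTF-8, so a code page has to be
        // chosen explicitly. A detection step could supply it instead of a parameter.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(fallbackCodePage).GetString(bytes);
#endif
    }
}
```

The design choice here is that .NET Framework builds keep their current code path, so only the newer targets see a behaviour change.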

Sicos1977 commented 7 months ago

Is it possible for you to send me this msg file so that I can look into it and figure out a solution that works in both .NET Framework and newer versions? If so, then please ZIP the MSG file before sending it to sicos2002@hotmail.com

summerkitsune commented 6 months ago

Hi, I sent you an email with what you asked

Sicos1977 commented 6 months ago

Thanks... I'll look into it and try to figure out a solution that will work on all .NET versions

summerkitsune commented 6 months ago

Hi Kees, have you had time to figure out a solution?

Sicos1977 commented 6 months ago

> Hi Kees, have you had time to figure out a solution?

Sorry, but I totally forgot this issue... too busy with another project at the moment. I'll give it a new try this weekend.

summerkitsune commented 5 months ago

Hi Kees, have you had the occasion to give it a try?

andhadj commented 5 months ago

We are facing the same issue after upgrading from .NET 6 to .NET 8. After the upgrade, some words in Greek are not decoded correctly.

Sicos1977 commented 5 months ago

This should be fixed in version 5.5.8; it now uses the MessageCodePage property to determine the encoding. It should probably always have been like this.
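The general idea of decoding with the code page stored in the message, rather than with a platform default, can be sketched as follows. This is not MSGReader's actual implementation: `CodePageDecoder`, its parameters and the UTF-8 fallback are assumptions made for illustration, and Windows-1253 (Greek) is used only as an example value.

```csharp
using System;
using System.Text;

// Hypothetical helper, not MSGReader's implementation.
public static class CodePageDecoder
{
    // messageCodePage: the code page number recorded in the message (assumed to be an int).
    public static string Decode(byte[] raw, int messageCodePage)
    {
        // Needed on .NET Core / .NET 5+ so that Windows code pages can be resolved.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding encoding;
        try
        {
            encoding = Encoding.GetEncoding(messageCodePage);
        }
        catch (Exception ex) when (ex is NotSupportedException || ex is ArgumentException)
        {
            // Assumed fallback when the stored code page is unknown or invalid.
            encoding = Encoding.UTF8;
        }

        return encoding.GetString(raw);
    }
}
```

For example, the bytes `0xE1 0xE2` decoded with code page 1253 give the Greek letters alpha and beta, regardless of which .NET version or operating system runs the code.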

Sorry for the long delay, but I was too busy with other things and kept finding reasons to ignore this problem :-)

andhadj commented 5 months ago

@Sicos1977 Our scenario is still not working. We have an email in msg format which contains Greek characters in the body. We are retrieving the body through the BodyHtml property and the encoding for most words is wrong. I could email you the sample msg file if you can take a look into this.

image

Sicos1977 commented 5 months ago

This is probably another encoding issue that has nothing to do with the issue of the original poster, because that issue is solved. I guess this is some kind of encoding problem that happens when the HTML is extracted from RTF, but to be sure I have to see this message. If possible, then ZIP the msg file and send it to sicos2002@hotmail.com