Sicos1977 / MSGReader

C# Outlook MSG file reader without the need for Outlook
http://sicos1977.github.io/MSGReader
MIT License

Problem reading an Outlook file saved in ANSI #328

Closed happybald closed 1 year ago

happybald commented 1 year ago

Describe the bug: Problem reading an Outlook file saved in ANSI

To Reproduce: Steps to reproduce the behavior:

  1. Open any Outlook email file with a meeting that contains special characters such as "öäéàèü".
  2. Save it as Outlook Message Format (*.msg) (not Unicode!).
  3. Get it as base64, then pass it through Convert.FromBase64String and a MemoryStream into Storage.Message(memoryStream).
  4. The name contains ������ after reading.

How I get base64 in TypeScript

public toBase64 = (file: File) => new Promise<string>((resolve, reject) => {
    const reader = new FileReader();
    // reader.result is a data URL ("data:<mime>;base64,<payload>"); keep only the payload.
    reader.onload = () => resolve(reader.result!.toString().replace(/^data:[^,]*,/, ''));
    reader.onerror = error => reject(error);
    // Attach the handlers before starting the read.
    reader.readAsDataURL(file);
  });

How I read it on the C# side

var bytes = Convert.FromBase64String(base64);
using var memoryStream = new MemoryStream(bytes);
using var msg = new Storage.Message(memoryStream);

Expected behavior: Reading a file saved in ANSI should return the special characters correctly.

Screenshots: [image] [image]

Desktop:

  Processor: AMD Ryzen 7 5800X 8-Core Processor
  Installed RAM: 32.0 GB
  System type: 64-bit operating system, x64-based processor
  Windows edition: Windows 11 Pro N
  Version: 22H2
  OS build: 22621.1265
  Experience: Windows Feature Experience Pack 1000.22638.1000.0
  Browser: Google Chrome Version 110.0.5481.178 (Official Build) (64-bit)
  Locale: LCID 1033, en-US

Additional context: .NET 6, C# 10, MsgReader 4.4.16

Outlook msg file: öäéàèü Test chars.zip

mastercs999 commented 1 year ago

I know this issue because happybald and I work together. The Base64 encoding during transfer plays no role. The failing code can be simplified to this:

using var msg = new Storage.Message(@"c:\000-Temp\öäéàèü Test chars.msg");
Console.WriteLine(msg.Subject);

Prints: ?????? Test chars
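
The two symptoms reported so far (`������` and `??????`) can be reproduced at the byte level. The following is a minimal TypeScript sketch, assuming the ANSI file stores the subject in Windows-1252; the thread does not state the actual code page. The helper `decodeAsAscii` is hypothetical and only mimics a 7-bit decoder that substitutes `?`. It is not MSGReader code. `TextDecoder` with legacy labels needs a runtime with full ICU data, which browsers and default Node.js builds have.

```typescript
// "öäéàèü Test chars" encoded as Windows-1252 (assumed code page): one byte per character.
const ansiBytes = new Uint8Array([
  0xf6, 0xe4, 0xe9, 0xe0, 0xe8, 0xfc, // ö ä é à è ü
  0x20, 0x54, 0x65, 0x73, 0x74,       // " Test"
  0x20, 0x63, 0x68, 0x61, 0x72, 0x73, // " chars"
]);

// Hypothetical helper: a strict 7-bit ASCII decoder that replaces every byte above 0x7F with '?'.
function decodeAsAscii(bytes: Uint8Array): string {
  return Array.from(bytes, b => (b < 0x80 ? String.fromCharCode(b) : "?")).join("");
}

const correct = new TextDecoder("windows-1252").decode(ansiBytes);
const asUtf8 = new TextDecoder("utf-8").decode(ansiBytes); // invalid sequences become U+FFFD
const asAscii = decodeAsAscii(ansiBytes);

console.log(correct); // öäéàèü Test chars
console.log(asUtf8);  // six U+FFFD replacement characters, then " Test chars"
console.log(asAscii); // ?????? Test chars
```

In both failing cases the bytes are intact and only the decoder choice is wrong, which is consistent with the observation that the Base64 transfer plays no role.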

Sicos1977 commented 1 year ago

Can you send me this MSG file? If so, please ZIP it before sending it to sicos2002@hotmail.com

mastercs999 commented 1 year ago

The link is at the end of the issue description, but I also sent it to your email. Thank you :)

SeRgI1982 commented 1 year ago

Is there any way to work around it? I have an email with a pound sign in the Subject, and I also get '?' when converting such a .msg file.

SeRgI1982 commented 1 year ago

In my case, the file was probably encoded differently. I started playing with a fork of your repo and found that when I change the code to something like this:

case PropertyType.PT_STRING8:
    return GetStreamAsString(containerName, Encoding.UTF7);

the pound sign in my Subject is displayed correctly.

Of course, this is not a solution, only a hint.

I don't know how to detect the encoding and always apply the correct one, no matter how the .msg file was saved.
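
One common heuristic for the detection question is to try strict UTF-8 first and fall back to a single-byte code page when the bytes are not valid UTF-8. The TypeScript sketch below illustrates that idea only. The function name `decodeWithFallback` and the default fallback `windows-1252` are assumptions, and nothing in this thread says MSGReader works this way.

```typescript
// Heuristic sketch: try strict UTF-8 first, then fall back to an assumed single-byte code page.
function decodeWithFallback(bytes: Uint8Array, fallbackLabel = "windows-1252"): string {
  try {
    // fatal: true makes decode() throw a TypeError on any invalid UTF-8 sequence.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    return new TextDecoder(fallbackLabel).decode(bytes);
  }
}

const poundAnsi = new Uint8Array([0xa3, 0x31, 0x30, 0x30]);       // "£100" in Windows-1252
const poundUtf8 = new Uint8Array([0xc2, 0xa3, 0x31, 0x30, 0x30]); // "£100" in UTF-8

console.log(decodeWithFallback(poundAnsi)); // £100
console.log(decodeWithFallback(poundUtf8)); // £100
```

The heuristic can still guess wrong: some ANSI byte sequences happen to be valid UTF-8, and the fallback code page remains a guess about the sender's system.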

Sicos1977 commented 1 year ago

Normally an MSG file can be encoded in one of two ways: ANSI or Unicode. If the setting is Unicode, then every string inside the MSG file has to be read as Unicode. If for whatever reason somebody creates an MSG file with mixed encodings in it, then it is very hard to figure out whether the retrieved string is correct.

Can you send me the MSG file that has this issue so that I can look into it and see whether there is some way to detect the encoding of the subject? If so, please ZIP the MSG file before sending it to sicos2002@hotmail.com
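
For context, the MSG format marks each string property stream as ANSI or Unicode through the type code at the end of the stream name: `001E` for PT_STRING8 and `001F` for PT_UNICODE (UTF-16LE). The TypeScript sketch below shows that distinction. The function names and the default `windows-1252` label are assumptions for illustration, not MSGReader's API. Note that the type code says only "ANSI", not which code page was used.

```typescript
type StringKind = "ansi" | "unicode" | "other";

// Classifies a property stream by the 4-hex-digit type code at the end of its name,
// e.g. "__substg1.0_0037001F" (0x0037 is the subject property ID).
function stringKindOfStream(streamName: string): StringKind {
  if (!/^__substg1\.0_[0-9A-Fa-f]{8}$/.test(streamName)) return "other";
  switch (streamName.slice(-4).toUpperCase()) {
    case "001E": return "ansi";    // PT_STRING8: bytes in some code page
    case "001F": return "unicode"; // PT_UNICODE: UTF-16LE
    default: return "other";
  }
}

// Decodes a string property; the ANSI code page label is an assumption supplied by the caller.
function decodeProperty(streamName: string, bytes: Uint8Array, ansiLabel = "windows-1252"): string {
  const kind = stringKindOfStream(streamName);
  if (kind === "other") throw new Error(`not a string property stream: ${streamName}`);
  return new TextDecoder(kind === "unicode" ? "utf-16le" : ansiLabel).decode(bytes);
}

const unicodeSubject = new Uint8Array([0xf6, 0x00, 0x54, 0x00]); // "öT" as UTF-16LE
const ansiSubject = new Uint8Array([0xf6, 0x54]);                // "öT" as Windows-1252

console.log(decodeProperty("__substg1.0_0037001F", unicodeSubject)); // öT
console.log(decodeProperty("__substg1.0_0037001E", ansiSubject));    // öT
```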

kenjiuno commented 1 year ago

Hi. Here is a sample: Japanese ANSI and Unicode.zip

With Microsoft Office Outlook 2013 or similar, you can switch the MSG export format between ANSI and Unicode with this option: File > Options > Mail > Save messages > Use Unicode format

Sicos1977 commented 1 year ago

Is this the problem?

[image]

If so, then this is something nobody can fix for you, because Chinese is a 2-byte character set and ANSI is 1 byte. You are never going to get this to work because it is technically not possible.

The only reason the text is readable in the body is that HTML is a 1-byte charset that uses a special encoding, so the HTML rendering engine knows it has to show a 2-byte character.

Why do you want to use ANSI anyway? Unicode was invented to fix exactly this kind of issue.
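
One mechanism that fits the "special encoding" described for the HTML body is the numeric character reference, which spells a non-ASCII character using ASCII only. Whether the sample's body actually uses it is an assumption; the TypeScript sketch below shows only the mechanism, and `toNumericReferences` is a hypothetical helper.

```typescript
// Rewrites every non-ASCII code point as an HTML numeric character reference (&#NNNN;).
function toNumericReferences(text: string): string {
  // Array.from iterates a string by code point, so surrogate pairs stay together.
  return Array.from(text, ch => {
    const codePoint = ch.codePointAt(0)!;
    return codePoint < 0x80 ? ch : `&#${codePoint};`;
  }).join("");
}

console.log(toNumericReferences("テスト mail")); // &#12486;&#12473;&#12488; mail
```

Text written this way survives any ASCII-compatible single-byte encoding, because the renderer resolves the references after decoding.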

kenjiuno commented 1 year ago

> Is this the problem?

Although I'm not the OP, yes, that is the problem.

We are just developers. This kind of problem occurs when we run MSGReader against clients' data in the products and software we build.

And I agree that detecting the ANSI encoding cannot be solved in a reasonable way because of the technical difficulty.

One possible approach is to expose System.Text.Encoding to developers so that they can select the ANSI encoding themselves, at their own responsibility.
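
The proposal amounts to letting the caller name the code page instead of having the library guess. The TypeScript sketch below illustrates the idea with `TextDecoder`. The `ReadOptions` interface and `decodeString8` function are hypothetical and not part of MSGReader, and the sample bytes are hand-written rather than taken from the attached file.

```typescript
// Hypothetical reader option: the caller names the ANSI code page instead of the library guessing.
interface ReadOptions {
  ansiEncoding: string; // any label accepted by TextDecoder, e.g. "windows-1252" or "shift_jis"
}

function decodeString8(bytes: Uint8Array, options: ReadOptions): string {
  return new TextDecoder(options.ansiEncoding).decode(bytes);
}

// "テスト" ("test") encoded as Shift_JIS, the usual ANSI code page on Japanese Windows.
const subjectBytes = new Uint8Array([0x83, 0x65, 0x83, 0x58, 0x83, 0x67]);

console.log(decodeString8(subjectBytes, { ansiEncoding: "shift_jis" }));    // テスト
console.log(decodeString8(subjectBytes, { ansiEncoding: "windows-1252" })); // ƒeƒXƒg
```

The same six bytes give readable text or mojibake depending only on the label. In Shift_JIS each of these characters takes two bytes, so a developer who knows where the files come from is in a better position to choose the code page than the reader is.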

mastercs999 commented 1 year ago

In our case the client drags and drops a meeting or email from Outlook into the web solution. That creates a file in this format. We have no control over it.

Sicos1977 commented 1 year ago

In an MSG file there is a parameter that says in what format the streams are stored. If that parameter says ANSI, then you have to read all the streams as 1-byte encoded. There is just no way to fix the encoding issue.

Sicos1977 commented 1 year ago

I'm closing this issue because there is no proper way to detect the stream encodings if the correct encoding is not set.