bbottema / outlook-message-parser

A Java parser for Outlook messages (.msg files)

Wrong encoding for bodyHTML #34

Closed Faelean closed 4 years ago

Faelean commented 4 years ago

If an email contains a bodyHTML property (MAPI 0x1013) that is encoded in, for example, UTF-8, the parser ignores the encoding and uses CP1252, causing characters like ü to be displayed as Ã¼.
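To illustrate the symptom, here is a minimal sketch that decodes the UTF-8 bytes of ü with the wrong and the right charset. The class and method names (`MojibakeDemo`, `decode`) are made up for this example and are not part of the parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of outlook-message-parser.
class MojibakeDemo {

    // Decodes raw bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "ü" (U+00FC) is encoded in UTF-8 as the two bytes 0xC3 0xBC.
        byte[] utf8Bytes = "\u00fc".getBytes(StandardCharsets.UTF_8);

        // Wrong charset: each byte becomes its own character (U+00C3 U+00BC, i.e. "Ã¼").
        String wrong = decode(utf8Bytes, Charset.forName("windows-1252"));

        // Right charset: the two bytes are combined back into a single "ü".
        String right = decode(utf8Bytes, StandardCharsets.UTF_8);

        System.out.println(wrong.length()); // 2
        System.out.println(right.length()); // 1
    }
}
```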

https://github.com/bbottema/outlook-message-parser/blob/5a6b5d248b37e70c8ad4280194ff612a497ad9ff/src/main/java/org/simplejavamail/outlookmessageparser/model/OutlookMessage.java#L252-L264

The problem is that the correct charset is not known when the String constructor is called. There might be a more efficient way to do this, but this is what we've come up with to replace line 259:

// Step 1: decode with the default Windows charset so the HTML can be searched.
String convertedString = new String((byte[]) value, CharsetHelper.WINDOWS_CHARSET);
Pattern pattern = Pattern.compile("charset=(\"|)([\\w\\-]+)\\1", Pattern.CASE_INSENSITIVE);
Matcher m = pattern.matcher(convertedString);
if (m.find()) {
    try {
        // Step 2: decode again with the charset declared in the HTML.
        convertedString = new String((byte[]) value, Charset.forName(m.group(2)));
    } catch (IllegalArgumentException e) {
        // unknown or malformed charset name: keep the default-charset result
    }
}
return convertedString;

First step: convert everything as before. Second step: check the resulting String for a charset declaration. The regex matches the following two patterns and extracts the charset:

<meta charset="utf-8" /> 
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

If the resulting String declares a charset, decode the bytes again using that charset and overwrite the result; otherwise keep the String that was already created. The try/catch block guards the Charset.forName call in case someone messed up the charset in the bodyHTML.
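For anyone who wants to try the idea outside the parser, the two steps can be sketched as a standalone helper. This is only an illustration under assumptions: `HtmlBodyDecoder`, `decode` and the `fallback` parameter are hypothetical names, and the regex is the one from the snippet above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of outlook-message-parser.
class HtmlBodyDecoder {

    // Same regex as in the snippet above: optional double quote, charset name, same quote again.
    private static final Pattern CHARSET_PATTERN =
            Pattern.compile("charset=(\"|)([\\w\\-]+)\\1", Pattern.CASE_INSENSITIVE);

    // Decodes raw HTML bytes, preferring the charset declared in the HTML itself.
    static String decode(byte[] raw, Charset fallback) {
        // Step 1: decode with the fallback charset so the markup can be searched.
        String firstPass = new String(raw, fallback);

        // Step 2: look for a charset declaration and decode again if one is found.
        Matcher m = CHARSET_PATTERN.matcher(firstPass);
        if (m.find()) {
            try {
                return new String(raw, Charset.forName(m.group(2)));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: fall through to the first pass.
            }
        }
        return firstPass;
    }

    public static void main(String[] args) {
        String html = "<meta charset=\"utf-8\" /><p>\u00fc</p>";
        byte[] raw = html.getBytes(StandardCharsets.UTF_8);
        String decoded = decode(raw, Charset.forName("windows-1252"));
        System.out.println(decoded.equals(html)); // true
    }
}
```

The first pass works because the charset declaration itself consists of ASCII characters, which decode identically in CP1252 and UTF-8, so the regex can find it even when the rest of the text is garbled.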

bbottema commented 4 years ago

Do you have an .msg for me with an HTML body? I have been unable to produce one; all the emails I save with Outlook are converted to RTF format in the .msg files.

Faelean commented 4 years ago

I'm sorry but I don't have any that I can share. We also haven't been able to create .msg files that have these problems, but the ones provided to us contain private information so I'm not allowed to share them.

derrohrbach commented 4 years ago

Maybe this .msg file was produced by an Exchange server directly? Or maybe by an older version of Outlook.

bbottema commented 4 years ago

Until I get a sample I can't do anything on my end. I tried googling some public examples, but came up empty.

Faelean commented 4 years ago

I've managed to get an example mail, but they do not want this mail to be public. I can send it to you privately, but you cannot upload it to your test resources in this git repository. If you're OK with this I can mail it to you; otherwise I'd have to try and get another one.

bbottema commented 4 years ago

Excellent, of course I agree to those terms. Thank you.

bbottema commented 4 years ago

Fixed in 1.7.7.

Btw, I'm using UTF-8 as the default now rather than the Windows encoding. The detection logic is still very useful for some exotic encodings, like some Chinese character sets.