bbottema / outlook-message-parser

A Java parser for Outlook messages (.msg files)

Encoding issues with bodyHTML #49

Closed Faelean closed 2 years ago

Faelean commented 2 years ago

I have an issue that results in a problem similar to https://github.com/bbottema/outlook-message-parser/issues/34. With the following code, the returned string has garbled encoding.

try (FileInputStream fileInputStream = new FileInputStream(msgFileName)) {
    OutlookMessageParser outlookMessageParser = new OutlookMessageParser();
    // parse from the stream opened above (the file name was passed here originally, leaving the stream unused)
    OutlookMessage outlookMessage = outlookMessageParser.parseMsg(fileInputStream);

    System.out.println(outlookMessage.getBodyHTML());
}

This is an extract of what is returned:

<p class="MsoNormal"><span style="font-family:&quot;Arial&quot;,sans-serif">ich habe die AB geändert und Ihnen zugeschickt.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:&quot;Arial&quot;,sans-serif"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-family:&quot;Arial&quot;,sans-serif">Im Preis ist die Preiserhöhung ab dem 16.08.2021 enthalten.
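The garbled pairs in this extract (`Ã¤` where `ä` is expected, `Ã¶` where `ö` is expected) are what typically appears when UTF-8 bytes are decoded with a single-byte charset such as ISO-8859-1. A minimal sketch that reproduces the effect using only the JDK; the class `MojibakeDemo` and its methods are made up for illustration and are not part of the parser:

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encodes the text as UTF-8, then decodes those bytes with the wrong (single-byte) charset.
    static String garble(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the damage. This round trip is lossless only because
    // ISO-8859-1 maps every byte value to exactly one character.
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u00e4 is 'ä'; the escape keeps this file independent of the source encoding
        String garbled = garble("ge\u00e4ndert");
        System.out.println(garbled);
        System.out.println(repair(garbled));
    }
}
```

What the console shows depends on its own encoding, so compare the strings in code rather than by eye if in doubt.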

From what I've gathered so far, this happens in this part of your code:

https://github.com/bbottema/outlook-message-parser/blob/4819d1f925a00828faf3c7375e129c785cb730ac/src/main/java/org/simplejavamail/outlookmessageparser/OutlookMessageParser.java#L453-L457

Modifying the code along the lines of what was suggested in #34 fixes the problem (at least for us), and as far as we can tell from our test mails it doesn't break any of them.

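case 0x1e:
    // we put the complete data into a byte[] object...
    final byte[] textBytes1e = getBytesFromDocumentEntry(de);
    // ...and decode it as ISO-8859-1 first, so the markup can be searched for a declared charset
    // (needs java.nio.charset.StandardCharsets; the constant avoids the checked UnsupportedEncodingException)
    String convertedString = new String(textBytes1e, StandardCharsets.ISO_8859_1);
    Pattern pattern = Pattern.compile("charset=(\"|)([\\w\\-]+)\\1", Pattern.CASE_INSENSITIVE);
    Matcher m = pattern.matcher(convertedString);
    if (m.find()) {
        try {
            // re-decode the same bytes with the charset the HTML declares
            convertedString = new String(textBytes1e, Charset.forName(m.group(2)));
        } catch (Exception e) {
            // unknown or unsupported charset name: keep the ISO-8859-1 result
        }
    }
    return convertedString;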
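To try this approach outside the parser, here is a self-contained sketch of the same idea: decode as ISO-8859-1, look for a declared charset, then decode the same bytes again with it. The class `HtmlCharsetSniffer` and the method `decode` are hypothetical names for illustration; this is not the code that ended up in the library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlCharsetSniffer {
    // Same pattern as in the snippet above: charset=name, optionally wrapped in double quotes
    private static final Pattern CHARSET =
            Pattern.compile("charset=(\"|)([\\w\\-]+)\\1", Pattern.CASE_INSENSITIVE);

    static String decode(byte[] raw) {
        // ISO-8859-1 never fails and keeps ASCII markup intact, so it is safe for sniffing
        String fallback = new String(raw, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(fallback);
        if (m.find()) {
            try {
                return new String(raw, Charset.forName(m.group(2)));
            } catch (IllegalArgumentException e) {
                // illegal or unsupported charset name: fall through to the ISO-8859-1 result
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        String html = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">"
                + "<p>ge\u00e4ndert</p>";
        byte[] raw = html.getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(raw).equals(html)); // prints true
    }
}
```

The sketch catches only `IllegalArgumentException`, which covers both failure modes of `Charset.forName`. Note that the pattern takes the first `charset=` it finds anywhere in the body, so a stray occurrence in the text could in principle be picked up.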

I'm currently trying to get example mails. I have one so far, but I cannot publish it here, so I'd have to send it to you directly, on the condition that it isn't published anywhere, including in test cases. If you want, I can send you this one.

bbottema commented 2 years ago

Nope no need. Fixed and released in 1.7.13. Cheers!