bbottema / outlook-message-parser

A Java parser for Outlook messages (.msg files)

Add support for parsing RTF email messages #16

Open fadeyev opened 4 years ago

fadeyev commented 4 years ago

As discussed in https://github.com/bbottema/outlook-message-parser/pull/15, there are Outlook .msg files that have only an RTF body because they were created directly as RTF, not from HTML (you can create such an email in Outlook by selecting FORMAT TEXT tab -> Format section -> Rich Text when creating a new message). The current parser doesn't turn such emails into anything close to readable.

To support this we need a generic RTF parser that can parse an arbitrary RTF file and convert it to HTML. It should handle all RTF formatting, like \pard\plain \f0\b, and convert it to HTML tags (like <div>, <span>, etc.) and style attributes (like font-size, font-family, etc.). We can probably combine the current parser with the generic one written by kschroeer/rtf-html-java.
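To make the idea concrete, here is a minimal sketch of what "generic" parsing means: walk the RTF, track character formatting per group, and emit <span> elements with style attributes. The class name MiniRtfToHtml and its convert method are hypothetical and are not part of outlook-message-parser, rtf-to-html, or kschroeer/rtf-html-java. It only understands \b, \i, \plain and \par, skips a few destination groups such as \fonttbl, and does not handle hex escapes (\'hh), Unicode, fonts, colors, or sizes.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch for illustration only; not an existing library API.
final class MiniRtfToHtml {

    // Only bold and italic are tracked; a real converter also needs fonts, sizes, colors.
    private static final class Style {
        final boolean bold, italic;
        Style(boolean bold, boolean italic) { this.bold = bold; this.italic = italic; }
    }

    static String convert(String rtf) {
        StringBuilder html = new StringBuilder("<div>");
        StringBuilder run = new StringBuilder();   // text collected under the current style
        Deque<Style> groups = new ArrayDeque<>();  // styles saved at each '{'
        Style cur = new Style(false, false);
        int skipDepth = -1;                        // >= 0 while inside an ignored destination
        int i = 0;
        while (i < rtf.length()) {
            char c = rtf.charAt(i++);
            if (c == '{') {
                flush(html, run, cur);
                groups.push(cur);
            } else if (c == '}') {
                flush(html, run, cur);
                if (!groups.isEmpty()) cur = groups.pop();  // formatting is scoped to the group
                if (groups.size() < skipDepth) skipDepth = -1;
            } else if (c == '\\' && i < rtf.length()) {
                char next = rtf.charAt(i);
                if (!Character.isLetter(next)) {            // control symbol: \\ \{ \} \* ...
                    i++;
                    if (skipDepth >= 0) continue;
                    if (next == '*' && !groups.isEmpty()) skipDepth = groups.size();
                    else if (next == '\\' || next == '{' || next == '}') run.append(next);
                    continue;
                }
                int start = i;                              // control word: letters, optional number
                while (i < rtf.length() && Character.isLetter(rtf.charAt(i))) i++;
                String word = rtf.substring(start, i);
                int numStart = i;
                if (i < rtf.length() && rtf.charAt(i) == '-') i++;
                while (i < rtf.length() && Character.isDigit(rtf.charAt(i))) i++;
                String param = rtf.substring(numStart, i);
                if (i < rtf.length() && rtf.charAt(i) == ' ') i++;  // one delimiter space is part of the word
                if (skipDepth >= 0) continue;
                boolean on = !param.equals("0");            // \b turns bold on, \b0 turns it off
                flush(html, run, cur);
                switch (word) {
                    case "b": cur = new Style(on, cur.italic); break;
                    case "i": cur = new Style(cur.bold, on); break;
                    case "plain": cur = new Style(false, false); break;
                    case "par": html.append("<br/>"); break;
                    case "fonttbl": case "colortbl": case "stylesheet": case "info":
                        if (!groups.isEmpty()) skipDepth = groups.size();
                        break;
                    default: break;  // \rtf1, \pard, \f0, ... are ignored in this sketch
                }
            } else if (c != '\r' && c != '\n' && skipDepth < 0) {
                run.append(c);       // raw line breaks carry no meaning in RTF
            }
        }
        flush(html, run, cur);
        return html.append("</div>").toString();
    }

    // Writes the pending text run, wrapped in a styled <span> when any formatting is active.
    private static void flush(StringBuilder html, StringBuilder run, Style s) {
        if (run.length() == 0) return;
        String text = run.toString().replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        run.setLength(0);
        String css = (s.bold ? "font-weight:bold;" : "") + (s.italic ? "font-style:italic;" : "");
        if (css.isEmpty()) html.append(text);
        else html.append("<span style=\"").append(css).append("\">").append(text).append("</span>");
    }
}
```

With this sketch, MiniRtfToHtml.convert applied to the RTF text {\rtf1\ansi{\fonttbl{\f0 Arial;}}\pard\plain\f0 Hello \b bold\b0  text\par} should return <div>Hello <span style="font-weight:bold;">bold</span> text<br/></div>. Emitting one <span> per text run, instead of opening and closing <b>/<i> tags as control words arrive, keeps the output well-formed even when formatting toggles overlap.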

bbottema commented 4 years ago

Perfect. The change probably belongs in bbottema/rtf-to-html, though.

fadeyev commented 4 years ago

Ah, my bad, sorry - you can move the request to that project if you like.

bbottema commented 4 years ago

It's fine like this, no problem.

bbottema commented 4 years ago

I've had a talk with @kschroeer and he is willing to have his code merged with this code base into one cohesive solution. He did stress that he wants to make sure the solution is not tied to any other libraries, to keep it as lightweight as possible, something I totally agree with.

Swing could be an optional dependency if people really would like to play with that option, and I myself would like to keep the option available for completeness' sake.

Finally, the result should be as you state in your opening: take kschroeer/rtf-html-java as a base, add the specifics of the RFC-compliant converter, and define defaults for non-RTF-HTML elements.

Faelean commented 4 years ago

When viewing these two rtf mails

https://github.com/Sicos1977/MSGReader/blob/master/MsgReaderTests/SampleFiles/RtfSampleEmail.msg
https://github.com/Sicos1977/MSGReader/blob/master/MsgReaderTests/SampleFiles/RtfSampleEmailWithAttachment.msg

I get the following as the textHTML (screenshot from the second one as the first contains way too much text):

image

Is this related to this enhancement or a separate issue?