Open bbottema opened 3 years ago
Thanks for this. I've done a bit of playing about with the code below using the sample RTF. Unless I am misunderstanding, the only one of the converters that actually puts some HTML paragraph tags into the HTML in place of \par is the JEditor converter. The others seem to produce a new line character.
import org.bbottema.rtftohtml.RTF2HTMLConverter;
import org.bbottema.rtftohtml.impl.RTF2HTMLConverterClassic;
import org.bbottema.rtftohtml.impl.RTF2HTMLConverterJEditorPane;
import org.bbottema.rtftohtml.impl.RTF2HTMLConverterRFCCompliant;

public class RTFTest {
    public static void main(String[] args) {
        // Run the same Outlook plain-text RTF through all three converters.
        RTF2HTMLConverter converter = RTF2HTMLConverterJEditorPane.INSTANCE;
        RTF2HTMLConverter converter2 = RTF2HTMLConverterClassic.INSTANCE;
        RTF2HTMLConverter converter3 = RTF2HTMLConverterRFCCompliant.INSTANCE;
        String rtf = "{\\rtf1\\ansi\\ansicpg1252\\fromtext \\fbidis \\deff0{\\fonttbl\r\n" +
                "{\\f0\\fswiss Arial;}\r\n" +
                "{\\f1\\fmodern Courier New;}\r\n" +
                "{\\f2\\fnil\\fcharset2 Symbol;}\r\n" +
                "{\\f3\\fmodern\\fcharset0 Courier New;}}\r\n" +
                "{\\colortbl\\red0\\green0\\blue0;\\red0\\green0\\blue255;}\r\n" +
                "\\uc1\\pard\\plain\\deftab360 \\f0\\fs20 Hello there\\par\r\n" +
                "\\par\r\n" +
                "This is a plain text email. I'd like to keep \\par\r\n" +
                "\\par\r\n" +
                "The line breaks when using emailToEml()\\par\r\n" +
                "\\par\r\n" +
                "All the best\\par\r\n" +
                "Andy\\par\r\n" +
                "}";
        String html = converter.rtf2html(rtf);
        System.out.println("RTF2HTMLConverterJEditorPane: " + html);
        String html2 = converter2.rtf2html(rtf);
        System.out.println("RTF2HTMLConverterClassic: " + html2);
        String html3 = converter3.rtf2html(rtf);
        System.out.println("RTF2HTMLConverterRFCCompliant: " + html3);
    }
}
RTF2HTMLConverterJEditorPane: <html>
<head>
<style>
<!--
-->
</style>
</head>
<body>
<p class=default>
<span style="color: #000000; font-size: 10pt; font-family: Arial">
Hello there
</span>
</p>
<p class=default>
<span style="color: #000000; font-size: 10pt; font-family: Arial">
</span>
</p>
<p class=default>
<span style="color: #000000; font-size: 10pt; font-family: Arial">
This is a plain text email. I'd like to keep
</span>
</p>
<p class=default>
<span style="color: #000000; font-size: 10pt; font-family: Arial">
</span>
</p>
<p class=default>
<span style="color: #000000; font-size: 10pt; font-family: Arial">
The line breaks when using emailToEml()
</span>
</p>
<p class=default>
<span style="color: #000000; font-size: 10pt; font-family: Arial">
</span>
</p>
<p class=default>
<span style="color: #000000; font-size: 10pt; font-family: Arial">
All the best
</span>
</p>
<p class=default>
<span style="color: #000000; font-size: 10pt; font-family: Arial">
Andy
</span>
</p>
</body>
</html>
RTF2HTMLConverterClassic: <html><body style="font-family:'Courier',monospace;font-size:10pt;">
Hello there
This is a plain text email. I'd like to keep
The line breaks when using emailToEml()
All the best
Andy
</body></html>
RTF2HTMLConverterRFCCompliant: Hello there
This is a plain text email. I'd like to keep
The line breaks when using emailToEml()
All the best
Andy
... which I realise now is exactly what you said in your initial issue description. However if I use a supposedly minimal RTF (from https://interglacial.com/rtf/) I see the same behaviour:
import org.bbottema.rtftohtml.RTF2HTMLConverter;
import org.bbottema.rtftohtml.impl.RTF2HTMLConverterClassic;
import org.bbottema.rtftohtml.impl.RTF2HTMLConverterJEditorPane;
import org.bbottema.rtftohtml.impl.RTF2HTMLConverterRFCCompliant;

public class RTFTest {
    public static void main(String[] args) {
        // Same comparison, this time with a minimal hand-written RTF document.
        RTF2HTMLConverter converter = RTF2HTMLConverterJEditorPane.INSTANCE;
        RTF2HTMLConverter converter2 = RTF2HTMLConverterClassic.INSTANCE;
        RTF2HTMLConverter converter3 = RTF2HTMLConverterRFCCompliant.INSTANCE;
        String rtf = "{\\rtf1\r\n" +
                "{\\fonttbl {\\f0 Times New Roman;}}\r\n" +
                "\\f0\\fs60 Hello, \\par World!\\par \r\n" +
                "}";
        String html = converter.rtf2html(rtf);
        System.out.println("RTF2HTMLConverterJEditorPane: " + html);
        String html2 = converter2.rtf2html(rtf);
        System.out.println("RTF2HTMLConverterClassic: " + html2);
        String html3 = converter3.rtf2html(rtf);
        System.out.println("RTF2HTMLConverterRFCCompliant: " + html3);
    }
}
RTF2HTMLConverterJEditorPane: <html>
<head>
<style>
<!--
-->
</style>
</head>
<body>
<p class=default>
<span style="color: #000000; font-size: 30pt; font-family: Times New Roman">
Hello,
</span>
</p>
<p class=default>
<span style="color: #000000; font-size: 30pt; font-family: Times New Roman">
World!
</span>
</p>
</body>
</html>
RTF2HTMLConverterClassic: <html><body style="font-family:'Courier',monospace;font-size:10pt;"> Hello,
World!
</body></html>
RTF2HTMLConverterRFCCompliant: Hello,
World!
I currently have no idea how to solve this. There's already support for \par codes, but it works differently. It looks like \par can be used in different ways to control paragraphs in RTF.
@fadeyev, any idea?
Here the \par code is replaced with a \n new line character.
In the test sample files, both the simple and complex examples already have HTML tags encapsulated in the RTF before conversion. I don't know enough about RTF to be clear what the thinking behind using a \n is. I assume the sample files were output from something that included encapsulated HTML in the RTF; however, when Outlook generates the RTF for a plain-text message, it seems to use \par much like a <p> tag in HTML.
It also looks like the original RTF RFC is expecting a \pard ... \par pair to wrap around a paragraph, but the Outlook format seems to be using one \pard control word then a bunch of \par control words, so not so easy to do something like
switch (controlWord) {
    case "pard":
        append(result, "<p>", currentGroup);
        break;
    case "par":
        append(result, "</p>", currentGroup);
        break;
}
I had already tried to go that way, but the resulting HTML will blow up with empty paragraphs because \par seems to be used in metadata as well.
Fyi, the reason for producing \n characters is to produce readable HTML source code, rather than having everything on a single line. Same goes for \t for tabs.
I've used MS Word and Outlook to create some files that are intentionally in RTF format (to check whether the RTF I have is a side effect of the "Outlook Plain Text is using RTF" thing), and I am seeing the same behaviour in these RTF files. Attachment: RTF samples.zip
Looking at some other approaches to this problem, this JS based RTF to HTML library seems to be using <div> tags, following a parse / render approach where I assume they treat \pard ... \par or \par ... \par as matching start and end controls for the div. They also seem to have deliberately kept \r\n characters, possibly in order to preserve the readable source code. I'm not sure about the \par in the metadata, will need to look into that a bit more, but the library seems to handle all the examples including your original tests and the MS files.
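To make the parse / render idea concrete, here is a minimal Java sketch of treating each \par as the end of a paragraph, regardless of whether a \pard preceded it. It is not the code of any library mentioned in this thread: the class name ParSketch, the list of skipped destinations, and the Latin-1 handling of \'hh escapes are all assumptions made for illustration. It ignores formatting, \u escapes and encapsulated HTML entirely, and only tries to avoid the "\par in metadata" problem by skipping known metadata groups and ignorable {\* ...} destinations.

```java
import java.util.Set;

final class ParSketch {
    // Destinations holding metadata rather than body text (hypothetical, deliberately incomplete list).
    private static final Set<String> SKIPPED = Set.of("fonttbl", "colortbl", "stylesheet", "info");

    static String toParagraphs(String rtf) {
        StringBuilder html = new StringBuilder();
        StringBuilder para = new StringBuilder();
        int depth = 0;
        int skipDepth = -1; // group depth where a skipped destination began, -1 if none
        int i = 0;
        while (i < rtf.length()) {
            char c = rtf.charAt(i);
            if (c == '{') {
                depth++;
                i++;
            } else if (c == '}') {
                if (depth == skipDepth) skipDepth = -1; // leaving the skipped group
                depth--;
                i++;
            } else if (c == '\\') {
                i++;
                int start = i;
                while (i < rtf.length() && isAsciiLetter(rtf.charAt(i))) i++;
                String word = rtf.substring(start, i);
                if (word.isEmpty()) { // control symbol such as \\ \{ \} \* or \'hh
                    if (i >= rtf.length()) break;
                    char sym = rtf.charAt(i++);
                    if (skipDepth >= 0) continue;
                    if (sym == '*') {
                        skipDepth = depth; // ignorable destination: skip the whole group
                    } else if (sym == '\'' && i + 2 <= rtf.length()) {
                        // Hex escape, decoded as Latin-1 here (a simplification).
                        para.append((char) Integer.parseInt(rtf.substring(i, i + 2), 16));
                        i += 2;
                    } else if (sym == '\\' || sym == '{' || sym == '}') {
                        para.append(sym);
                    }
                    continue;
                }
                // Optional numeric parameter, then one delimiter space that belongs to the control word.
                if (i + 1 < rtf.length() && rtf.charAt(i) == '-' && Character.isDigit(rtf.charAt(i + 1))) i++;
                while (i < rtf.length() && Character.isDigit(rtf.charAt(i))) i++;
                if (i < rtf.length() && rtf.charAt(i) == ' ') i++;
                if (skipDepth >= 0) continue;
                if (SKIPPED.contains(word)) {
                    skipDepth = depth;
                } else if (word.equals("par")) {
                    flush(html, para); // every \par closes the current paragraph
                }
            } else if (c == '\r' || c == '\n') {
                i++; // raw line breaks carry no meaning in RTF
            } else {
                if (skipDepth < 0) para.append(c);
                i++;
            }
        }
        if (!para.toString().trim().isEmpty()) flush(html, para); // trailing text without \par
        return html.toString();
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static void flush(StringBuilder html, StringBuilder para) {
        String text = para.toString().trim()
                .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        html.append("<p>").append(text).append("</p>\n");
        para.setLength(0);
    }

    public static void main(String[] args) {
        String rtf = "{\\rtf1\r\n{\\fonttbl {\\f0 Times New Roman;}}\r\n\\f0\\fs60 Hello, \\par World!\\par \r\n}";
        System.out.print(toParagraphs(rtf));
    }
}
```

For the minimal "Hello, World!" RTF used earlier in this thread, this prints one <p> element per \par. A lone \par (as in the Outlook sample) yields an empty <p></p>, which a real implementation might prefer to render as a blank line instead.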
Out of interest, where did the sample RTF in your test files come from? Also which RFC is the RFCCompliant converter based on?
Thanks, Andy
Out of interest, where did the sample RTF in your test files come from? Also which RFC is the RFCCompliant converter based on?
Most of the RTF's came from the Outlook messages when this project was still part of outlook-message-parser. The RFC compliant implementation was graciously provided by @fadeyev, which is why I paged him earlier.
I added a junit test to see how newlines are treated currently.
Result:
I tried looking into it some more, but I really can't make cheese out of it, as we say in Dutch. I think there is a fundamental step missing in the parser, where \par is treated differently based on the current nesting level or something. But it's not just \par: your other examples don't work at all. Not even the bold text in there...
In Britain this is "can't make head nor tail of it"!
Am I right that the fundamental step missing in the parser is what @fadeyev mentions in https://github.com/bbottema/outlook-message-parser/issues/16 ?
This is really starting to cause us some pain now, as when our customers upload MSG files they usually don't know if whoever sent them the email was using FORMAT TEXT tab -> Format section -> Rich Text. Currently our helpdesk is telling people to click Forward, switch the mail to HTML, then save the unsent forwarded mail as .MSG and upload, but this is four or five steps instead of drag and drop straight from Outlook. Often they just get garbled email import and complain. Is there any chance you could implement one of the following:
1) Merging kschroeer/rtf-html-java, as mentioned in https://github.com/bbottema/outlook-message-parser/issues/16#issue-509585834
2) Implementing the property mentioned in https://github.com/bbottema/simple-java-mail/issues/317#issuecomment-852723051 to allow us to use the Swing converter (I know this is sub-optimal, but in our experience we currently don't have a working converter at all: all our RTF comes from Outlook email and all of it breaks on conversion)
3) Implementing some way to identify whether an MSG file is RTF, either before converting it to an Email object or on the Email object itself, so we can choose to use getPlainText() rather than getHTMLText() on these emails and avoid the garbled formatting.
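As a rough illustration of option 3, the check could be done on the raw RTF body itself. The Outlook sample earlier in this thread starts with \fromtext, which marks RTF that encapsulates a plain-text message; encapsulated HTML is marked with \fromhtml1 instead. The sketch below classifies an RTF string on that basis. The class and enum names are hypothetical, and how the raw RTF is obtained from the MSG file is left open because the thread does not show an API for it.

```java
final class RtfOriginSketch {
    enum Origin { PLAIN_TEXT, HTML, NATIVE_RTF }

    // Classifies an RTF body by the encapsulation marker in its header.
    static Origin detect(String rtf) {
        // The marker sits in the header, so a bounded prefix is enough (512 is an arbitrary choice).
        String head = rtf.substring(0, Math.min(rtf.length(), 512));
        if (head.contains("\\fromhtml")) return Origin.HTML;
        if (head.contains("\\fromtext")) return Origin.PLAIN_TEXT;
        return Origin.NATIVE_RTF; // authored as rich text, no encapsulated original
    }

    public static void main(String[] args) {
        String outlookPlain = "{\\rtf1\\ansi\\ansicpg1252\\fromtext \\fbidis \\deff0{\\fonttbl\r\n}";
        System.out.println(detect(outlookPlain));
    }
}
```

A caller could then fall back to the plain-text body when the result is PLAIN_TEXT and only run an RTF to HTML converter for the other two cases.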
Thanks very much for your help on this.
Andy
Any progress or plans on this issue?
No, and I don't think that will change any time soon.
The following RTF doesn't convert newlines properly to paragraphs. Only the legacy JEditorPane converter (which shouldn't be used for a whole slew of other reasons) recognizes these and produces HTML paragraphs properly.
It looks like the \par control codes are not parsed properly. I'm surprised by this problem, since we already have lots of test cases with newlines, which are all fine.