bbottema / outlook-message-parser

A Java parser for Outlook messages (.msg files)

[Bug] Parsing lists to HTML has double bullet points #64

Closed piu130 closed 3 months ago

piu130 commented 1 year ago

When parsing lists in .msg files, the resulting HTML (.getHTMLText()) contains both the list tag and the bullet character as literal text. The browser therefore renders two bullet points: one for the list tag and a second one from the text.

Outlook: [image]
Browser: [image]

HTML:

<p class=MsoNormal>Stroke:<o:p></o:p></p>
<ul style='margin-top:0cm' type=disc>
  <li class=MsoListParagraph style='margin-left:0cm;mso-list:l1 level1 lfo1'>*  Stroke entry 1<o:p></o:p></li>
  <li class=MsoListParagraph style='margin-left:0cm;mso-list:l1 level1 lfo1'>*  Stroke entry 2<o:p></o:p></li>
</ul>
<p class=MsoNormal>Bullet:<o:p></o:p></p>
<ul style='margin-top:0cm' type=disc>
  <li class=MsoListParagraph style='margin-left:0cm;mso-list:l0 level1 lfo2'>*  Bullet entry 1<o:p></o:p></li>
  <li class=MsoListParagraph style='margin-left:0cm;mso-list:l0 level1 lfo2'>*  Bullet entry 2<o:p></o:p></li>
</ul>
<p class=MsoNormal>Number:<o:p></o:p></p>
<ol style='margin-top:0cm' start=1 type=1>
  <li class=MsoListParagraph style='margin-left:0cm;mso-list:l2 level1 lfo3'>1. Number entry 1<o:p></o:p></li>
  <li class=MsoListParagraph style='margin-left:0cm;mso-list:l2 level1 lfo3'>2. Number entry 2<o:p></o:p></li>
</ol>

We should remove the leading * and 1. markers (and the tab?) from the HTML text. Otherwise, we can fix this on our side, either by replacing type=disc with type=none or by removing the first one or two characters of each list item.
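The second client-side option (dropping the literal marker from each list item) could be sketched as follows. This is only a minimal illustration using java.util.regex. It assumes the marker always appears as plain text directly after the opening li tag, as in the sample above. ListMarkerFix and stripTextMarkers are hypothetical names, not part of outlook-message-parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of outlook-message-parser.
final class ListMarkerFix {

    // Group 1 captures the opening <li ...> tag; the remainder matches a
    // literal "*" or "<digits>." marker plus the whitespace that follows it.
    // Assumption: the marker is plain text immediately after the tag.
    private static final Pattern MARKER = Pattern.compile(
            "(<li\\b[^>]*>)\\s*(?:\\*|\\d+\\.)\\s+",
            Pattern.CASE_INSENSITIVE);

    // Returns the HTML with the literal marker removed from every list item.
    static String stripTextMarkers(String html) {
        return MARKER.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        String html = "<ul type=disc><li class=MsoListParagraph>*  Bullet entry 1<o:p></o:p></li></ul>";
        // prints <ul type=disc><li class=MsoListParagraph>Bullet entry 1<o:p></o:p></li></ul>
        System.out.println(stripTextMarkers(html));
    }
}
```

Note that a list item whose real text starts with something like "2. " would also be trimmed, so this is a stopgap rather than a proper fix in the parser.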

What do you think?

piu130 commented 1 year ago

I have not found an easy solution so far. Our current workaround (using org.jsoup:jsoup:1.15.4, simplified):

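// requires: import java.util.regex.Pattern;
// Compile once: a leading "*" or "1." marker, a space, then the real text.
var marker = Pattern.compile("^(\\*|\\d+\\.) (?<text>.*)");

// mailBody is an org.jsoup.nodes.Document
mailBody.select("ul li,ol li").forEach(li -> {
    // li.text() returns whitespace-normalized text, so "*  x" becomes "* x"
    var matcher = marker.matcher(li.text());
    if (matcher.find()) {
        // Caution: this replaces all child nodes of the <li> with plain text
        li.text(matcher.group("text"));
    }
});

// forEach returns void, so serialize the modified document separately
var fixedMailBody = mailBody.html();

bbottema commented 3 months ago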

This bug should be fixed now in v1.14.1.