bbottema / rtf-to-html

RTF to HTML conversion done right

Bullet numbers in list items have double numbers #14

Closed by bbottema 3 months ago

bbottema commented 3 months ago

See https://github.com/bbottema/simple-java-mail/issues/530
See https://github.com/nickrussler/email-to-pdf-converter/issues/35
See https://github.com/bbottema/outlook-message-parser/issues/64

This is an issue in the RTF parser, which is beyond me at the moment. I have no fix.

nickrussler commented 3 months ago

"To provide compatibility with existing RTF readers, all applications that can automatically format paragraphs with bullets or numbers will also emit the generated text as plain text in the \pntext group. This will allow existing RTF readers to capture the plain text and safely ignore the autonumber instructions. This group precedes all bulleted or numbered paragraphs, and will contain all the text and formatting that would be auto-generated. It should precede the '{' \*\pn ... '}' destination, and it is the responsibility of RTF readers that understand the '{' \*\pn ... '}' destination to ignore the \pntext group." (https://www.biblioscape.com/rtf15_spec.htm)

This reads to me as saying that the {\pntext 2.\tab} group in, e.g.,

{\*\htmltag64 <li class=MsoListParagraph style='margin-left:0cm;mso-list:l0 level1 lfo1'>}\htmlrtf {{\*\pn\pnlvlbody\pndec\pnstart2\pnindent360{\pntxta .}}\htmlrtf0 \li360 \fi-360 {\pntext 2.\tab}Test2

{\*\htmltag244 <o:p>}

{\*\htmltag252 </o:p>}\htmlrtf\par}\htmlrtf0

{\*\htmltag72 </li>}

can be ignored or stripped, provided that the reader (here a browser, because of the conversion to HTML) can render numbered lists itself. I am not sure whether that depends on the HTML ol tag having its type attribute set.
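To make the idea concrete, here is a minimal Java sketch of stripping \pntext groups from an RTF string before conversion. This is not the code from rtf-to-html: the class name PnTextStripper and the method stripPnText are hypothetical, and the sketch assumes the input has balanced braces and that dropping the plain-text fallback is always what you want.

```java
/** Hypothetical helper, not part of rtf-to-html: drops {\pntext ...} groups from RTF. */
final class PnTextStripper {

    private static final String PNTEXT_OPEN = "{\\pntext";

    private PnTextStripper() {
    }

    /** Returns the RTF with every {\pntext ...} group removed, nested groups included. */
    static String stripPnText(String rtf) {
        StringBuilder out = new StringBuilder(rtf.length());
        int i = 0;
        while (i < rtf.length()) {
            char c = rtf.charAt(i);
            if (c == '\\' && i + 1 < rtf.length()) {
                // Copy the backslash together with the next character so that
                // escaped braces (\{ and \}) are never treated as group delimiters.
                out.append(c).append(rtf.charAt(i + 1));
                i += 2;
            } else if (c == '{' && startsPnTextGroup(rtf, i)) {
                // Skip the whole group, i.e. the duplicated "2.<tab>" fallback text.
                i = skipGroup(rtf, i);
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    /** True if the brace at bracePos opens a group whose first control word is exactly \pntext. */
    private static boolean startsPnTextGroup(String rtf, int bracePos) {
        if (!rtf.startsWith(PNTEXT_OPEN, bracePos)) {
            return false;
        }
        int end = bracePos + PNTEXT_OPEN.length();
        // A following letter would mean a longer, different control word.
        return end >= rtf.length() || !Character.isLetter(rtf.charAt(end));
    }

    /** Returns the index just past the brace that closes the group opened at bracePos. */
    private static int skipGroup(String rtf, int bracePos) {
        int depth = 0;
        int i = bracePos;
        while (i < rtf.length()) {
            char c = rtf.charAt(i);
            if (c == '\\') {
                i += 2; // skip escaped characters such as \{ and \}
                continue;
            }
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    return i + 1;
                }
            }
            i++;
        }
        return rtf.length(); // unbalanced input: drop the remainder
    }
}
```

Applied to the fragment above, PnTextStripper.stripPnText("\\li360 \\fi-360 {\\pntext 2.\\tab}Test2") leaves only the paragraph formatting and "Test2", so the browser's own list numbering is the only number shown. The {\pntxta .} group inside the \pn destination is untouched, because the sketch matches only the \pntext control word.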

bbottema commented 3 months ago

Well, that turned out to be an extremely simple fix. Thanks so much! Released in 1.1.1. I'll update outlook-message-parser shortly.

nickrussler commented 3 months ago

Awesome! Thanks for the fix.