Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

Character reference replacement results in raw HTML #383

Open kevinoid opened 2 years ago

kevinoid commented 2 years ago

As a result of #109, character and entity references are unconditionally dereferenced. Consequently, html2text 2017.10.4 and later converts HTML whose character references encode HTML-like text into Markdown that contains raw HTML:

$ echo "<p>Horizontal rule is &lt;hr&gt;</p>" | html2markdown
Horizontal rule is <hr>

To make the problem clearer, consider round-tripping from HTML to Markdown back to HTML:

$ echo "<p>Horizontal rule is &lt;hr&gt;</p>" | html2markdown | cmark
<p>Horizontal rule is <!-- raw HTML omitted --></p>

$ echo "<p>Horizontal rule is &lt;hr&gt;</p>" | html2markdown | cmark --unsafe
<p>Horizontal rule is <hr></p>

The conversion to Markdown changes the meaning of the content: text that was escaped in the source HTML becomes markup once the character references are dereferenced.

To satisfy the request in #109, I suggest preserving character and entity references which would be interpreted as Raw HTML if dereferenced. That would avoid producing unnecessary character references (as requested in #109) and also avoid changing the meaning of the content when it contains HTML-like text.
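As a rough illustration of that suggestion, here is a minimal sketch that uses only the Python standard library. It is not html2text's actual code: the helper name `deref_preserving_html_like` and the regular expression are hypothetical. It assumes that a `<` can only begin CommonMark raw HTML (or an autolink) when it is immediately followed by an ASCII letter, `/`, `!`, or `?`. The check is deliberately conservative, so it may keep a few references that would have been harmless to dereference.

```python
import html
import re

# Hypothetical pattern (not part of html2text): a "<" that could open a tag,
# closing tag, comment, declaration, CDATA section, processing instruction,
# or autolink once it is dereferenced.
_RAW_HTML_START = re.compile(r"<(?=[A-Za-z/!?])")


def deref_preserving_html_like(text: str) -> str:
    """Dereference character references, but keep "<" as "&lt;" wherever
    the result could otherwise be parsed as raw HTML."""
    unescaped = html.unescape(text)
    return _RAW_HTML_START.sub("&lt;", unescaped)


# HTML-like text stays escaped, so it cannot turn into markup.
print(deref_preserving_html_like("Horizontal rule is &lt;hr&gt;"))
# prints: Horizontal rule is &lt;hr>

# Harmless references are still dereferenced, as requested in #109.
print(deref_preserving_html_like("1 &lt; 2 &amp; caf&eacute;"))
# prints: 1 < 2 & café
```

The sketch only handles `<`. A fuller fix would probably also need to consider `&`, since dereferencing `&amp;lt;` yields `&lt;`, which a Markdown renderer would then treat as a character reference.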

Thanks for considering,
Kevin