Character reference replacement results in raw HTML

As a result of #109, character and entity references are now dereferenced unconditionally. In html2text 2017.10.4 and later, HTML containing character references that represent HTML-like text is therefore converted to Markdown containing raw HTML:

$ echo "<p>Horizontal rule is &lt;hr&gt;</p>" | html2markdown
Horizontal rule is <hr>

To make the problem clearer, consider round-tripping from HTML to Markdown back to HTML:

$ echo "<p>Horizontal rule is &lt;hr&gt;</p>" | html2markdown | cmark
<p>Horizontal rule is <!-- raw HTML omitted --></p>

$ echo "<p>Horizontal rule is &lt;hr&gt;</p>" | html2markdown | cmark --unsafe
<p>Horizontal rule is <hr></p>

The conversion to Markdown changes the meaning of the content: text that was escaped in the source HTML becomes markup once the character references are dereferenced.

To satisfy the request in #109, I suggest preserving only those character and entity references which would be interpreted as raw HTML if dereferenced. That would avoid producing unnecessary character references (as requested in #109) while also avoiding a change in the meaning of content that contains HTML-like text.
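
To illustrate the idea, here is a minimal sketch using only the Python standard library. It is not html2text's actual code: the function name `deref_safely` and the choice of which characters to keep escaped (`<` and `&`) are my assumptions, and a real fix would need to fit into the existing parser callbacks.

```python
import html
import re

# Matches named, decimal, and hexadecimal character references.
_REF = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

# Assumption: "<" could start raw HTML and "&" could start a new
# reference in the Markdown output, so references to these stay as-is.
_UNSAFE = {"<", "&"}


def deref_safely(text):
    """Dereference character references unless the result is unsafe.

    Hypothetical helper for illustration; not part of html2text.
    """
    def replace(match):
        ref = match.group(0)
        decoded = html.unescape(ref)
        # Keep the original reference when dereferencing it would
        # change how the Markdown output is interpreted.
        return ref if decoded in _UNSAFE else decoded

    return _REF.sub(replace, text)


print(deref_safely("Horizontal rule is &lt;hr&gt;"))
print(deref_safely("caf&eacute; &amp;lt;"))
```

With this sketch, the first call prints `Horizontal rule is &lt;hr>` and the second prints `café &amp;lt;`, so ordinary references such as `&eacute;` are still dereferenced while the HTML-like text remains escaped.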

Thanks for considering, Kevin

Alir3z4 / html2text

Character reference replacement results in raw HTML #383