Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

Fix memory leak when processing an abbr tag #284

Closed jdufresne closed 5 years ago

jdufresne commented 5 years ago

Previously, after the first abbr tag, the abbr_data attribute continued to grow when processing additional data. Now set the attribute to None to avoid appending the unused data.
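The growth described here can be reproduced with a small standalone model. The sketch below is not html2text's code: `AbbrCollector`, `reset_on_close`, and `leftover_after` are hypothetical names, and it only assumes that text is appended to `abbr_data` whenever that attribute is not None. It is built on the standard library's `html.parser.HTMLParser`.

```python
from html.parser import HTMLParser


class AbbrCollector(HTMLParser):
    """Toy model of abbr bookkeeping (hypothetical, not html2text's code)."""

    def __init__(self, reset_on_close):
        super().__init__()
        self.reset_on_close = reset_on_close
        self.abbr_title = None
        self.abbr_data = None  # None means "not collecting abbr text"
        self.abbr_list = {}

    def handle_starttag(self, tag, attrs):
        if tag == "abbr":
            self.abbr_title = dict(attrs).get("title")
            self.abbr_data = ""

    def handle_endtag(self, tag):
        if tag == "abbr":
            if self.abbr_title is not None:
                self.abbr_list[self.abbr_data] = self.abbr_title
                self.abbr_title = None
            # With the reset, collection stops once the tag closes.
            # Without it, the buffer stays a string and keeps growing.
            self.abbr_data = None if self.reset_on_close else ""

    def handle_data(self, data):
        if self.abbr_data is not None:
            self.abbr_data += data


def leftover_after(html, reset_on_close):
    """Return whatever is left in abbr_data after parsing html."""
    parser = AbbrCollector(reset_on_close)
    parser.feed(html)
    parser.close()
    return parser.abbr_data


doc = '<abbr title="Title">A</abbr><p>' + "x" * 1000 + "</p>"
print(leftover_after(doc, reset_on_close=True))
print(len(leftover_after(doc, reset_on_close=False)))
```

In this model, the variant without the reset ends up holding every character that follows the closing tag, which is the kind of unbounded growth the description refers to.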

coveralls commented 5 years ago

Coverage remained the same at 97.943% when pulling e3535ec6af3dcd2049b5c6e9bbc9738318c713f0 on jdufresne:abbr into 30cf8823dbad7c45161861b70c0155f4427158c8 on Alir3z4:master.

efung commented 1 year ago

This still seems to be happening with release 2020.1.16. My colleague @chhabrakadabra narrowed it down to this test case:

import html2text

sus_html = '<abbr title="Abbr title">Abbr</abbr>'

ht = html2text.HTML2Text()
ht.ignore_links = True
ht.ignore_images = True
ht.body_width = 0

print("Handling sus_html")
print(ht.handle(sus_html))
print("Handling normal html")
print(ht.handle("<p>hello</p>"))

The output is:

Handling sus_html
Abbr
  *[Abbr]: Abbr title

Handling normal html

hello
  *[Abbr]: Abbr title

Note that the second call to handle() leaks the expansion of the abbreviation that was passed in the first call. Calling ht.finish() does not reset the internal abbreviations (and I can see that ht.handle() already calls finish()).
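One possible workaround, assuming the leaked state lives on the converter instance (as the output above suggests), is to build a fresh converter for every document. The sketch below is library-agnostic: `convert_isolated` and `LeakyConverter` are hypothetical names, and `LeakyConverter` is only a stand-in that mimics the symptom so the example runs without html2text installed.

```python
def convert_isolated(make_converter, documents):
    """Convert each document with its own converter instance.

    make_converter is a zero-argument callable returning an object with a
    handle(html) method. Nothing is shared between documents, so state
    left behind by one conversion cannot reach the next.
    """
    return [make_converter().handle(doc) for doc in documents]


class LeakyConverter:
    """Hypothetical stand-in that remembers input across handle() calls."""

    def __init__(self):
        self.seen = []

    def handle(self, html):
        self.seen.append(html)
        return " | ".join(self.seen)


# Reusing one instance: the second result contains the first input.
shared = LeakyConverter()
leaky = [shared.handle(doc) for doc in ["first", "second"]]

# One instance per document: results stay independent.
isolated = convert_isolated(LeakyConverter, ["first", "second"])

print(leaky)
print(isolated)
```

With html2text, make_converter could be a small function that creates html2text.HTML2Text(), sets ignore_links, ignore_images and body_width as in the test case, and returns the instance. This trades some per-call setup cost for isolation and does not address the underlying bug.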