Closed jdufresne closed 5 years ago
This still seems to be happening with release 2020.1.16. My colleague @chhabrakadabra narrowed it down to this test case:
import html2text
sus_html = '<abbr title="Abbr title">Abbr</abbr>'
ht = html2text.HTML2Text()
ht.ignore_links = True
ht.ignore_images = True
ht.body_width = 0
print("Handling sus_html")
print(ht.handle(sus_html))
print("Handling normal html")
print(ht.handle("<p>hello</p>"))
The output is:
Handling sus_html
Abbr
*[Abbr]: Abbr title
Handling normal html
hello
*[Abbr]: Abbr title
Note that the second call to handle() leaks the expansion of the abbreviation that was passed in the first call. Calling ht.finish() does not reset the internal abbreviations (and I can see that ht.handle() already calls finish()).
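Until the state is reset between calls, one possible workaround is to stop reusing the converter and build a fresh one for every document. The sketch below assumes that a newly constructed HTML2Text instance carries no abbreviations from earlier input. The helper names make_converter and convert_all are hypothetical and not part of html2text; the settings are copied from the test case above.

```python
def make_converter():
    """Build a fresh converter with the settings from the test case above."""
    # Third-party import kept local so convert_all() can also be used
    # with any other factory that returns an object with a handle() method.
    import html2text

    ht = html2text.HTML2Text()
    ht.ignore_links = True
    ht.ignore_images = True
    ht.body_width = 0
    return ht


def convert_all(documents, factory=make_converter):
    """Convert each HTML string with its own converter instance.

    No instance is reused, so state collected while converting one
    document cannot show up in the output of the next one.
    """
    return [factory().handle(doc) for doc in documents]
```

Calling convert_all(['<abbr title="Abbr title">Abbr</abbr>', "<p>hello</p>"]) then converts the two inputs independently, at the cost of constructing one converter per document.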
Previously, after the first abbr tag, the abbr_data attribute continued to grow when processing additional data. Now set the attribute to None to avoid appending the unused data.
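The pattern described here can be illustrated with a toy parser built on the standard library's html.parser. This is a minimal sketch, not html2text's actual code: the class AbbrCollector is hypothetical, and only the attribute name abbr_data is taken from the description above. The buffer is None outside an abbr tag, so text that follows the closing tag is never appended to it.

```python
from html.parser import HTMLParser


class AbbrCollector(HTMLParser):
    """Toy parser that records <abbr> text and its title attribute."""

    def __init__(self):
        super().__init__()
        self.abbr_title = None
        self.abbr_data = None  # None means "not inside an <abbr> tag"
        self.abbr_list = {}

    def handle_starttag(self, tag, attrs):
        if tag == "abbr":
            self.abbr_title = dict(attrs).get("title")
            self.abbr_data = ""  # start buffering the abbreviation text

    def handle_data(self, data):
        # Buffer text only while an <abbr> tag is open.
        if self.abbr_data is not None:
            self.abbr_data += data

    def handle_endtag(self, tag):
        if tag == "abbr":
            if self.abbr_title is not None:
                self.abbr_list[self.abbr_data] = self.abbr_title
            self.abbr_title = None
            # Reset to None so later text is not appended to the buffer.
            self.abbr_data = None


collector = AbbrCollector()
collector.feed('<abbr title="Abbr title">Abbr</abbr><p>hello world</p>')
collector.close()
print(collector.abbr_list)
print(collector.abbr_data)
```

Without the reset in handle_endtag, the buffer would keep accumulating "hello world" and every later text node even though nothing reads it again.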