Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0
Too much escaping breaks URLs containing parentheses #322

Open mborsetti opened 4 years ago

mborsetti commented 4 years ago

Steps to reproduce:

  1. Start with an HTML <a> tag whose href contains parentheses, e.g. one pointing to https://www.sample.com/?url-with-(parenthesized-text)-)-[and-brackets]
  2. Pass it through html2text (see code below)
  3. Email the resulting string to yourself; in our example it is: [](https://www.sample.com/?url-with-\(parenthesized-text\)-\)-\[and-brackets\])
  4. Open the email in a modern system (Gmail in my case)
  5. Click the URL in the email: it now points to a different resource than the original one, in our example https://www.sample.com/?url-with-\(parenthesized-text\)-\)-\[and-brackets] (notice the extra backslashes)

Potential solutions:

  1. In __init__.py, change line 459 from self.o("]({url}{title})".format(url=escape_md(url), title=title)) to self.o("]({url}{title})".format(url=url, title=title)). I don't know the Markdown spec well enough, but in the few Markdown readers I tried, leaving the URL unescaped doesn't seem to break anything, even with the stray extra ")".
  2. Add a switch to suppress Markdown escaping in URLs (new use case, slower code)
  3. Others?
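
To make options 1 and 2 concrete, here is a minimal standalone sketch. It does not import html2text: escape_md_sketch is a hypothetical stand-in that I assume behaves like the library's escape_md (backslash-escaping \ [ ] ( )), and link_tail and its escape_url flag are made-up names used only for illustration, not existing html2text API.

```python
import re

# Hypothetical stand-in for the library's escaping step (an assumption,
# not html2text's actual code): backslash-escape the characters \ [ ] ( ).
_MD_CHARS = re.compile(r"([\\\[\]\(\)])")


def escape_md_sketch(text: str) -> str:
    """Prefix every Markdown-sensitive character with a backslash."""
    return _MD_CHARS.sub(r"\\\1", text)


def link_tail(url: str, title: str = "", escape_url: bool = True) -> str:
    """Build the '](url title)' tail of a Markdown link.

    escape_url=True mimics the current behaviour described in this issue;
    escape_url=False is the hypothetical switch from option 2 (and, if made
    unconditional, the change proposed in option 1).
    """
    if escape_url:
        url = escape_md_sketch(url)
    return "]({url}{title})".format(url=url, title=title)


url = "https://www.sample.com/?url-with-(parenthesized-text)-)-[and-brackets]"
print(link_tail(url))                    # URL with backslash escapes
print(link_tail(url, escape_url=False))  # URL left untouched
```

With the flag off, the URL is emitted verbatim, so any parser that does not balance parentheses inside link destinations may still cut the link short at the stray ")".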

Any tips/feedback?


Code:

import html2text
import sys

print(f'{sys.version=}')
print(f'{html2text.__version__=}\n')

html = ('<html>\n<head>\n</head>\n<body>\n'
        '<a href="https://www.sample.com/?url-with-(parenthesized-text)-)-[and-brackets]"></a>\n'
        '</body></html>')
parser = html2text.HTML2Text()
parser.body_width = 0
print(parser.handle(html))

Output:

sys.version='3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 23:03:10) [MSC v.1916 64 bit (AMD64)]'
html2text.__version__=(2020, 1, 16)

[](https://www.sample.com/?url-with-\(parenthesized-text\)-\)-\[and-brackets\])
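
A possible workaround that needs no change to the library, sketched under assumptions: percent-encode the parentheses and brackets in each href before the HTML is converted, so that the URL no longer contains any character that would be backslash-escaped. The helper name encode_md_sensitive is hypothetical, and rewriting the href attributes in the HTML itself is left out. Note that servers usually, but not necessarily, treat %28 and ( as the same resource.

```python
from urllib.parse import quote, unquote

# Characters that stay literal. Parentheses and square brackets are left out
# on purpose so they get percent-encoded; "%" is kept so that escapes already
# present in the URL are not encoded a second time.
_SAFE = ":/?#@!$&'*+,;=%"


def encode_md_sensitive(url: str) -> str:
    """Percent-encode ( ) [ ] (and other unsafe characters) in a URL."""
    return quote(url, safe=_SAFE)


url = "https://www.sample.com/?url-with-(parenthesized-text)-)-[and-brackets]"
encoded = encode_md_sensitive(url)
print(encoded)
print(unquote(encoded) == url)  # decoding restores the original URL
```

Because the encoded URL contains none of the characters \ [ ] ( ), escaping it for Markdown should leave it unchanged.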