aaronsw / html2text

Convert HTML to Markdown-formatted text.
http://www.aaronsw.com/2002/html2text/
GNU General Public License v3.0

Fix: Escaping parentheses in URLs to prevent link and image syntax from breaking #44

dreikanter closed 11 years ago

dreikanter commented 11 years ago

I found another bug: URLs containing unencoded brackets or parentheses can break Markdown link and image syntax when INLINE_LINKS = True. For example, the following code:

<a href="http://example.com/;-)/page.html">...</a>

will be transformed to

[...](http://example.com/;-)/page.html)

which gives the following result when rendered as Markdown (tested with python-markdown and dillinger.io):

<a href="http://example.com/;-">...</a>/page.html)

The same problem can appear with image source values. This case is not very common, but some sites (e.g. MSDN) actively use parentheses in URLs.

I've fixed this by escaping parentheses and brackets, so that the converted Markdown renders to correct HTML:

 [...](http://example.com/;-\)/page.html)

This patch includes the bug fix and a new test case for URL escaping.
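
The idea can be sketched roughly as follows. This is only an illustration, not the code from the patch: `escape_md_url` and `inline_link` are hypothetical names, and the set of escaped characters (brackets and parentheses) is an assumption based on the description above.

```python
import re

# Hypothetical helper (not the actual html2text code): match the characters
# that can terminate or confuse the target of an inline Markdown link.
_URL_SPECIALS = re.compile(r"([\[\]()])")


def escape_md_url(url):
    # Prefix each matched character with a backslash.
    return _URL_SPECIALS.sub(r"\\\1", url)


def inline_link(text, url):
    # Build an inline link with the URL escaped.
    return "[%s](%s)" % (text, escape_md_url(url))


print(inline_link("...", "http://example.com/;-)/page.html"))
```

With the example URL above, this prints the escaped form of the link, so the first closing parenthesis no longer ends the link target.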

aaronsw commented 11 years ago

I feel like I must be missing something, but reading the diff it seems like the latest version calls a version of md_escape that doesn't do any escaping. Furthermore, the special characters in the tests aren't escaped either (which is presumably why they passed).

>>> import re
>>> md_chars_matcher = re.compile(r"([\\`\*_{}\[\]\(\)#\+-\.!])")
>>> md_chars_matcher.sub(r"\1", '(x)f')
'(x)f'

dreikanter commented 11 years ago

I forgot to push my local changes before opening the pull request, sorry. It should be OK now.
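
For reference, a minimal sketch of why the substitution in the session above changes nothing, assuming the same `md_chars_matcher` pattern. The replacement `r"\1"` puts each matched character back unchanged, while `r"\\\1"` prefixes it with a backslash. Whether the pushed fix uses exactly this replacement string is an assumption.

```python
import re

# Same pattern as in the interactive session above.
md_chars_matcher = re.compile(r"([\\`\*_{}\[\]\(\)#\+-\.!])")

# r"\1" re-inserts the matched character as-is, so nothing is escaped.
noop = md_chars_matcher.sub(r"\1", "(x)f")

# r"\\\1" emits a literal backslash followed by the matched character.
escaped = md_chars_matcher.sub(r"\\\1", "(x)f")

print(noop)     # (x)f
print(escaped)  # \(x\)f
```

Note that inside the character class `\+-\.` is read as a range from `+` to `.`, so the pattern also matches a comma.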

aaronsw commented 11 years ago

Looks good, thanks!