coleifer / micawber

a small library for extracting rich content from urls
http://micawber.readthedocs.org/
MIT License

parse_html() unescapes html entities #21

Closed ivirabyan closed 11 years ago

ivirabyan commented 11 years ago
>>> parse_html('&lt;img src="http://www.youtube.com/embed/_MbtUZSXQR4" /&gt;', ProviderRegistry(), urlize_all=False)
u'<img src="http://www.youtube.com/embed/_MbtUZSXQR4" />'

As you can see, I passed in escaped HTML and got unescaped HTML back. I traced this to the following line, which replaces the escaped text with unescaped text:

 url.replaceWith(BeautifulSoup(replacement))

If we don't wrap the replacement in BeautifulSoup, replaceWith() automatically escapes the string when it modifies the original soup. I really don't see why the replacement is wrapped in a BeautifulSoup call.
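
To illustrate the mechanism, here is a minimal sketch that uses only Python's standard `html.parser` and `html.escape`, not BeautifulSoup or micawber's actual code. The `TextCollector` class and `parse` helper are hypothetical names made up for this example. It shows that a parser decodes entities in text nodes, so feeding that decoded text back through a parser turns it into real markup, while escaping it on output preserves the original input.

```python
from html import escape
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: records tag names and text nodes seen while parsing."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def parse(markup):
    """Parse a string and return the collector holding what was found."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return parser


escaped = '&lt;img src="http://www.youtube.com/embed/_MbtUZSXQR4" /&gt;'

# First parse: the input contains no tags, only a text node whose
# entities are decoded by the parser.
first = parse(escaped)
text_node = "".join(first.text)

# Parsing the decoded text a second time (analogous to wrapping the
# replacement in a parser call) turns it into a real <img> tag.
reparsed = parse(text_node)

# Treating the decoded text as plain text and escaping it on output
# (analogous to inserting a plain string) restores the original input.
safe = escape(text_node, quote=False)

print(first.tags, reparsed.tags)
print(safe == escaped)
```

The design point is that the choice between "insert as text" and "insert as parsed markup" decides whether the escaping survives; which one is appropriate depends on whether the replacement is meant to be HTML.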

coleifer commented 11 years ago

I think I recall dealing with something similar in the past. See issues #14 and #15.