coleifer / micawber

a small library for extracting rich content from urls
http://micawber.readthedocs.org/
MIT License

Possible unintended behavior with parse_html? #91

Closed StianTorjussen closed 5 years ago

StianTorjussen commented 5 years ago

If you have encoded HTML characters like < and > inside the same HTML tag as an untagged link, parse_html will decode the encoded characters instead of leaving them escaped. This is inconsistent with its behavior when the encoded characters are not inside the same tag as the untagged link, or when the link is already tagged.

Encoded characters next to an untagged link:

from micawber import ProviderRegistry
from micawber import parse_html
text = u'<p>http://www.google.com &lt;script&gt; alert("foo"); &lt;/script&gt;</p>'
parse_html(text, ProviderRegistry())

Output:

u'<p><a href="http://www.google.com">http://www.google.com</a> <script> alert("foo"); </script></p>'

Here the encoded characters are decoded.

Encoded characters next to a tagged link:

text = u'<p><a href="http://www.google.com">http://www.google.com</a> &lt;script&gt; alert("foo"); &lt;/script&gt;</p>'
parse_html(text, ProviderRegistry())

Output:

u'<p><a href="http://www.google.com">http://www.google.com</a> &lt;script&gt; alert("foo"); &lt;/script&gt;</p>'

Here the encoded characters are not decoded.

Encoded characters alone:

text = u'<p>&lt;script&gt; alert("foo"); &lt;/script&gt;</p>'
parse_html(text, ProviderRegistry())

Output:

u'<p>&lt;script&gt; alert("foo"); &lt;/script&gt;</p>'

Here the encoded characters are not decoded.

Environment:

python2.7
Package                       Version
----------------------------- -------
backports.functools-lru-cache 1.5
beautifulsoup4                4.8.1
micawber                      0.5.0
pip                           19.2.3
pkg-resources                 0.0.0
setuptools                    41.4.0
soupsieve                     1.9.4
wheel                         0.33.6

What is the intended behavior for parse_html?
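For context on why the first case is worrying: once the entities are decoded and the result is emitted as raw markup, the payload becomes live script. This can be illustrated with just the standard-library html module (not micawber's code, only the decode/re-encode round trip):

```python
from html import escape, unescape

encoded = '&lt;script&gt; alert("foo"); &lt;/script&gt;'

# Decoding turns the harmless entities into real markup, which a browser
# would execute if it were inserted into a page verbatim:
decoded = unescape(encoded)
print(decoded)  # <script> alert("foo"); </script>

# Re-escaping restores the safe, encoded form:
print(escape(decoded, quote=False))  # &lt;script&gt; alert("foo"); &lt;/script&gt;
```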

coleifer commented 5 years ago

This is very strange! What appears to be happening is that when we call soup.findAll(text=url_re), BeautifulSoup returns the URLs it finds but also converts the entities at the same time.

Looking at the soup variable in the parse_html() function, it appears that the entities are still intact prior to the call to findAll():

    140     soup = soup_class(html, **bs_kwargs)
    141 
    142     for url in soup.findAll(text=re.compile(url_re)):
--> 143         if not _inside_skip(url):
    144             if _is_standalone(url):
    145                 url_handler = handler
    146             else:
    147                 url_handler = block_handler
    148 

ipdb> soup
<p>http://google.com &lt;script&gt;&lt;/script&gt;</p>

But if I ask what "url" is (recall that url is what beautifulsoup matched using the url_re):

ipdb> url
'http://google.com <script></script>'

I honestly have no idea on this one.

coleifer commented 5 years ago

Ahh, OK, so when you do a findAll it iterates over the soup's descendants, which reveals where the conversion is happening:

In [13]: s = BeautifulSoup('&lt;foo&gt;')

In [14]: list(s.descendants)
Out[14]: ['<foo>']

In [15]: s
Out[15]: &lt;foo&gt;
coleifer commented 5 years ago

Better patch / test in 3404ab1. I think this can be marked resolved, but please let me know if you are still encountering issues.
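For readers who want the gist without reading the commit: the idea is to re-escape the decoded text around each URL match before substituting the replacement markup back in. A standalone sketch of that approach, using only the standard library (all names here are hypothetical, not micawber's actual API, and the real url_re is more involved than this regex):

```python
import re
from html import escape

# Simplified stand-in for micawber's URL regex; the capturing group makes
# re.split() return the matches interleaved with the surrounding text.
URL_RE = re.compile(r'(https?://\S+)')

def linkify_escaped(decoded_text):
    """Wrap URLs in anchor tags while re-escaping the surrounding text,
    so decoded entities like <script> go back out as &lt;script&gt;."""
    parts = URL_RE.split(decoded_text)
    out = []
    for i, part in enumerate(parts):
        if i % 2:  # odd indices are the captured URL matches
            out.append('<a href="%s">%s</a>' % (escape(part), escape(part)))
        else:      # even indices are plain text: re-escape before emitting
            out.append(escape(part, quote=False))
    return ''.join(out)

print(linkify_escaped('http://www.google.com <script> alert("foo"); </script>'))
# <a href="http://www.google.com">http://www.google.com</a> &lt;script&gt; alert("foo"); &lt;/script&gt;
```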