Closed StianTorjussen closed 5 years ago
This is very strange! What appears to be happening is that when we call soup.findAll(text=url_re), BeautifulSoup returns the URLs it finds but converts the entities at the same time. Looking at the soup variable in the parse_html() function, the entities are still intact prior to the call to findAll():
140 soup = soup_class(html, **bs_kwargs)
141
142 for url in soup.findAll(text=re.compile(url_re)):
--> 143 if not _inside_skip(url):
144 if _is_standalone(url):
145 url_handler = handler
146 else:
147 url_handler = block_handler
148
ipdb> soup
<p>http://google.com &lt;script&gt;&lt;/script&gt;</p>
But if I ask what "url" is (recall that url is what BeautifulSoup matched using the url_re):
ipdb> url
'http://google.com <script></script>'
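The mismatch can be reproduced with current BeautifulSoup 4 (where the `string=` keyword replaces the older `text=` argument); this is a minimal sketch, not bleach's actual parse_html code:

```python
import re
from bs4 import BeautifulSoup

url_re = re.compile(r"https?://\S+")
soup = BeautifulSoup(
    "<p>http://google.com &lt;script&gt;&lt;/script&gt;</p>", "html.parser"
)

# Serializing the soup re-encodes special characters, so the entities
# still *look* intact:
print(soup)  # <p>http://google.com &lt;script&gt;&lt;/script&gt;</p>

# But the matched NavigableString is the already-decoded text node:
[match] = soup.find_all(string=url_re)
print(repr(str(match)))  # 'http://google.com <script></script>'
```

So the entities are decoded at parse time; the soup's repr only re-encodes them on output.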
I honestly have no idea on this one.
Ahh, OK, so when you do a findAll it iterates over the soup's descendants, which reveals where the conversion is happening:
In [13]: s = BeautifulSoup('&lt;foo&gt;')
In [14]: list(s.descendants)
Out[14]: ['<foo>']
In [15]: s
Out[15]: &lt;foo&gt;
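The same split shows up with current BeautifulSoup 4 (the transcript above uses the older BeautifulSoup 3 API); a small sketch:

```python
from bs4 import BeautifulSoup

s = BeautifulSoup("&lt;foo&gt;", "html.parser")

# Iterating over descendants yields the decoded text node...
print(list(s.descendants))  # ['<foo>']

# ...while serializing re-encodes it, so the soup still looks encoded:
print(str(s))  # &lt;foo&gt;
```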
Better patch / test in 3404ab1. I think this can be marked resolved, but please let me know if you are still encountering issues.
If you have encoded HTML characters like &lt; and &gt; inside the same HTML tag as an untagged link, parse_html will decode the encoded characters instead of skipping them. This is inconsistent with the behavior when the encoded character is not inside the same tag as the untagged link, or when the link is already tagged.
Encoded characters next to an untagged link:
Output:
Here the encoded characters are decoded.
Encoded characters next to a tagged link:
Output:
Here the encoded characters are not decoded.
Encoded characters alone:
Output:
Here the encoded characters are not decoded.
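For illustration, an entity-preserving linkifier can operate on the serialized markup and substitute only the URL matches, leaving everything else (including entities) untouched. This is a sketch under a deliberately simplified URL pattern, not bleach's parse_html or its real url_re:

```python
import re

# Simplified URL pattern for illustration; the real url_re in bleach is
# far more involved. The excluded characters stop the match at whitespace
# and at encoded entities.
URL_RE = re.compile(r'https?://[^\s<>&"]+')

def linkify_text(fragment):
    """Wrap bare URLs in <a> tags without touching any other characters,
    so pre-encoded entities like &lt; survive unchanged."""
    def repl(match):
        url = match.group(0)
        return '<a href="%s">%s</a>' % (url, url)
    return URL_RE.sub(repl, fragment)

print(linkify_text("http://google.com &lt;script&gt;&lt;/script&gt;"))
# <a href="http://google.com">http://google.com</a> &lt;script&gt;&lt;/script&gt;
```

Because the substitution never decodes or re-serializes the surrounding text, all three cases above would behave consistently.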
Environment:
What is the intended behavior for parse_html?