I use feedparser in my rawdog feed aggregator. This bug's present in 5.1.3 and
in the latest Git HEAD.
One of rawdog's users spotted that feedparser was mangling the links in the
HTML in news.ycombinator.com's RSS feed (https://news.ycombinator.com/rss).
This feed has the unusual property that all slashes in URLs are escaped as
&#x2F; -- so its HTML includes things like:
<a href="https:&#x2F;&#x2F;news.ycombinator.com&#x2F;...">
This is perfectly legit according to the HTML spec, but it confuses
feedparser's _RelativeURLResolver, which passes URL-containing attributes down
to _urljoin without removing character/entity references first. This ends up at
urlparse.urljoin, which isn't expecting to find &#x2F; in its URLs, and winds
up getting thoroughly confused as a result. The bug is usually harmless because
the bits of the URL getting rewritten aren't usually encoded, but in this case
it's definitely broken.
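To illustrate where the confusion comes from, here's a minimal sketch using Python 3's urllib.parse (the equivalent of the Python 2 urlparse module mentioned above). The example.com URL is made up for illustration. The "#" inside the character reference is taken as the start of a fragment, and since no literal "//" follows the scheme, no host is recognised at all:

```python
from urllib.parse import urlsplit

# Hypothetical attribute value with every slash written as &#x2F;
escaped = "https:&#x2F;&#x2F;example.com&#x2F;item"

parts = urlsplit(escaped)

# The scheme is found, but there is no literal "//" after it,
# so the host never ends up in netloc.
print(repr(parts.scheme))
print(repr(parts.netloc))

# Everything after the first "#" is treated as the fragment,
# leaving only "&" as the path.
print(repr(parts.path))
print(repr(parts.fragment))
```

Joining something parsed like that against a base URL can't be expected to give a sensible result.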
I think the fix would be to make _RelativeURLResolver decode entities before
normalising URLs, then re-encode the normalised version.
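Something along these lines, as a sketch only: resolve_attr_url is a made-up name rather than an existing feedparser function, the URLs are hypothetical, and it's written against Python 3's html and urllib.parse modules rather than the ones feedparser actually uses.

```python
import html
from urllib.parse import urljoin


def resolve_attr_url(base, raw_value):
    """Resolve a URL taken from an HTML attribute against a base URL.

    Hypothetical helper: decode references, join, then re-encode.
    """
    # Turn character and entity references back into plain characters,
    # e.g. "&#x2F;" -> "/" and "&amp;" -> "&".
    decoded = html.unescape(raw_value)
    # Resolve against the base URL as usual.
    joined = urljoin(base, decoded)
    # Re-escape so the result is safe inside a quoted attribute value.
    return html.escape(joined, quote=True)


# Absolute URL with escaped slashes (character references).
absolute = resolve_attr_url(
    "https://example.com/rss",
    "https:&#x2F;&#x2F;example.com&#x2F;item",
)

# Relative URL containing both a character reference and an entity.
relative = resolve_attr_url(
    "https://example.com/feeds/rss",
    "a&#x2F;b?x=1&amp;y=2",
)

print(absolute)
print(relative)
```

Note that the re-encoded URL won't necessarily be byte-for-byte what the feed contained (the slashes come back unescaped), but it refers to the same resource.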
The attached file is a trimmed-down example of this -- note you have to serve
it from somewhere that'll trigger the normalisation to see the bug (i.e.
feedparser.parse('weirdlink.rss') won't show it). It also includes a second
example with an entity rather than character reference, which similarly doesn't
get stripped before _urljoining.
Original issue reported on code.google.com by ats-goog...@offog.org on 17 Jun 2013 at 10:16