A1ex2015 / feedparser

Automatically exported from code.google.com/p/feedparser

_RelativeURLResolver passes encoded URLs to _urljoin #407

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I use feedparser in my rawdog feed aggregator. This bug's present in 5.1.3 and 
in the latest Git HEAD.

One of rawdog's users spotted that feedparser was mangling the links in the 
HTML in news.ycombinator.com's RSS feed (https://news.ycombinator.com/rss). 
This feed has the unusual property that all slashes in URLs are escaped as 
&#x2F; -- so its HTML includes things like:

<a href="https:&#x2F;&#x2F;news.ycombinator.com&#x2F;...">

This is perfectly legit according to the HTML spec, but it confuses 
feedparser's _RelativeURLResolver, which passes URL-containing attributes down 
to _urljoin without removing character/entity references first. This ends up at 
urlparse.urljoin, which isn't expecting to find &#x2F; in its URLs, and winds 
up getting thoroughly confused as a result. The bug is usually harmless because 
the bits of the URL getting rewritten aren't usually encoded, but in this case 
it's definitely broken.
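
A minimal sketch of the failure, outside feedparser. It uses Python 3's urllib.parse.urljoin rather than the Python 2 urlparse.urljoin named above, and the example.org base URL and href are made up for illustration, not taken from the feed:

```python
from html import unescape
from urllib.parse import urljoin

# Hypothetical values for illustration only (not from the real feed).
base = "https://example.org/feed.rss"
raw_href = "https:&#x2F;&#x2F;example.org&#x2F;page"

# Joining the still-encoded attribute value: urljoin sees no "//" after
# the scheme and treats the "#" in "&#x2F;" as the start of a fragment.
joined_raw = urljoin(base, raw_href)

# Joining after decoding the character references first.
joined_decoded = urljoin(base, unescape(raw_href))

print(joined_raw)
print(joined_decoded)
```

The second result is the plain absolute URL; the first is not, because the encoded href was never recognised as an absolute URL.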

I think the fix would be to make _RelativeURLResolver decode entities before 
normalising URLs, then re-encode the normalised version.
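
Roughly what I have in mind, as a standalone sketch rather than a patch. resolve_attr_url is a hypothetical helper, not a feedparser function, and it uses Python 3's html and urllib.parse modules instead of feedparser's own _urljoin:

```python
from html import escape, unescape
from urllib.parse import urljoin


def resolve_attr_url(base, attr_value):
    """Hypothetical helper: resolve a URL taken from an HTML attribute.

    attr_value is the attribute text as it appears in the markup, so it
    may still contain character or entity references.
    """
    decoded = unescape(attr_value)      # "&#x2F;" -> "/", "&amp;" -> "&"
    joined = urljoin(base, decoded)     # normalise against the base URL
    return escape(joined, quote=True)   # re-encode for use in an attribute


# Hypothetical inputs for illustration only.
base = "https://example.org/feed.rss"
absolute = resolve_attr_url(base, "https:&#x2F;&#x2F;example.org&#x2F;page")
relative = resolve_attr_url(base, "page?a=1&amp;b=2")
print(absolute)
print(relative)
```

Note that re-encoding with escape only restores the characters that must be escaped in an attribute (such as &), so the output won't reproduce harmless references like &#x2F; from the input; that seems fine since the decoded URL is the same.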

The attached file is a trimmed-down example of this -- note you have to serve 
it from somewhere that'll trigger the normalisation to see the bug (i.e. 
feedparser.parse('weirdlink.rss') won't show it). It also includes a second 
example with an entity rather than character reference, which similarly doesn't 
get stripped before _urljoining.

Original issue reported on code.google.com by ats-goog...@offog.org on 17 Jun 2013 at 10:16

Attachments: