unicode characters cause crash during relative uri resolution

GoogleCodeExporter commented 9 years ago

The attached feed breaks the parser. The issue seems to be due to an invalid 
character in the href attribute of one of the <a> tags. FeedValidator indicates 
that the feed is valid:

http://feedvalidator.org/check.cgi?url=http%3A//feeds.feedburner.com/planetpov/mfkm%3Fcat%3D-8%26cat%3D-12%26cat%3D-19%26cat%3D-91%26cat%3D-163

The expected result should be for feedparser to ignore the invalid characters 
rather than failing. 

This patch should solve the problem:

diff --git a/lib/feedparser.py b/lib/feedparser.py
index 93aac7d..0fff399 100644
--- a/lib/feedparser.py
+++ b/lib/feedparser.py
@@ -444,6 +444,8 @@ _urifixer = re.compile('^([A-Za-z][A-Za-z0-9+-.]*://)(/*)(.*?)')
 def _urljoin(base, uri):
     uri = _urifixer.sub(r'\1\3', uri)
     #try:
+    if not isinstance(uri, unicode):
+        uri = uri.decode('utf-8', 'ignore')
     uri = urlparse.urljoin(base, uri)
     if not isinstance(uri, unicode):
         return uri.decode('utf-8', 'ignore')

Original issue reported on code.google.com by wal...@ninua.com on 30 Aug 2011 at 5:59

Attachments:

[Valid but breaks feed parser - planetpov.com_feed.txt](https://storage.googleapis.com/google-code-attachments/feedparser/issue-303/comment-0/Valid but breaks feed parser - planetpov.com_feed.txt)

GoogleCodeExporter commented 9 years ago

I don't understand in what way feedparser is breaking. I tried using both the 
uploaded file and the source URL you provided above, and I tried it both with 
and without BeautifulSoup for microformat parsing, and I tried it on Python 2.5 
and 2.7. Could you elaborate?

Original comment by kurtmckee on 30 Aug 2011 at 7:06

GoogleCodeExporter commented 9 years ago

It breaks in that it raises an exception when you call the parse() method. The 
original feed seems to have been fixed (I contacted the user regarding the 
issue), but the feed I uploaded has the problem. I tested again after your 
comment by getting the latest version from svn and then downloading the test 
file from this issue page and ran the test and it failed again. Here is my 
stack trace. I'm running this on Python 2.5 using the app engine SDK:

Traceback (most recent call last):
  File "test/feed_parsing.py", line 76, in test_parsing
    result = PshbHandler.new_update(blog, content, "nbhub", "", True)
  File "/Users/waleed/nb/app/appengine/api/pshb.py", line 253, in new_update
    data = feedparser.parse(feed_content, response_headers=feed_headers)
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 3889, in parse
    saxparser.parse(source)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/expatreader.py", line 349, in end_element_ns
    self._cont_handler.endElementNS(pair, None)
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 1802, in endElementNS
    self.unknown_endtag(localname)
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 655, in unknown_endtag
    method()
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 1654, in _end_content
    value = self.popContent('content')
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 970, in popContent
    value = self.pop(tag)
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 873, in pop
    output = _resolveRelativeURIs(output, self.baseuri, self.encoding, self.contentparams.get('type', u'text/html'))
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 2513, in _resolveRelativeURIs
    p.feed(htmlSource)
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 1874, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 1854, in parse_starttag
    j = self.__parse_starttag(i)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/sgmllib.py", line 291, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/sgmllib.py", line 333, in finish_starttag
    self.unknown_starttag(tag, attrs)
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 2505, in unknown_starttag
    attrs = [(key, ((tag, key) in self.relative_uris) and self.resolveURI(value) or value) for key, value in attrs]
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 2501, in resolveURI
    return _makeSafeAbsoluteURI(_urljoin(self.baseuri, uri.strip()))
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 425, in _urljoin
    uri = urlparse.urljoin(base, uri)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urlparse.py", line 288, in urljoin
    return urlunparse((scheme, netloc, '/'.join(segments),
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

Original comment by wal...@ninua.com on 30 Aug 2011 at 8:57

GoogleCodeExporter commented 9 years ago

Okay, I remember running into this bug in urljoin back when I was finagling 
feedparser to work in Python 3. This is happening because '/'.join(segments) 
mixes byte strings with at least one unicode string, so Python has to coerce 
the byte strings to unicode. Python is stupidly decoding them as 'ascii' 
instead of 'utf-8'. However, I'm still not able to recreate this issue in 
Python 2.5.6.
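
For readers following along, here is a minimal sketch of the decode step that fails. It is not feedparser's code path: it only shows that decoding a UTF-8 byte string as ASCII produces the message seen in the traceback. The no-break space `u'\u00a0'` and the helper name `implicit_coercion_error` are assumptions chosen for illustration. The thread does not say which character the feed actually contained.

```python
# Sketch only: u'\u00a0' is an assumed stand-in for the offending character.
# Its UTF-8 encoding starts with byte 0xc2, like the byte in the traceback.
segment = u'\u00a0page.html'.encode('utf-8')


def implicit_coercion_error(raw):
    """Return the error text from an ASCII decode of raw, or None if it succeeds.

    Python 2 performs this decode implicitly when a byte string is combined
    with a unicode string; here it is done explicitly.
    """
    try:
        raw.decode('ascii')
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


message = implicit_coercion_error(segment)
print(message)
```

Decoding the same bytes as UTF-8 succeeds, which is why decoding explicitly before the join avoids the crash.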

Can you give me any other information about the Python version on AppEngine? 
What is the exact release they're using? Are there any published changes that 
they've made to Python that would introduce this difference in our two versions?

Original comment by kurtmckee on 30 Aug 2011 at 3:25

GoogleCodeExporter commented 9 years ago

I get the error both on App Engine, when I upload the app to 
*.appspot.com, and on my local App Engine SDK. Locally, I use Python 2.5.5. 
I'm not sure of the exact version on GAE, but I know it's 2.5.*.

Could it be that your default encoding is set to utf-8? Try this:

    import sys
    print sys.getdefaultencoding()

I get 'ascii' on both local and GAE. If yours returns utf-8, then that would 
explain why you don't see this problem. 

Now that I think about this more, I think my fix above is not accurate because 
I'm assuming 'utf-8'. Since utf-8 is probably the most popular encoding anyway, 
the fix will work most of the time and better than not having it. But the right 
way is to use whatever encoding the document is using. The same applies to the 
following line in the code which also uses utf-8 (this is the line I copied for 
my fix).
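
A rough sketch of that idea, written for Python 3 rather than the Python 2 code in the patch: decode byte strings with a caller-supplied encoding before joining, instead of hard-coding utf-8. The function name `join_uri`, its `encoding` parameter, and the example.com URLs are hypothetical and are not part of feedparser.

```python
from urllib.parse import urljoin


def join_uri(base, uri, encoding='utf-8'):
    """Join base and uri, decoding byte strings first.

    `encoding` is a hypothetical parameter standing in for whatever
    encoding the document declared. Undecodable bytes are dropped
    ('ignore'), mirroring the error handling used in the patch.
    """
    if isinstance(base, bytes):
        base = base.decode(encoding, 'ignore')
    if isinstance(uri, bytes):
        uri = uri.decode(encoding, 'ignore')
    return urljoin(base, uri)


base = 'http://example.com/feed/'
as_utf8 = join_uri(base, b'caf\xc3\xa9.html')                  # UTF-8 bytes
as_latin1 = join_uri(base, b'caf\xe9.html', encoding='latin-1')
assumed_utf8 = join_uri(base, b'caf\xe9.html')                 # wrong guess
```

With the right encoding both byte strings resolve to the same URI. When Latin-1 bytes are wrongly assumed to be UTF-8, the accented character is silently dropped, which is the inaccuracy described above.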

Original comment by wal...@ninua.com on 30 Aug 2011 at 6:27

GoogleCodeExporter commented 9 years ago

Ah, I'm able to recreate the problem now...but only if I pass the attachment 
URL above directly to feedparser.

If I save the file and then parse it, no crash.
If I save the file *and* pass in the original headers using the 
`response_headers` argument, no crash.
If I use pdb to wipe out the headers immediately after the URL's been 
requested, feedparser still crashes.

I love a good challenge! Most important to me is figuring out why I can't 
recreate the problem except over HTTP; without knowing that, I can't create an 
all-important test case. Unfortunately I probably won't have enough time 
tonight to figure that out.

I will tell you that I will not have any strings gallivanting about in the 
original encoding after parsing has begun; feedparser will internally use 
unicode objects exclusively (although sgmllib forces us to deal with 
utf8-encoded strings like chumps). That said, you should be guaranteed that 
strings are encoded in utf8 by the time they reach that point in the code. :)

Original comment by kurtmckee on 31 Aug 2011 at 6:01

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Shortly after posting that, I realized that I just needed an 
`xml:base` attribute to ensure that the right code path was followed. I should 
have time to bang out a test case and apply the patch in the next few days.

Original comment by kurtmckee on 31 Aug 2011 at 6:47

GoogleCodeExporter commented 9 years ago

Fixed in r577. Thanks for providing this patch!

Original comment by kurtmckee on 2 Sep 2011 at 6:01

Changed title: unicode characters cause crash during relative uri resolution
Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Issue 351 has been merged into this issue.

Original comment by kurtmckee on 18 May 2012 at 3:40

HaveF / feedparser

unicode characters cause crash during relative uri resolution #303