I don't understand in what way feedparser is breaking. I tried using both the
uploaded file and the source URL you provided above, and I tried it both with
and without BeautifulSoup for microformat parsing, and I tried it on Python 2.5
and 2.7. Could you elaborate?
Original comment by kurtmckee
on 30 Aug 2011 at 7:06
It breaks in that it raises an exception when you call the parse() method. The
original feed seems to have been fixed (I contacted the user regarding the
issue), but the feed I uploaded has the problem. I tested again after your
comment by getting the latest version from svn and then downloading the test
file from this issue page and ran the test and it failed again. Here is my
stack trace. I'm running this on Python 2.5 using the app engine SDK:
Traceback (most recent call last):
  File "test/feed_parsing.py", line 76, in test_parsing
    result = PshbHandler.new_update(blog, content, "nbhub", "", True)
  File "/Users/waleed/nb/app/appengine/api/pshb.py", line 253, in new_update
    data = feedparser.parse(feed_content, response_headers=feed_headers)
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 3889, in parse
    saxparser.parse(source)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/expatreader.py", line 349, in end_element_ns
    self._cont_handler.endElementNS(pair, None)
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 1802, in endElementNS
    self.unknown_endtag(localname)
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 655, in unknown_endtag
    method()
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 1654, in _end_content
    value = self.popContent('content')
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 970, in popContent
    value = self.pop(tag)
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 873, in pop
    output = _resolveRelativeURIs(output, self.baseuri, self.encoding, self.contentparams.get('type', u'text/html'))
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 2513, in _resolveRelativeURIs
    p.feed(htmlSource)
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 1874, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 1854, in parse_starttag
    j = self.__parse_starttag(i)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/sgmllib.py", line 291, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/sgmllib.py", line 333, in finish_starttag
    self.unknown_starttag(tag, attrs)
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 2505, in unknown_starttag
    attrs = [(key, ((tag, key) in self.relative_uris) and self.resolveURI(value) or value) for key, value in attrs]
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 2501, in resolveURI
    return _makeSafeAbsoluteURI(_urljoin(self.baseuri, uri.strip()))
  File "/Users/waleed/nb/app/appengine/lib/feedparser.py", line 425, in _urljoin
    uri = urlparse.urljoin(base, uri)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urlparse.py", line 288, in urljoin
    return urlunparse((scheme, netloc, '/'.join(segments),
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Original comment by wal...@ninua.com
on 30 Aug 2011 at 8:57
Okay, I remember running into this bug in urljoin back when I was finagling
feedparser to work in Python 3. This is happening because '/' is a string, and
at least one of the segments being joined is a unicode string. Python is
stupidly type coercing to 'ascii' instead of 'utf-8'. However, I'm still not
able to recreate this issue in Python 2.5.6.
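
For illustration only, here is a rough sketch of the coercion described above. It is written for Python 3, where mixing bytes and text is a plain TypeError, so the implicit Python 2 decode step is spelled out by hand. The helper name `py2_style_join` and the sample segments are hypothetical; only the leading byte 0xc2 comes from the traceback, and the second byte (0xa0) is an assumption.

```python
def py2_style_join(sep, segments, default_encoding="ascii"):
    """Mimic Python 2 joining a mix of byte strings and unicode strings.

    Python 2 decoded every byte string with the interpreter's default
    encoding as soon as one unicode string was involved. This helper
    makes that hidden step explicit.
    """
    if isinstance(sep, bytes):
        sep = sep.decode(default_encoding)
    decoded = [
        s.decode(default_encoding) if isinstance(s, bytes) else s
        for s in segments
    ]
    return sep.join(decoded)


# Pure-ASCII byte strings coerce without complaint.
print(py2_style_join(b"/", [b"feeds", u"posts"]))

# A UTF-8 encoded, non-ASCII byte string fails under the 'ascii' default.
try:
    py2_style_join(b"/", [u"feeds", b"\xc2\xa0posts"])
except UnicodeDecodeError as exc:
    print(exc)

# The same input decodes fine if the default encoding is 'utf-8'.
print(repr(py2_style_join(b"/", [u"feeds", b"\xc2\xa0posts"], "utf-8")))
```

The middle call is the case from the stack trace: one unicode segment forces every byte string through the 'ascii' codec.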
Can you give me any other information about the Python version on AppEngine?
What is the exact release they're using? Are there any published changes that
they've made to Python that would introduce this difference in our two versions?
Original comment by kurtmckee
on 30 Aug 2011 at 3:25
I get the error on both App Engine (when I upload the app to
*.appspot.com) and my local App Engine SDK. Locally, I use Python 2.5.5,
and I'm not sure of the exact version on GAE, but I know it's 2.5.*.
Could it be that your default encoding is set to utf-8? Try this:
import sys
print sys.getdefaultencoding()
I get 'ascii' on both local and GAE. If yours returns utf-8, then that would
explain why you don't see this problem.
Now that I think about this more, I think my fix above is not accurate because
I'm assuming 'utf-8'. Since utf-8 is probably the most popular encoding anyway,
the fix will work most of the time and better than not having it. But the right
way is to use whatever encoding the document is using. The same applies to the
following line in the code which also uses utf-8 (this is the line I copied for
my fix).
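
To make the idea concrete, here is a minimal sketch of "decode with the document's encoding, fall back to utf-8". It is not the patch attached to this issue and not feedparser code: `to_text` and `join_as_text` are hypothetical helpers, the URLs are made up, and it is written for Python 3, where `urlparse.urljoin` lives at `urllib.parse.urljoin`.

```python
from urllib.parse import urljoin  # Python 2 spelled this urlparse.urljoin


def to_text(value, encoding=None):
    """Return value as text, decoding bytes with the document's encoding.

    Falls back to UTF-8 when the document's encoding is not known.
    """
    if isinstance(value, bytes):
        return value.decode(encoding or "utf-8")
    return value


def join_as_text(base, uri, encoding=None):
    """Join base and uri only after both are text, so nothing is coerced
    implicitly inside urljoin."""
    return urljoin(to_text(base, encoding), to_text(uri, encoding))


# UTF-8 bytes, no declared encoding: the fallback applies.
print(join_as_text(u"http://example.com/feed/", b"\xc2\xa9-notes/post.html"))

# Latin-1 bytes: decoding as UTF-8 would fail, so the document's
# encoding has to be passed in.
print(join_as_text(u"http://example.com/", b"caf\xe9", "latin-1"))
```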
Original comment by wal...@ninua.com
on 30 Aug 2011 at 6:27
Ah, I'm able to recreate the problem now...but only if I pass the attachment
URL above directly to feedparser.
If I save the file and then parse it, no crash.
If I save the file *and* pass in the original headers using the
`response_headers` argument, no crash.
If I use pdb to wipe out the headers immediately after the URL's been
requested, feedparser still crashes.
I love a good challenge! Most important to me is figuring out why I can't
recreate the problem except over HTTP; without knowing that, I can't create an
all-important test case. Unfortunately I probably won't have enough time
tonight to figure that out.
I will tell you that I will not have any strings gallivanting about in the
original encoding after parsing has begun; feedparser will internally use
unicode objects exclusively (although sgmllib forces us to deal with
utf8-encoded strings like chumps). That said, you should be guaranteed that
strings are encoded in utf8 by the time they reach that point in the code. :)
Original comment by kurtmckee
on 31 Aug 2011 at 6:01
Shortly after posting that I realized that I just needed an
`xml:base` attribute to ensure that the right code path was followed. I should
have time to bang out a test case and apply the patch in the next few days.
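
As a rough illustration of why an `xml:base` attribute matters here, the sketch below builds a tiny feed with a base URI and a non-ASCII relative link, then resolves the link against the base. This is not the test case added to feedparser: the feed content and URLs are invented, and it uses the standard library's ElementTree on Python 3 rather than feedparser itself.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Namespace of the predefined "xml:" prefix, as ElementTree reports it.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical feed: xml:base on the root, and a relative href that
# starts with a non-ASCII character (U+00A9, UTF-8 bytes C2 A9).
FEED = u"""<feed xmlns="http://www.w3.org/2005/Atom" xml:base="http://example.com/blog/">
  <entry>
    <content type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml"><a href="\u00a9-notes/post.html">post</a></div>
    </content>
  </entry>
</feed>""".encode("utf-8")

root = ET.fromstring(FEED)
base = root.get(XML_NS + "base")
link = root.find(".//" + XHTML_NS + "a")

# With a base URI available, the relative href gets joined against it,
# which is the step that raised the exception in the traceback above.
resolved = urljoin(base, link.get("href"))
print(resolved)
```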
Original comment by kurtmckee
on 31 Aug 2011 at 6:47
Fixed in r577. Thanks for providing this patch!
Original comment by kurtmckee
on 2 Sep 2011 at 6:01
Issue 351 has been merged into this issue.
Original comment by kurtmckee
on 18 May 2012 at 3:40
Original issue reported on code.google.com by
wal...@ninua.com
on 30 Aug 2011 at 5:59