Parsing character reference with uppercase `X` throws ValueError

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?

import feedparser
content = '''
<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0">
  <channel>
    <item><description>&amp;#X61;</description></item>
  </channel>
</rss>
'''
feedparser.parse(content)

What is the expected output? What do you see instead?

I expected the parse to be successful, with the parsed value being "&#x61;".

Instead, I see this exception:

/home/chungwu/.virtualenvs/pod/local/lib/python2.7/site-packages/feedparser.pyc in parse(url_file_stream_or_string, etag, modified, agent, referrer, handlers, request_headers, response_headers)
   4010     if not use_strict_parser and _SGML_AVAILABLE:
   4011         feedparser = _LooseFeedParser(baseuri, baselang, 'utf-8', entities)
-> 4012         feedparser.feed(data.decode('utf-8', 'replace'))
   4013     result['feed'] = feedparser.feeddata
   4014     result['entries'] = feedparser.entries

/home/chungwu/.virtualenvs/pod/local/lib/python2.7/site-packages/feedparser.pyc in feed(self, data)
   1931             if self.encoding and isinstance(data, unicode):
   1932                 data = data.encode(self.encoding)
-> 1933         sgmllib.SGMLParser.feed(self, data)
   1934         sgmllib.SGMLParser.close(self)
   1935 

/usr/lib/python2.7/sgmllib.pyc in feed(self, data)
    102 
    103         self.rawdata = self.rawdata + data
--> 104         self.goahead(0)
    105 
    106     def close(self):

/home/chungwu/.virtualenvs/pod/local/lib/python2.7/site-packages/feedparser.pyc in goahead(self, end)
    141                     continue
    142                 if rawdata.startswith("</", i):
--> 143                     k = self.parse_endtag(i)
    144                     if k < 0: break
    145                     i = k

/usr/lib/python2.7/sgmllib.pyc in parse_endtag(self, i)
    318         if rawdata[j] == '>':
    319             j = j+1
--> 320         self.finish_endtag(tag)
    321         return j
    322 

/usr/lib/python2.7/sgmllib.pyc in finish_endtag(self, tag)
    358                     method = getattr(self, 'end_' + tag)
    359                 except AttributeError:
--> 360                     self.unknown_endtag(tag)
    361                 else:
    362                     self.report_unbalanced(tag)

/home/chungwu/.virtualenvs/pod/local/lib/python2.7/site-packages/feedparser.pyc in unknown_endtag(self, tag)
    707                 raise AttributeError()
    708             method = getattr(self, methodname)
--> 709             method()
    710         except AttributeError:
    711             self.pop(prefix + suffix)

/home/chungwu/.virtualenvs/pod/local/lib/python2.7/site-packages/feedparser.pyc in _end_description(self)
   1611             self._end_content()
   1612         else:
-> 1613             value = self.popContent('description')
   1614         self._summaryKey = None
   1615     _end_abstract = _end_description

/home/chungwu/.virtualenvs/pod/local/lib/python2.7/site-packages/feedparser.pyc in popContent(self, tag)
   1027 
   1028     def popContent(self, tag):
-> 1029         value = self.pop(tag)
   1030         self.incontent -= 1
   1031         self.contentparams.clear()

/home/chungwu/.virtualenvs/pod/local/lib/python2.7/site-packages/feedparser.pyc in pop(self, element, stripWhitespace)
    927         if is_htmlish and RESOLVE_RELATIVE_URIS:
    928             if element in self.can_contain_relative_uris:
--> 929                 output = _resolveRelativeURIs(output, self.baseuri, self.encoding, self.contentparams.get('type', u'text/html'))
    930 
    931         # parse microformats

/home/chungwu/.virtualenvs/pod/local/lib/python2.7/site-packages/feedparser.pyc in _resolveRelativeURIs(htmlSource, baseURI, encoding, _type)
   2570 
   2571     p = _RelativeURIResolver(baseURI, encoding, _type)
-> 2572     p.feed(htmlSource)
   2573     return p.output()
   2574 

/home/chungwu/.virtualenvs/pod/local/lib/python2.7/site-packages/feedparser.pyc in feed(self, data)
   1931             if self.encoding and isinstance(data, unicode):
   1932                 data = data.encode(self.encoding)
-> 1933         sgmllib.SGMLParser.feed(self, data)
   1934         sgmllib.SGMLParser.close(self)
   1935 

/usr/lib/python2.7/sgmllib.pyc in feed(self, data)
    102 
    103         self.rawdata = self.rawdata + data
--> 104         self.goahead(0)
    105 
    106     def close(self):

/home/chungwu/.virtualenvs/pod/local/lib/python2.7/site-packages/feedparser.pyc in goahead(self, end)
    184                 if match:
    185                     name = match.group(1)
--> 186                     self.handle_charref(name)
    187                     i = match.end(0)
    188                     if rawdata[i-1] != ';': i = i-1

/home/chungwu/.virtualenvs/pod/local/lib/python2.7/site-packages/feedparser.pyc in handle_charref(self, ref)
   1984             value = int(ref[1:], 16)
   1985         else:
-> 1986             value = int(ref)
   1987 
   1988         if value in _cp1252:

ValueError: invalid literal for int() with base 10: 'X61'

What version of the product are you using? On what operating system?

Version 5.1.2; Ubuntu 12.04; Python 2.7

Please provide any additional information below.

If the text had instead been "&amp;#x61;" (with a lower-case "x" instead of 
upper-case "X"), then this works as expected without errors.
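
For illustration, here is a minimal sketch of a case-insensitive way to decode the reference body. It assumes the failure comes from the hex branch only recognizing a lowercase "x", which is what the traceback suggests, since 'X61' falls through to int(ref). The function name charref_to_int is hypothetical and is not part of feedparser, and this is not necessarily how the library handles it.

```python
def charref_to_int(ref):
    """Convert the body of a numeric character reference to a code point.

    `ref` is the text between '&#' and ';', e.g. 'x61', 'X61' or '97'.
    Hypothetical helper for illustration; not feedparser's own code.
    """
    # Normalize case so that both 'x61' and 'X61' take the hex branch.
    ref = ref.lower()
    if ref.startswith('x'):
        # Hexadecimal reference: drop the leading 'x' and parse base 16.
        return int(ref[1:], 16)
    # Decimal reference.
    return int(ref)


if __name__ == '__main__':
    for body in ('x61', 'X61', '97'):
        print(body, charref_to_int(body))
```

With this approach the uppercase and lowercase forms decode to the same code point, so "&#X61;" and "&#x61;" would be treated identically.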

Original issue reported on code.google.com by chun...@gmail.com on 25 Sep 2012 at 9:21

GoogleCodeExporter commented 9 years ago

Nice! With a unit test! Fantastic!

Original comment by kurtmckee on 19 Nov 2012 at 3:50

Changed title: Parsing character reference with uppercase X throws ValueError
Changed state: Accepted

GoogleCodeExporter commented 9 years ago

This issue was closed by revision 62ec5583a3bd.

Original comment by kurtmckee on 28 Nov 2012 at 5:35

Changed state: Fixed