HaveF / feedparser

Automatically exported from code.google.com/p/feedparser

HTMLParseError with engadget malformed item description #312

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
The engadget.com feed [1] could not be parsed today.
I saved the feed for reproducibility [2] and, after some investigation, extracted
the <item> that was causing the trouble with lxml.etree and saved the resulting
feed with this single item as well [3].

What steps will reproduce the problem?
1. python -c "import feedparser; 
feedparser.parse('http://mister-muffin.de/p/poNJ.txt')"

What is the expected output? What do you see instead?

-%<--------------------------------------------
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/pymodules/python2.6/feedparser.py", line 3822, in parse
    feedparser.feed(data.decode('utf-8', 'replace'))
  File "/usr/lib/pymodules/python2.6/feedparser.py", line 1851, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "/usr/lib/python2.6/sgmllib.py", line 104, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/sgmllib.py", line 143, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python2.6/sgmllib.py", line 320, in parse_endtag
    self.finish_endtag(tag)
  File "/usr/lib/python2.6/sgmllib.py", line 360, in finish_endtag
    self.unknown_endtag(tag)
  File "/usr/lib/pymodules/python2.6/feedparser.py", line 657, in unknown_endtag
    method()
  File "/usr/lib/pymodules/python2.6/feedparser.py", line 1545, in _end_description
    value = self.popContent('description')
  File "/usr/lib/pymodules/python2.6/feedparser.py", line 961, in popContent
    value = self.pop(tag)
  File "/usr/lib/pymodules/python2.6/feedparser.py", line 868, in pop
    mfresults = _parseMicroformats(output, self.baseuri, self.encoding)
  File "/usr/lib/pymodules/python2.6/feedparser.py", line 2420, in _parseMicroformats
    p = _MicroformatsParser(htmlSource, baseURI, encoding)
  File "/usr/lib/pymodules/python2.6/feedparser.py", line 2024, in __init__
    self.document = BeautifulSoup.BeautifulSoup(data)
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 229, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 304, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 12, column 110
->%--------------------------------------------

What version of the product are you using? On what operating system?

Python 2.6.7
python-feedparser 5.0.1-1
python-beautifulsoup 3.1.0.1-2
Debian unstable (as of 8 dec 2011)

Please provide any additional information below.

I investigated a bit further, extracted the item description that was
causing the problem, and tried to parse it with BeautifulSoup manually:

python -c "from lxml import etree; import BeautifulSoup; tree = etree.parse('out'); BeautifulSoup.BeautifulSoup(tree.findall('//channel/item/description')[0].text)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 229, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 304, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 12, column 110

So it is a problem with BeautifulSoup, but shouldn't feedparser guard against
such a problem rather than throw an exception?

Is it instead advised that I put a try/except around feedparser.parse()?

[1] http://www.engadget.com/exclude/Apple/rss.xml
[2] http://mister-muffin.de/p/q1sr
[3] http://mister-muffin.de/p/poNJ.txt

Original issue reported on code.google.com by j.scha...@web.de on 8 Dec 2011 at 12:20

GoogleCodeExporter commented 9 years ago
Excellent bug report, thanks! I'm not able to reproduce this using any of the
linked feeds with feedparser 5.1 and BeautifulSoup 3.2.0. Would you try
reproducing the issue after upgrading?

Feedparser doesn't support BeautifulSoup 3.1.0.1; its author has written that
he considers that release a failed experiment. I also know that several
BeautifulSoup-related crashes were fixed in feedparser 5.1, so I expect that
upgrading will fix the problem.

As for wrapping feedparser in a try/except: you can if you want, but
feedparser's goal is to never crash, which is why reports like this are so
valuable for improving the software.
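If you do want a guard in the meantime, a minimal sketch of the defensive pattern looks like the following. A stand-in `fragile_parse` function is used here so the snippet is self-contained; with the real library you would call `feedparser.parse(url)` in its place, and catch the parser's exception there.

```python
def fragile_parse(data):
    """Stand-in for a parser that may raise on malformed input."""
    if "<malformed" in data:
        raise ValueError("malformed start tag")
    return {"entries": [data]}

def safe_parse(data):
    """Return the parsed result, or None if the parser raised."""
    try:
        return fragile_parse(data)
    except Exception:
        return None

print(safe_parse("<item>ok</item>"))  # parsed normally
print(safe_parse("<malformed x"))     # exception swallowed, returns None
```

Note that later feedparser releases avoid raising from parse() and instead report problems on the returned result (the bozo flag), so in practice upgrading is the better fix than wrapping.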

Original comment by kurtmckee on 8 Dec 2011 at 3:39

GoogleCodeExporter commented 9 years ago
An update to BeautifulSoup 3.2.0 fixed the issue.

Sorry for not having tried this myself; you can close this now.

Original comment by j.scha...@web.de on 9 Dec 2011 at 7:47

GoogleCodeExporter commented 9 years ago
Glad that worked!

Original comment by kurtmckee on 9 Dec 2011 at 7:49