Unable to parse feed for The Onion

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. import feedparser
2. d = feedparser.parse("http://www.theonion.com/content/feeds/weekly")
3. d

What is the expected output? What do you see instead?

Expected output is parsed feed. Actual output is a minimal set of attributes 
(albeit with a giant HTML 'summary') with a bozo_exception and no entries. It 
ends with:

'href': u'http://www.theonion.com/feeds/weekly/', 'version': u'', 'entries': 
[], 'bozo_exception': SAXParseException('mismatched tag',), 'namespaces': {}}

What version of the product are you using? On what operating system?
feedparser 5.1.3
Python 2.7.2
Mac OS X 10.8.2

Please provide any additional information below.

Original issue reported on code.google.com by bob.dick...@gmail.com on 15 Mar 2013 at 7:00

GoogleCodeExporter commented 9 years ago

Have you looked at that URL recently? It does not contain a feed anymore.

Original comment by jdd...@gmail.com on 23 Mar 2013 at 4:46


GoogleCodeExporter commented 9 years ago

Yeah, sorry about that.  I did realize that later.  I'm still trying to figure 
out how to detect that kind of situation (given what is parseable).  I am 
assuming that the SAX exception happened, and then it used the fallback parsing 
to get what it got, and that's probably the limit of what I can/should expect.
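A rough sketch of that kind of check, for illustration only. It assumes the result behaves like a dict with the `bozo`, `bozo_exception`, `version`, and `entries` keys shown in the output above; `summarize_parse` is a made-up helper name, not part of feedparser. Note that `bozo` alone only says the document was not well-formed, so a sloppy but real feed can set it too.

```python
def summarize_parse(result):
    """Describe a feedparser-style result mapping in one line.

    `result` is assumed to be dict-like (feedparser results are), with the
    keys seen in the report: bozo, bozo_exception, version, entries.
    """
    bozo = bool(result.get("bozo"))
    exc = result.get("bozo_exception")
    version = result.get("version") or ""
    entries = result.get("entries") or []

    if not bozo:
        return "parsed cleanly (version=%r, %d entries)" % (version, len(entries))

    reason = type(exc).__name__ if exc is not None else "unknown error"
    if not version and not entries:
        # Nothing feed-like survived the fallback parse.
        return "no feed recognized (%s)" % reason
    # Ill-formed, but something feed-like was still extracted.
    return "ill-formed feed (%s, version=%r, %d entries)" % (reason, version, len(entries))
```

With a real result this would be called as `summarize_parse(feedparser.parse(url))`.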

Original comment by bob.dick...@gmail.com on 23 Mar 2013 at 9:05


GoogleCodeExporter commented 9 years ago

I'm fine with this getting closed as not a bug.  I was going to do it myself, 
but I don't see how.

Original comment by bob.dick...@gmail.com on 23 Mar 2013 at 9:07


GoogleCodeExporter commented 9 years ago

OK, after looking at this a little more, I'll add my $.02.

I'm doing a Google Reader clone and importing feeds from Google Reader 
accounts.  Today I had a few friends import their feeds, 440 in total.  A 
surprising number of these feeds (40+) failed in roughly the way The Onion 
failed in this bug report (these could have been added to Reader years ago in 
some cases).  Which is to say that the feed URI points to a page that is not an 
RSS feed (in a lot of these cases, like the one in this bug, it redirected to 
an HTML page).

My problem, being new to feedparser, was that I naively assumed that all of the 
"feed" structures in the result indicated that a feed was actually found.  I 
now see that feedparser just does its best to parse whatever you throw at it as 
if it were a feed, even when there is no indication in the content that that's 
the case.  Ok, fine.

So now my job is to pick through all of that stuff and figure out if the thing 
is a feed (maybe a really crappy one), or if it's in fact something else.  I'm 
looking at a combination of version, whether there are any entries, and whether 
the content-type header contained "xml", in order to determine if I should 
consider this a "feed".  If not, I look in the links to see if there is a feed 
(alternate with an xml type), and I track that down.  That approach seems to 
work in most cases.
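The heuristic described above could be sketched roughly as follows. This is only one possible way to combine the three signals, not the exact rule used here; the function names are hypothetical, and the code assumes the result is dict-like with `version`, `entries`, `headers`, and `feed.links` as feedparser normally provides them.

```python
def header_value(headers, name):
    """Case-insensitive header lookup, since key casing may vary."""
    for key, value in (headers or {}).items():
        if key.lower() == name.lower():
            return value or ""
    return ""


def looks_like_parsed_feed(result):
    """Guess whether a feedparser-style result really came from a feed.

    Assumed rule (one combination among many): a detected version is
    enough on its own; otherwise require entries plus an XML content type.
    """
    version = result.get("version") or ""
    entries = result.get("entries") or []
    content_type = header_value(result.get("headers"), "content-type").lower()
    return bool(version) or (bool(entries) and "xml" in content_type)


def alternate_feed_links(result):
    """Collect hrefs of rel="alternate" links whose type mentions xml."""
    feed = result.get("feed") or {}
    found = []
    for link in feed.get("links") or []:
        is_alternate = link.get("rel") == "alternate"
        is_xml = "xml" in (link.get("type") or "")
        if is_alternate and is_xml and link.get("href"):
            found.append(link["href"])
    return found
```

If `looks_like_parsed_feed(d)` is false, each URL from `alternate_feed_links(d)` is a candidate to parse next.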

My primary complaint, I guess, is that it's not obvious, especially to someone 
new to feedparser, that the thing feedparser gives back may not actually 
represent a feed, and it's a bit of work to figure that out.

Original comment by bob.dick...@gmail.com on 24 Mar 2013 at 9:00


GoogleCodeExporter commented 9 years ago

Thanks for the input! This is indeed something that people have struggled with 
but I haven't taken time to consider how to resolve this. There are perhaps 
ways to resolve this (as an example, feedparser could do a quick check of the 
beginning of the document to see if it looks like a feed) but I haven't 
committed to adding this yet because there are still parsing problems that have 
to be resolved, while developers could potentially use something like the 
`requests` library to download and sniff the document before passing it to 
feedparser.
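As a minimal sketch of that download-and-sniff idea: the helper names and the 2048-byte window are assumptions for illustration, and the root-element check is deliberately crude (it ignores UTF-16 documents and markup hidden inside comments). It is not something feedparser does itself.

```python
import re

# First element-like tag in the document; "<?xml" and "<!DOCTYPE" are skipped
# because their second character is not a letter or underscore.
FIRST_TAG = re.compile(rb"<([A-Za-z_][\w.:-]*)")
FEED_ROOTS = {b"rss", b"feed", b"rdf"}  # RSS, Atom, RSS 1.0 (rdf:RDF)


def looks_like_feed(head):
    """Guess from the leading bytes whether a document is a feed."""
    match = FIRST_TAG.search(head)
    if match is None:
        return False
    # Drop any namespace prefix, e.g. b"rdf:RDF" -> b"rdf".
    local_name = match.group(1).split(b":")[-1].lower()
    return local_name in FEED_ROOTS


def fetch_and_parse(url, sniff_bytes=2048):
    """Download with requests, sniff, and only then hand off to feedparser."""
    # Third-party imports kept local so the sniffing helper works without them.
    import feedparser
    import requests

    response = requests.get(url, timeout=10)
    if not looks_like_feed(response.content[:sniff_bytes]):
        return None  # probably an HTML page or something else entirely
    return feedparser.parse(response.content)
```

A `None` return from `fetch_and_parse` would then be the signal that the URL no longer serves a feed.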

I really appreciate the input, as it helps guide how I move forward with 
feedparser development! Thanks! =)

Original comment by kurtmckee on 27 Apr 2013 at 6:42

Changed state: Invalid

HaveF / feedparser

Unable to parse feed for The Onion #393