Closed GoogleCodeExporter closed 9 years ago
Have you looked at that URL recently? It does not contain a feed anymore.
Original comment by jdd...@gmail.com
on 23 Mar 2013 at 4:46
Yeah, sorry about that. I did realize that later. I'm still trying to figure
out how to detect that kind of situation (given what is parseable). I am
assuming that the SAX exception happened, and then it used the fallback parsing
to get what it got, and that's probably the limit of what I can/should expect.
Original comment by bob.dick...@gmail.com
on 23 Mar 2013 at 9:05
I'm fine with this getting closed as not a bug. I was going to do it myself,
but I don't see how.
Original comment by bob.dick...@gmail.com
on 23 Mar 2013 at 9:07
OK, after looking at this a little more, I'll add my $.02.
I'm doing a Google Reader clone and importing feeds from Google Reader
accounts. Today I had a few friends import their feeds, 440 in total. A
surprising number of these feeds (40+) failed in roughly the way The Onion
failed in this bug report (these could have been added to Reader years ago in
some cases). Which is to say that the feed URI points to a page that is not an
RSS feed (in a lot of these cases, like the one in this bug, it redirected to
an HTML page).
My problem, being new to feedparser, was that I naively assumed that all of the
"feed" structures in the result indicated that a feed was actually found. I
now see that feedparser just does its best to parse whatever you throw at it as
if it were a feed, even when there is no indication in the content that that's
the case. Ok, fine.
So now my job is to pick through all of that stuff and figure out if the thing
is a feed (maybe a really crappy one), or if it's in fact something else. I'm
looking at a combination of version, whether there are any entries, and whether
the content-type header contained "xml", in order to determine if I should
consider this a "feed". If not, I look in the links to see if there is a feed
(alternate with an xml type), and I track that down. That approach seems to
work in most cases.
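
For anyone wanting to try the same thing, here is a rough sketch of that check. The exact way the three signals are combined is my assumption (a recognized version alone counts, otherwise entries plus an XML content type), and the helper names `looks_like_feed` and `alternate_feed_links` are made up for illustration. It only reads the `version`, `entries`, `headers`, and `feed.links` keys of the result, so it works on a plain dict as well as on whatever `feedparser.parse()` returns.

```python
def looks_like_feed(result):
    """Guess whether a parse result represents a real feed.

    Assumed rule: a recognized version is enough on its own; otherwise
    require at least one entry and "xml" in the Content-Type header.
    """
    version = result.get("version") or ""
    entries = result.get("entries") or []
    headers = result.get("headers") or {}

    # Header names may arrive in any case, so compare case-insensitively.
    content_type = ""
    for name, value in headers.items():
        if name.lower() == "content-type":
            content_type = (value or "").lower()

    return bool(version) or (bool(entries) and "xml" in content_type)


def alternate_feed_links(result):
    """Return hrefs of rel="alternate" links whose type mentions xml."""
    feed = result.get("feed") or {}
    links = feed.get("links") or []
    return [
        link["href"]
        for link in links
        if link.get("rel") == "alternate"
        and "xml" in (link.get("type") or "").lower()
        and link.get("href")
    ]
```

If `looks_like_feed(result)` is false, the URLs from `alternate_feed_links(result)` are the next candidates to fetch and parse.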
My primary complaint, I guess, is that it's not obvious, especially to someone
new to feedparser, that the thing feedparser gives back may not actually
represent a feed, and it's a bit of work to figure that out.
Original comment by bob.dick...@gmail.com
on 24 Mar 2013 at 9:00
Thanks for the input! This is indeed something that people have struggled with
but I haven't taken time to consider how to resolve this. There are perhaps
ways to resolve this (as an example, feedparser could do a quick check of the
beginning of the document to see if it looks like a feed) but I haven't
committed to adding this yet because there are still parsing problems that have
to be resolved, while developers could potentially use something like the
`requests` library to download and sniff the document before passing it to
feedparser.
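
A minimal sketch of that download-and-sniff approach, in case it helps. The sniffing rule (look for an `rss`, `feed`, or `rdf:RDF` tag before any `html` tag within the first 2048 bytes) is only an assumption about what a "quick check" might look like, not something feedparser does itself, and `sniff_is_feed` and `fetch_and_parse` are hypothetical helper names.

```python
import re

# Opening tags that usually start an RSS, Atom, or RDF feed.
FEED_ROOT = re.compile(rb"<\s*(rss|feed|rdf:RDF)\b", re.IGNORECASE)
# Markers that usually mean the document is an HTML page.
HTML_ROOT = re.compile(rb"<\s*(!doctype\s+html|html)\b", re.IGNORECASE)


def sniff_is_feed(body, limit=2048):
    """Return True if the first `limit` bytes look like a feed, not HTML."""
    head = body[:limit]
    feed_match = FEED_ROOT.search(head)
    if feed_match is None:
        return False
    html_match = HTML_ROOT.search(head)
    # A feed tag only counts if it appears before any HTML marker.
    return html_match is None or feed_match.start() < html_match.start()


def fetch_and_parse(url):
    """Download `url` and parse it only if the body sniffs as a feed."""
    # Imported here so sniff_is_feed() can be used without these installed.
    import feedparser
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    if not sniff_is_feed(response.content):
        return None  # Probably an HTML page or something else entirely.
    return feedparser.parse(response.content)
```

A `None` return from `fetch_and_parse()` means the URL most likely no longer points at a feed, which is the case this issue ran into.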
I really appreciate the input, as it helps guide how I move forward with
feedparser development! Thanks! =)
Original comment by kurtmckee
on 27 Apr 2013 at 6:42
Original issue reported on code.google.com by
bob.dick...@gmail.com
on 15 Mar 2013 at 7:00