Closed GoogleCodeExporter closed 9 years ago
I checked the feed with your link:
http://validator.w3.org/appc/check.cgi?url=http%3A%2F%2Fwww.shareable.net%2Fblog%2Fall%2Ffeed
and the output is the following:
----------------------------------------------------------------------------------
Sorry
This feed does not validate.
'utf8' codec can't decode byte 0x80 in position 25355: invalid start byte (maybe a high-bit character?) [help]
line 162, column 233: XML parsing error: <unknown>:162:233: undefined entity [help]
... States Federation of Worker Cooperativesâ?? endorsement of legisla ...
^
In addition, interoperability with the widest range of feed readers could be
improved by implementing the following recommendations.
line 3, column 11: title should not be blank [help]
<title></title>
^
line 20, column 0: description should not contain iframe tag (2 occurrences) [help]
<p> <iframe allowfullscreen="" frameborder="0" ...
line 26, column 0: description should not contain relative URL references: /blog/finally-a-thrift-store-anthem (13 occurrences) [help]
</description>
Source: http://www.shareable.net/blog/all/feed
Original comment by schla...@gmail.com
on 7 Sep 2012 at 9:23
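For readers unfamiliar with the first validator message, here is a minimal sketch of how a stray 0x80 byte produces that error in Python. The sample bytes are made up for illustration and are not taken from the feed; the message text in Python 3 reads 'utf-8' rather than 'utf8'.

```python
# Hypothetical sample: ASCII text with a lone 0x80 byte, which is not a
# valid first byte of any UTF-8 sequence.
data = b"Worker Cooperatives\x80 endorsement"

error = None
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(exc.reason, "at byte offset", exc.start)

# A lenient decode substitutes U+FFFD for the undecodable byte instead of
# raising, which is one way to keep processing a damaged document.
text = data.decode("utf-8", errors="replace")
print(text)
```

Whether replacing the byte is acceptable depends on the consumer; a validator is right to report it, while a reader may prefer to show the rest of the text.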
I run a news reader (NewsBlur), so I can get you dozens of these. All of the
following sites have the same issue:
http://feeds.feedburner.com/IeeeSpectrumTechTalkBlog
http://dgrin.com/external.php?type=RSS2&forumids=36
http://feeds.feedburner.com/bocoup
http://weblog.bocoup.com/feed
http://blog.londonjewelleryschool.co.uk/feed/
http://bbf.enssib.fr/blog/rss?type=co
http://fulltextrssfeed.com/feeds.feedburner.com/thedailybeast/articles
Whew.
Original comment by conesus
on 7 Sep 2012 at 9:27
All of these feeds contain invalid URLs in the content tag:
- http://feeds.feedburner.com/IeeeSpectrumTechTalkBlog -> http:// such formal
invitation [http://blog.vixra.org/2012/07/02/higgs-en-route-for-cern/]
- http://feeds.feedburner.com/bocoup -> http://modernizr.com]
- http://weblog.bocoup.com/feed -> http://modernizr.com]
- http://blog.londonjewelleryschool.co.uk/feed/ ->
http://[www.startupbritain.org
- http://bbf.enssib.fr/blog/rss?type=co ->
http://[http://unpetitcabanon.vox.com/library/post/albi-nous-interpelle.html?_c=
feed- atom
I had no problems parsing these feeds:
- http://dgrin.com/external.php?type=RSS2&forumids=36
- http://fulltextrssfeed.com/feeds.feedburner.com/thedailybeast/articles
I'm not the maintainer; I went through the bug reports today to try to help.
I created a pull request for this issue on GitHub, and hopefully the
maintainer will merge my patch in the near future:
https://github.com/kurtmckee/feedparser/pull/8
Original comment by schla...@gmail.com
on 7 Sep 2012 at 10:56
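A plausible reason URLs like these cause trouble: Python's standard URL splitter treats square brackets in the host part as IPv6 literal delimiters and raises ValueError when they are unbalanced. The sketch below uses two of the malformed URLs listed above plus one well-formed URL for comparison; the helper name `is_parseable_url` is made up for this example and is not part of feedparser.

```python
from urllib.parse import urlsplit


def is_parseable_url(url):
    """Return True if the standard library can split the URL without raising."""
    try:
        urlsplit(url)
    except ValueError:
        # Raised for an unmatched "[" or "]" in the network location.
        return False
    return True


candidates = [
    "http://modernizr.com",            # well-formed, for comparison
    "http://modernizr.com]",           # stray closing bracket
    "http://[www.startupbritain.org",  # stray opening bracket
]

for url in candidates:
    print(url, "->", "ok" if is_parseable_url(url) else "rejected")
```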
conesus, I'd like to make sure that you're not running into crashes! I'm
familiar with this crash; it happens in the newer versions of Python 2.7 (and
2.6 if I recall correctly). I thought I fixed this throughout the codebase, but
I see that this is happening during microformat parsing (a constant source of
problems).
It's my intention to remove microformat parsing from feedparser entirely, but
in the meantime would you try setting the following global variable before
parsing and see if this error goes away?
import feedparser
feedparser.PARSE_MICROFORMATS = False
Original comment by kurtmckee
on 19 Nov 2012 at 4:49
Kurt, did you review my patch? I think the problem is still in the codebase
and could be fixed easily with try/except.
Original comment by schla...@gmail.com
on 19 Nov 2012 at 7:03
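The kind of guard suggested above (try/except in Python) could look roughly like the following. This is only a sketch of the idea, not the contents of the pull request: `safe_urljoin` is a hypothetical helper, and falling back to the unresolved link is an assumption about what a caller would want.

```python
from urllib.parse import urljoin


def safe_urljoin(base, link):
    """Resolve link against base, returning link unchanged if it cannot be parsed."""
    try:
        return urljoin(base, link)
    except ValueError:
        # A malformed link in the content should not abort parsing of the
        # whole feed, so keep the original string and move on.
        return link


base = "http://www.shareable.net/blog/all/feed"
print(safe_urljoin(base, "/blog/finally-a-thrift-store-anthem"))
print(safe_urljoin(base, "http://modernizr.com]"))
```

The trade-off is that a bad link stays in the output as-is instead of being reported, so a caller that cares about link quality would still need its own check.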
Bernd, you're right that the problem is still in the codebase! However, I'm
going to completely remove the microformat parsing after the next release. It
doesn't belong in feedparser.
Conesus (Samuel Clay?) please let me know if setting the PARSE_MICROFORMATS
variable to False resolves the problem in the interim. I'm going to close this
bug but I'll be notified when you respond. Thanks!
Original comment by kurtmckee
on 26 Nov 2012 at 5:16
Unfortunately, I cannot disable PARSE_MICROFORMATS because I need both tags and
enclosures for my users. If you remove microformat parsing, will I have to parse
out audio/video enclosures myself?
Original comment by conesus
on 26 Nov 2012 at 7:20
Hmm, I hadn't considered there was a strong need for capturing enclosures and
tags from the microformats embedded in the HTML content. Can you confirm that
the tags and enclosures aren't being parsed through the normal means (such as
from the Atom, RSS, and Dublin Core XML tags that wouldn't be affected by
turning off microformat parsing)?
I would actually be surprised but very interested to know that there are many
feeds that use microformats, so this will be useful information!
For reference, the feed would need to contain HTML `a` tags in the content with
`rel="enclosure"` or `rel="tag"` to be affected by turning off microformat
parsing. It would also need to fail to include that information through (for
example) an accompanying `enclosure` tag or `category` tag.
Original comment by kurtmckee
on 27 Nov 2012 at 2:35
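To check whether a given feed's content is affected, one could scan the entry HTML for the `a` tags described above. The following is a minimal sketch using only the standard library; `RelLinkFinder` and `find_microformat_links` are names invented for this example, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class RelLinkFinder(HTMLParser):
    """Collect (rel, href) pairs for <a> tags with rel="enclosure" or rel="tag"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated values, or be missing entirely.
        rels = (attr_map.get("rel") or "").lower().split()
        for rel in rels:
            if rel in ("enclosure", "tag"):
                self.links.append((rel, attr_map.get("href")))


def find_microformat_links(html):
    finder = RelLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.links


sample = (
    '<p><a rel="tag" href="http://example.com/tags/thrift">thrift</a> '
    '<a href="http://example.com/plain">plain link</a> '
    '<a rel="enclosure" href="http://example.com/audio.mp3">audio</a></p>'
)
print(find_microformat_links(sample))
```

An empty result for every entry in a feed would suggest that turning off microformat parsing loses nothing for that feed.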
This issue was closed by revision ef3922c7f53c.
Original comment by kurtmckee
on 6 Dec 2012 at 5:13
I reconsidered, and it's better to fix the bug and worry about stripping code
some other time. Bernd, thanks for the patch!
conesus, thanks for reporting this! The fix will be in feedparser 5.1.3, which
I'm planning to release Real Soon Now (TM). You can grab the current code from
the git repo in the meantime.
Please don't hesitate to open more reports whenever you run into a problem!
Original comment by kurtmckee
on 6 Dec 2012 at 5:20
Original issue reported on code.google.com by
conesus
on 1 Jul 2012 at 7:13