enclosure-sniffing microformat code can throw ValueError

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
>>> feedparser.parse('http://www.shareable.net/blog/all/feed')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/sclay/projects/newsblur/utils/feedparser.py", line 3975, in parse
    saxparser.parse(source)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 207, in feed
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 349, in end_element_ns
  File "/Users/sclay/projects/newsblur/utils/feedparser.py", line 1853, in endElementNS
    self.unknown_endtag(localname)
  File "/Users/sclay/projects/newsblur/utils/feedparser.py", line 700, in unknown_endtag
    method()
  File "/Users/sclay/projects/newsblur/utils/feedparser.py", line 1604, in _end_description
    value = self.popContent('description')
  File "/Users/sclay/projects/newsblur/utils/feedparser.py", line 1020, in popContent
    value = self.pop(tag)
  File "/Users/sclay/projects/newsblur/utils/feedparser.py", line 926, in pop
    mfresults = _parseMicroformats(output, self.baseuri, self.encoding)
  File "/Users/sclay/projects/newsblur/utils/feedparser.py", line 2515, in _parseMicroformats
    p.findEnclosures()
  File "/Users/sclay/projects/newsblur/utils/feedparser.py", line 2489, in findEnclosures
    if not enclosure_match.search(elm.get('rel', u'')) and not self.isProbablyDownloadable(elm):
  File "/Users/sclay/projects/newsblur/utils/feedparser.py", line 2458, in isProbablyDownloadable
    path = urlparse.urlparse(attrsD['href'])[2]
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urlparse.py", line 134, in urlparse
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urlparse.py", line 182, in urlsplit
ValueError: Invalid IPv6 URL

What is the expected output? What do you see instead?
Expected a valid dictionary. The RSS feed is valid in the W3C feed validator: 
http://validator.w3.org/appc/check.cgi?url=http%3A%2F%2Fwww.shareable.net%2Fblog%2Fall%2Ffeed

What version of the product are you using? On what operating system?
5.1.2 (r737) on both Ubuntu 12.04 and Mac OS 10.7 with Python 2.7 on both.

Please provide any additional information below.
I noticed the other ValueError/IPv6 reports. This one seems to be coming from 
saxparser, but feedparser should still set the bozo flag or skip the invalid story.

Original issue reported on code.google.com by conesus on 1 Jul 2012 at 7:13

GoogleCodeExporter commented 9 years ago

I checked the feed with your link:
http://validator.w3.org/appc/check.cgi?url=http%3A%2F%2Fwww.shareable.net%2Fblog%2Fall%2Ffeed

and the output is the following:
----------------------------------------------------------------------------------
Sorry

This feed does not validate.

    'utf8' codec can't decode byte 0x80 in position 25355: invalid start byte (maybe a high-bit character?) [help]

    line 162, column 233: XML parsing error: <unknown>:162:233: undefined entity [help]

        ... States Federation of Worker Cooperativesâ?? endorsement of legisla ...
                                                     ^

In addition, interoperability with the widest range of feed readers could be 
improved by implementing the following recommendations.

    line 3, column 11: title should not be blank [help]

            <title></title>
                   ^

    line 20, column 0: description should not contain iframe tag (2 occurrences) [help]

        <p> <iframe allowfullscreen="" frameborder="0"  ...

    line 26, column 0: description should not contain relative URL references: /blog/finally-a-thrift-store-anthem (13 occurrences) [help]

        </description>

Source: http://www.shareable.net/blog/all/feed

Original comment by schla...@gmail.com on 7 Sep 2012 at 9:23


GoogleCodeExporter commented 9 years ago

I run a news reader (NewsBlur), so I can get you dozens of these. All of the 
following sites have the same issue:

http://feeds.feedburner.com/IeeeSpectrumTechTalkBlog
http://dgrin.com/external.php?type=RSS2&forumids=36
http://feeds.feedburner.com/bocoup
http://weblog.bocoup.com/feed
http://blog.londonjewelleryschool.co.uk/feed/
http://bbf.enssib.fr/blog/rss?type=co
http://fulltextrssfeed.com/feeds.feedburner.com/thedailybeast/articles

Whew.

Original comment by conesus on 7 Sep 2012 at 9:27


GoogleCodeExporter commented 9 years ago

All of these feeds contain invalid URLs in the content tag:
- http://feeds.feedburner.com/IeeeSpectrumTechTalkBlog -> http:// such formal invitation [http://blog.vixra.org/2012/07/02/higgs-en-route-for-cern/]
- http://feeds.feedburner.com/bocoup -> http://modernizr.com]
- http://weblog.bocoup.com/feed -> http://modernizr.com]
- http://blog.londonjewelleryschool.co.uk/feed/ -> http://[www.startupbritain.org
- http://bbf.enssib.fr/blog/rss?type=co -> http://[http://unpetitcabanon.vox.com/library/post/albi-nous-interpelle.html?_c=feed- atom
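
The crash can be reproduced without feedparser, because the exception is raised by the standard library's URL splitting when the host part has an unbalanced bracket. A minimal sketch, assuming Python 3 (where `urlparse` lives in `urllib.parse`; the traceback in the report is from the Python 2.7 `urlparse` module). The helper name `raises_value_error` is made up for illustration:

```python
from urllib.parse import urlparse

# hrefs copied from the feeds listed above; each has an unbalanced bracket
BAD_HREFS = [
    "http://modernizr.com]",
    "http://[www.startupbritain.org",
]


def raises_value_error(href):
    """Return True if urlparse() rejects the href with ValueError."""
    try:
        urlparse(href)
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    for href in BAD_HREFS:
        print(href, raises_value_error(href))
```

Any `a` tag in a feed's HTML content carrying an href like these would reach the same code path during enclosure sniffing.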

I had no problems parsing these feeds:
- http://dgrin.com/external.php?type=RSS2&forumids=36
- http://fulltextrssfeed.com/feeds.feedburner.com/thedailybeast/articles

I'm not the maintainer; I went through the bug reports today and tried to 
help.
I created a pull request for this issue on GitHub, and hopefully the 
maintainer will merge my patch in the near future:
https://github.com/kurtmckee/feedparser/pull/8

Original comment by schla...@gmail.com on 7 Sep 2012 at 10:56


GoogleCodeExporter commented 9 years ago

conesus, I'd like to make sure that you're not running into crashes! I'm 
familiar with this crash; it happens in the newer versions of Python 2.7 (and 
2.6 if I recall correctly). I thought I fixed this throughout the codebase, but 
I see that this is happening during microformat parsing (a constant source of 
problems).

It's my intention to remove microformat parsing from feedparser entirely, but 
in the meantime would you try setting the following global variable before 
parsing and see if this error goes away?

import feedparser
feedparser.PARSE_MICROFORMATS = False

Original comment by kurtmckee on 19 Nov 2012 at 4:49

Changed state: NeedInfo

GoogleCodeExporter commented 9 years ago

Kurt, did you review my patch? I think the problem is still in the codebase 
and could be fixed easily with try/except.
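
To illustrate the idea (this is a rough sketch, not the actual patch from the pull request): wrap the `urlparse` call in the downloadability check so that a malformed href is treated as "not downloadable" instead of aborting the whole parse. The function name, the extension list, and the Python 3 import are assumptions made for the example; feedparser's real `isProbablyDownloadable` works on an element's attributes.

```python
from urllib.parse import urlparse

# Hypothetical extension list, for illustration only.
DOWNLOADABLE_EXTENSIONS = (".zip", ".mp3", ".mp4", ".pdf")


def is_probably_downloadable(href):
    """Guess from the path extension whether an href points at a download.

    An href that urlparse() cannot handle is treated as not downloadable.
    """
    try:
        path = urlparse(href)[2]  # index 2 is the path component
    except ValueError:
        # Malformed href such as "http://modernizr.com]": skip it
        return False
    return path.lower().endswith(DOWNLOADABLE_EXTENSIONS)
```

With this shape, one bad link in an entry's HTML only costs that link, and the rest of the feed still parses.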

Original comment by schla...@gmail.com on 19 Nov 2012 at 7:03


GoogleCodeExporter commented 9 years ago

Bernd, you're right that the problem is still in the codebase! However, I'm 
going to completely remove the microformat parsing after the next release. It 
doesn't belong in feedparser.

Conesus (Samuel Clay?) please let me know if setting the PARSE_MICROFORMATS 
variable to False resolves the problem in the interim. I'm going to close this 
bug but I'll be notified when you respond. Thanks!

Original comment by kurtmckee on 26 Nov 2012 at 5:16

Changed state: WontFix

GoogleCodeExporter commented 9 years ago

Unfortunately, I cannot disable PARSE_MICROFORMATS because I need both tags and 
enclosures for my users. If you remove microformat parsing, will I have to parse 
out audio/video enclosures myself?

Original comment by conesus on 26 Nov 2012 at 7:20


GoogleCodeExporter commented 9 years ago

Hmm, I hadn't considered there was a strong need for capturing enclosures and 
tags from the microformats embedded in the HTML content. Can you confirm that 
the tags and enclosures aren't being parsed through the normal means (such as 
from the Atom, RSS, and Dublin Core XML tags that wouldn't be affected by 
turning off microformat parsing)?

I would actually be surprised but very interested to know that there are many 
feeds that use microformats, so this will be useful information!

For reference, the feed would need to contain HTML `a` tags in the content with 
`rel="enclosure"` or `rel="tag"` to be affected by turning off microformat 
parsing. It would also need to fail to include that information through (for 
example) an accompanying `enclosure` tag or `category` tag.
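
If it helps to check a feed's content by hand, here is a small sketch that lists the `a` tags in an HTML fragment whose `rel` includes `enclosure` or `tag`. It uses only the standard library (`html.parser`, Python 3) and is not how feedparser does it internally; the class and function names are made up for the example.

```python
from html.parser import HTMLParser


class RelLinkFinder(HTMLParser):
    """Collect <a> tags whose rel includes 'enclosure' or 'tag'."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (rel_value, href) tuples

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # rel may hold several space-separated values, or be missing
        rels = (attrs.get("rel") or "").lower().split()
        for rel in rels:
            if rel in ("enclosure", "tag"):
                self.found.append((rel, attrs.get("href")))


def find_microformat_links(html):
    """Return (rel, href) pairs for rel="enclosure" / rel="tag" links."""
    finder = RelLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

Running `find_microformat_links` over the HTML of an entry's content and comparing the result with the entry's regular `enclosure` and `category` data would show whether microformats are the only source of that information.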

Original comment by kurtmckee on 27 Nov 2012 at 2:35

Changed state: NeedInfo

GoogleCodeExporter commented 9 years ago

This issue was closed by revision ef3922c7f53c.

Original comment by kurtmckee on 6 Dec 2012 at 5:13

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

I reconsidered, and it's better to fix the bug and worry about stripping code 
some other time. Bernd, thanks for the patch!

conesus, thanks for reporting this! The fix will be in feedparser 5.1.3, which 
I'm planning to release Real Soon Now (TM). You can grab the current code from 
the git repo in the meantime.

Please don't hesitate to open more reports whenever you run into a problem!

Original comment by kurtmckee on 6 Dec 2012 at 5:20

Changed title: enclosure-sniffing microformat code can throw ValueError

HaveF / feedparser

enclosure-sniffing microformat code can throw ValueError #364