Parsing takes minutes (Raspberry Pi)

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?

1. Run the following Python program with any of the feeds listed. Uncomment one 
feed line to test it.

#--------start source
import feedparser

print "Parse"
# this one is fast, takes about 8 seconds
#feed = feedparser.parse("http://www.engadget.com/rss.xml")
# next one is about 60 seconds
#feed = feedparser.parse("http://www.eevblog.com/feed/")

# these are the slow ones. Takes minutes to parse:
#feed = feedparser.parse("http://www.heise.de/newsticker/heise-atom.xml")
#feed = feedparser.parse("http://www.spiegel.de/schlagzeilen/tops/index.rss")

print "Done Parse"
size = len(feed['entries'])
for i in range(0,size):
    print feed['entries'][i].title
#-------- end source

What is the expected output? What do you see instead?
The output is as expected, but parsing takes more than 4 minutes per feed.

What version of the product are you using? On what operating system?
Raspberry Pi with standard image
Python 2.7.3
feedparser 5.1.3
BeautifulSoup 3.2.1 (and bs4 installed)

Please provide any additional information below.

Parsing is extremely slow. It does not seem to be an encoding issue, as the 
eevblog feed is slow too: not as slow as the German feeds, but slow.
I have attached the output from running "python -m cProfile myprogram.py" with 
the "www.heise.de" feed.

The following lines seem to account for most of the time:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1538704   48.838    0.000   52.868    0.000 codingstatemachine.py:40(next_state)
      284   31.191    0.110   56.701    0.200 mbcharsetprober.py:52(feed)
     1425   87.178    0.061  115.210    0.081 sbcharsetprober.py:63(feed)
       95   18.991    0.200   42.172    0.444 utf8prober.py:50(feed)
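
For anyone who wants to reproduce a table like the one above without scrolling through the full profiler dump, here is a minimal sketch using the standard cProfile and pstats modules. The helper name profile_call and the stand-in workload busy_work are made up for illustration; on the Pi you would pass feedparser.parse and a feed URL instead. It is written for Python 3; on Python 2.7 io.StringIO would need to be swapped for StringIO.StringIO.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report text).

    The report lists the ten entries with the highest internal time
    (the "tottime" column).
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("tottime").print_stats(10)
    return result, stream.getvalue()


def busy_work(n):
    # Hypothetical stand-in workload; replace with feedparser.parse(url).
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    _, report = profile_call(busy_work, 100000)
    print(report)
```

Sorting by "tottime" puts the functions that burn the most time in their own body at the top, which is how the four lines above were picked out.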

Running the above command also ended with an error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/cProfile.py", line 199, in <module>
    main()
  File "/usr/lib/python2.7/cProfile.py", line 192, in main
    runctx(code, globs, None, options.outfile, options.sort)
  File "/usr/lib/python2.7/cProfile.py", line 49, in runctx
    prof = prof.runctx(statement, globals, locals)
  File "/usr/lib/python2.7/cProfile.py", line 140, in runctx
    exec cmd in globals, locals
  File "feedreader.py", line 14, in <module>
    print feed['entries'][i].title
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 19: ordinal not in range(128)
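
This error looks unrelated to the slowness: the traceback shows the title being encoded with the 'ascii' codec, which is what typically happens when stdout has no usable encoding (for example when the output is redirected to a file). A minimal sketch of one way around it, assuming it is acceptable to replace characters the target encoding cannot represent; safe_text and the sample title are hypothetical names, not part of feedparser:

```python
import sys


def safe_text(text, encoding=None):
    """Return text limited to what `encoding` can represent.

    Characters that cannot be encoded are replaced with '?'.
    """
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, "replace").decode(encoding)


title = u"Stra\xdfe"  # hypothetical title containing U+00DF, as in the traceback

try:
    title.encode("ascii")  # the same failure the print statement ran into
except UnicodeEncodeError as exc:
    print("encode failed: %s" % exc)

print(safe_text(title, "ascii"))
```

Encoding explicitly to UTF-8 before writing the output would keep the characters intact where the terminal or file can handle it.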

Perhaps this helps in figuring out what went wrong. I've searched for days now 
and still don't have a clue...

Hopefully my description makes sense to someone... On my Mac the same program 
runs extremely fast.

Original issue reported on code.google.com by drthomas...@googlemail.com on 16 Nov 2013 at 9:22

Attachments:

check3.txt

GoogleCodeExporter commented 9 years ago

I just ran the tests to complete my description. There were 3 failures:
======================================================================
FAIL: test_001279 (__main__.TestStrictParser)
./tests/wellformed/mf_hcard/3-5-5-org-unicode.xml: hcard contains non-ascii character
----------------------------------------------------------------------
Traceback (most recent call last):
  File "feedparsertest.py", line 752, in <lambda>
    self.failUnlessEval(xmlfile, evalString)
  File "feedparsertest.py", line 166, in failUnlessEval
    raise self.failureException, failure
AssertionError: not eval(not bozo and entries[0]['vcard'] == u"BEGIN:vCard\nVERSION:3.0\nORG:\u00b4\nEND:vCard") 
WITH env({'bozo': 0,
 'encoding': u'utf-8',
 'entries': [{'content': [{'base': u'',
                           'language': None,
                           'type': u'text/html',
                           'value': u'<div class="vcard">\n<span class="org">\xb4</span>\n</div>'}],
              'summary': u'<div class="vcard">\n<span class="org">\xb4</span>\n</div>',
              'vcard': u'BEGIN:vCard\nVERSION:3.0\nORG:\xc2\xb4\nEND:vCard'}],
 'feed': {},
 'namespaces': {'': u'http://www.w3.org/2005/Atom'},
 'version': u'atom10'})

======================================================================
FAIL: test_001279 (__main__.TestLooseParser)
./tests/wellformed/mf_hcard/3-5-5-org-unicode.xml: hcard contains non-ascii character
----------------------------------------------------------------------
Traceback (most recent call last):
  File "feedparsertest.py", line 752, in <lambda>
    self.failUnlessEval(xmlfile, evalString)
  File "feedparsertest.py", line 166, in failUnlessEval
    raise self.failureException, failure
AssertionError: not eval(not bozo and entries[0]['vcard'] == u"BEGIN:vCard\nVERSION:3.0\nORG:\u00b4\nEND:vCard") 
WITH env({'bozo': 0,
 'encoding': u'utf-8',
 'entries': [{'content': [{'base': u'',
                           'language': None,
                           'type': u'text/html',
                           'value': u'<div class="vcard">\n<span class="org">\xb4</span>\n</div>'}],
              'summary': u'<div class="vcard">\n<span class="org">\xb4</span>\n</div>',
              'vcard': u'BEGIN:vCard\nVERSION:3.0\nORG:\xc2\xb4\nEND:vCard'}],
 'feed': {},
 'namespaces': {'': u'http://www.w3.org/2005/Atom'},
 'version': u'atom10'})

======================================================================
FAIL: test_000018 (__main__.TestMicroformats)
./tests/microformats/hcard/3-1-1-fn-unicode-char.xml: unicode character in microformat
----------------------------------------------------------------------
Traceback (most recent call last):
  File "feedparsertest.py", line 752, in <lambda>
    self.failUnlessEval(xmlfile, evalString)
  File "feedparsertest.py", line 166, in failUnlessEval
    raise self.failureException, failure
AssertionError: not eval(not bozo and entries[0].vcard == u'BEGIN:vCard\nVERSION:3.0\nFN:Tantek \xc7elik\nN:\xc7elik;Tantek\nURL:http://tantek.com/\nEND:vCard') 
WITH env({'bozo': 0,
 'encoding': u'utf-8',
 'entries': [{'content': [{'base': u'',
                           'language': None,
                           'type': u'text/html',
                           'value': u'<span class="vcard"><a class="url fn" href="http://tantek.com/">Tantek \xc7elik</a></span>'}],
              'summary': u'<span class="vcard"><a class="url fn" href="http://tantek.com/">Tantek \xc7elik</a></span>',
              'vcard': u'BEGIN:vCard\nVERSION:3.0\nFN:Tantek \u0102&Dagger\\;elik\nN:\u0102&Dagger\\;elik;Tantek\nURL:http://tantek.com/\nEND:vCard'}],
 'feed': {},
 'namespaces': {'content': u'http://purl.org/rss/1.0/modules/content/'},
 'version': u'rss20'})

----------------------------------------------------------------------
Ran 4384 tests in 314.721s

Original comment by drthomas...@googlemail.com on 16 Nov 2013 at 10:24

GoogleCodeExporter commented 9 years ago

Thanks for this information! I appreciate that you ran the unit tests, but 
don't fret about those test failures -- I've removed the microformat parsing 
completely and that'll be in the next release of feedparser.

Quick question: have you tried disabling the HTML sanitization to compare 
speeds? Also, uninstalling BeautifulSoup may really help speed things up. The 
microformat code, which relies on it, was very slow.
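
A rough sketch of such a comparison, in case it helps. It assumes the module-level SANITIZE_HTML switch that feedparser 5.x exposes; time_call and compare_sanitization are hypothetical helper names, and the measured time includes the download, so each setting should be run more than once before drawing conclusions.

```python
import time


def time_call(func, *args, **kwargs):
    """Return (elapsed seconds, result) for a single call."""
    start = time.time()
    result = func(*args, **kwargs)
    return time.time() - start, result


def compare_sanitization(url):
    """Parse `url` with sanitization on and off and return the timings.

    Requires the third-party feedparser package and network access.
    """
    import feedparser

    timings = {}
    for sanitize in (1, 0):
        # Assumption: SANITIZE_HTML is the module-level flag in feedparser 5.x.
        feedparser.SANITIZE_HTML = sanitize
        elapsed, feed = time_call(feedparser.parse, url)
        timings[sanitize] = (elapsed, len(feed["entries"]))
    return timings


# Example, using one of the slow feeds from the report:
# print(compare_sanitization("http://www.heise.de/newsticker/heise-atom.xml"))
```

If both timings come out about the same, sanitization is probably not the bottleneck and the character set probers in the profile are the more likely suspects.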

I have an rpi and may have an opportunity to test this in the future.

Original comment by kurtmckee on 10 Jul 2014 at 2:20

Changed state: Accepted
