dimones / feedparser

Automatically exported from code.google.com/p/feedparser

segmentation fault during feed parsing #197

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I have tried to parse a feed from http://google.dirson.com/rss.php:
$ python
Python 2.5.4 (r254:67916, Nov 19 2009, 19:46:21) 
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import feedparser
>>> feedparser.__version__
'4.2-pre-303-svn'
>>> len(feedparser.parse('/tmp/dirson.xml'))
[1]    16610 segmentation fault  python

It is very strange, but I see this issue only when lxml-2.2.4 and
mechanize-0.1.11 are installed (the same happens if httplib2 is installed).

BTW, I ran strace on the Python interpreter, and the last lines are the
following:
stat64("http://my.netscape.com/publish/formats/rss-0.91.dtd", 0xbfb4f844) = -1 ENOENT (No such file or directory)
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV +++

Then I removed the DTD declaration from the feed and got the following:
$ python
Python 2.5.4 (r254:67916, Nov 19 2009, 19:46:21) 
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import feedparser
>>> feedparser.__version__
'4.2-pre-303-svn'
>>> len(feedparser.parse('/tmp/dirson-nodtd.xml'))
6
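
For readers who want to repeat the "no DTD" experiment, here is a rough sketch of stripping the DOCTYPE declaration before parsing. It is not the patch attached later in this thread. The helper name `strip_doctype` and the sample feed are made up for illustration, and the regular expression only handles a simple DOCTYPE with no internal subset and no `>` inside its quoted strings.

```python
import re

# Matches a simple DOCTYPE declaration without an internal subset, e.g.
# <!DOCTYPE rss PUBLIC "..." "http://.../rss-0.91.dtd">
# Assumption: the quoted identifiers contain neither '>' nor '['.
_DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>\[]*>", re.IGNORECASE)


def strip_doctype(xml_text):
    """Return xml_text with the first simple DOCTYPE declaration removed."""
    return _DOCTYPE_RE.sub("", xml_text, count=1)


# Hypothetical stand-in for dirson.xml; only the DTD URL comes from the report.
sample = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" '
    '"http://my.netscape.com/publish/formats/rss-0.91.dtd">\n'
    '<rss version="0.91"><channel><title>t</title></channel></rss>'
)

cleaned = strip_doctype(sample)
print(cleaned)
```

A regex like this can also match DOCTYPE-looking text that appears inside element content, so it is only suitable as a quick diagnostic, not as a general fix.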

I have also tested this feed with
http://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fgoogle.dirson.com%2Frss.php
and got the following message: "The use of this DTD has been deprecated by
Netscape".

So it seems that something is wrong with parsing the deprecated DTD.

Original issue reported on code.google.com by nikolay....@gmail.com on 5 Jan 2010 at 7:55

Attachments:

GoogleCodeExporter commented 9 years ago
The following patch fixes this issue for me.

Original comment by nikolay....@gmail.com on 10 Jan 2010 at 10:54

Attachments:

GoogleCodeExporter commented 9 years ago
This patch seems to break: 
./tests/wellformed/rss/item_description_not_a_doctype2.xml 
with 2.5.2 on Ubuntu.

Do all the tests pass for you?

Original comment by adewale on 10 Jan 2010 at 7:47

GoogleCodeExporter commented 9 years ago
@Nikolay: I installed lxml 2.2.6 and mechanize 0.1.11 on my system and wasn't
able to reproduce the segfault using the latest available feedparser code. The
segfault you posted is occurring in lxml, although having looked through their 
site it's apparent that segfaults aren't always their fault (for instance, it's 
possible to modify things while lxml is parsing that put it in a state leading 
up to a segfault).

Would you download the latest version of feedparser in svn trunk [1] and see if 
this has been fixed? If not, would you also try installing the latest version 
of lxml to see if the issue has been fixed on their end?

[1]: https://feedparser.googlecode.com/svn/trunk/feedparser/feedparser.py

Original comment by kurtmckee on 6 Dec 2010 at 6:52

GoogleCodeExporter commented 9 years ago
I downloaded [1] and got the same issue:

Python 2.5.5 (r255:77872, Feb  1 2010, 19:53:42) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import feedparser
>>> feedparser.__version__
'4.2-pre--svn'
>>> len(feedparser.parse('/tmp/dirson.xml'))
[1]    20447 segmentation fault  python

The same thing happens after upgrading lxml from 2.2.4 to 2.3.beta1, and also with mechanize-0.2.4.

Original comment by nikolay....@gmail.com on 6 Dec 2010 at 7:13

GoogleCodeExporter commented 9 years ago
BTW, my patch from comment 1 fixes the issue for me (and breaks one
feedparser test).

Original comment by nikolay....@gmail.com on 6 Dec 2010 at 7:15

GoogleCodeExporter commented 9 years ago
I'm actually confused about how lxml and mechanize would be doing anything 
here, so I'd like to make sure I know what we're dealing with.

Would you run the diagnosis script I wrote here, copy and paste its output into 
a text document, and attach it as a file to this bug? If you paste it directly 
into the comment box when replying, it will be more difficult to read. Please 
paste it unmodified in its entirety into the text document and attach it as a 
file.

Thanks in advance; hopefully the script gives me all the information I need to 
start figuring out what's going on!

Original comment by kurtmckee on 7 Dec 2010 at 6:39

GoogleCodeExporter commented 9 years ago

Original comment by kurtmckee on 7 Dec 2010 at 6:39

Attachments:

GoogleCodeExporter commented 9 years ago
The script fails on:

default_parser_list: 
Traceback (most recent call last):
  File "help_diagnose_issue_197.py", line 8, in <module>
    print "default_parser_list: ", repr(xml.sax.default_parser_list)
AttributeError: 'module' object has no attribute 'default_parser_list'

But if I comment out the xml.sax-related section, it works.
I have attached the output.

Original comment by nikolay....@gmail.com on 7 Dec 2010 at 8:13

Attachments:

GoogleCodeExporter commented 9 years ago
@Nikolay: Thanks for running the script; I really appreciate it.

I don't know why there's no `default_parser_list` attribute in xml.sax, but 
that's OK for me at the moment.

What I'm more interested in is that the test document that's attached to this 
bug report parsed correctly (you can look for the text "number of entries 
found" in the text file you just uploaded - it appears twice). It appears that 
the file you referenced in your tmp directory differs from the sample document 
that you originally uploaded to the bug report. Are you certain that the files 
are identical?

(Meanwhile, tomorrow I'll look to see if feedparser can make validating parsers 
ignore external DTDs, which may be what's causing a problem now that the DTD 
URL redirects to netscape.aol.com.)
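
As an illustration of what "ignoring external DTDs" can look like with the standard library's SAX interface, here is a minimal sketch. It uses whatever parser `xml.sax.make_parser()` returns (normally expat), not feedparser's own parser setup, and `drv_libxml2` may respond to these features differently. `ItemCounter`, `count_items_without_external_dtd`, and the sample feed are hypothetical names and data.

```python
import io
import xml.sax
from xml.sax.handler import (
    ContentHandler,
    feature_external_ges,
    feature_external_pes,
)


class ItemCounter(ContentHandler):
    """Counts <item> elements; stands in for a real feed handler."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.items = 0

    def startElement(self, name, attrs):
        if name == "item":
            self.items += 1


def count_items_without_external_dtd(xml_bytes):
    parser = xml.sax.make_parser()
    # Ask the parser not to fetch external general or parameter entities,
    # which covers the external DTD subset named in the DOCTYPE.
    parser.setFeature(feature_external_ges, False)
    parser.setFeature(feature_external_pes, False)
    handler = ItemCounter()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.items


# Hypothetical stand-in for dirson.xml; only the DTD URL comes from the report.
sample = (
    b'<?xml version="1.0" encoding="iso-8859-1"?>\n'
    b'<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" '
    b'"http://my.netscape.com/publish/formats/rss-0.91.dtd">\n'
    b'<rss version="0.91"><channel><title>t</title>'
    b'<item><title>a</title></item><item><title>b</title></item>'
    b'</channel></rss>'
)

print(count_items_without_external_dtd(sample))
```

With both features switched off, the document is parsed without any attempt to open the DTD URL, so the result does not depend on where that URL currently redirects.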

Original comment by kurtmckee on 7 Dec 2010 at 10:14

GoogleCodeExporter commented 9 years ago
Yes, it parses correctly if feedparser.PREFERRED_XML_PARSERS = [].
It also works if I run help_diagnose_issue_197.py.
But if I copy and paste it into the Python shell, it fails:

Python 2.5.5 (r255:77872, Feb  1 2010, 19:53:42) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import feedparser
>>> print "feedparser location: ", repr(feedparser.__file__)
feedparser location:  'feedparser.pyc'
>>> print "feedparser version: ", repr(feedparser.__version__)
feedparser version:  '4.2-pre--svn'
>>> len(feedparser.parse("http://feedparser.googlecode.com/issues/attachment?aid=2294203429072301544&name=dirson.xml&token=2762e53f06d375d55fbd37d51e282f72"))
[1]    3228 segmentation fault  python

Same for
http://feedparser.googlecode.com/issues/attachment?aid=2294203429072301544&name=dirson.xml&token=2762e53f06d375d55fbd37d51e282f72
and /tmp/dirson.xml

(python-env)[15:03 /tmp]$ python    [139] (nik@laptop.niksite.ru)
Python 2.5.5 (r255:77872, Feb  1 2010, 19:53:42) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print "sys.version: ", repr(sys.version)
sys.version:  '2.5.5 (r255:77872, Feb  1 2010, 19:53:42) \n[GCC 4.4.3]'
>>> print "sys.path:"
sys.path:
>>> for i in sys.path:
...     print repr(i)
... 
''
'/home/niksite/python-env/src/python-twitter'
'/home/niksite/python-env/src/python-oauth2'
'/tmp'
'/home/niksite/webapps/django/datamining'
'/home/niksite/webapps/django/datamining/dist'
'/home/niksite/webapps/django/datamining/apps'
'/home/niksite/python-env/lib/python2.5'
'/home/niksite/python-env/lib/python2.5/plat-linux2'
'/home/niksite/python-env/lib/python2.5/lib-tk'
'/home/niksite/python-env/lib/python2.5/lib-dynload'
'/usr/lib/python2.5'
'/usr/lib/python2.5/plat-linux2'
'/usr/lib/python2.5/lib-tk'
'/home/niksite/python-env/lib/python2.5/site-packages'
'/home/niksite/webapps/django'
'/usr/local/lib/python2.5/site-packages'
'/usr/lib/python2.5/site-packages'
'/usr/lib/python2.5/site-packages/Numeric'
'/usr/lib/python2.5/site-packages/PIL'
'/usr/lib/pymodules/python2.5'
'/usr/lib/pymodules/python2.5/gtk-2.0'
'/usr/lib/python2.5/site-packages/wx-2.6-gtk2-unicode'
>>> import xml.sax
>>> # print "default_parser_list: ", repr(xml.sax.default_parser_list)
... # p = xml.sax.make_parser([])
... # print "1__doc__: ", repr(getattr(p, '__doc__', None))
... # print "1__module__: ", repr(getattr(p, '__module__', None))
... # print "1__file__: ", repr(getattr(p, '__file__', None))
... # p = xml.sax.make_parser(['drv_libxml2'])
... # print "2__doc__: ", repr(getattr(p, '__doc__', None))
... # print "2__module__: ", repr(getattr(p, '__module__', None))
... # print "2__file__: ", repr(getattr(p, '__file__', None))
... 
>>> import feedparser
>>> print "feedparser location: ", repr(feedparser.__file__)
feedparser location:  'feedparser.pyc'
>>> print "feedparser version: ", repr(feedparser.__version__)
feedparser version:  '4.2-pre--svn'
>>> 
>>> # Try using a default XML parser, listed above
... feedparser.PREFERRED_XML_PARSERS = []
>>> # Download and parse the xml file named dirson.xml attached to feedparser issue 197
... a = feedparser.parse("/tmp/dirson.xml")
>>> print "1 number of entries found: ", repr(len(a.entries))
1 number of entries found:  15
>>> 
>>> # Change PREFERRED_XML_PARSERS back to its default
... feedparser.PREFERRED_XML_PARSERS = ["drv_libxml2"]
>>> # Turn on debugging
... feedparser._debug = 1
>>> # Download and parse the xml file named dirson.xml attached to feedparser issue 197
... a = feedparser.parse("/tmp/dirson.xml")
entering _toUTF8, trying encoding iso-8859-1
successfully converted iso-8859-1 data to unicode
trying StrictFeedParser
initializing FeedParser
[1]    20925 segmentation fault  python

Original comment by nikolay....@gmail.com on 7 Dec 2010 at 12:08

GoogleCodeExporter commented 9 years ago
What.

I'm not going to give up on this, but that's not what I wanted to read, heh. 
Would you be willing to run another diagnosis script? I've included a call to 
get the current working directory, and have wrapped all of the xml.sax stuff so 
that it *should* run without giving an error.

I'd like to have the output both from running the file, and from copying and 
pasting the commands into the Python console. My goal is to figure out what XML 
parser is being used by feedparser, and hopefully why there's a difference 
between pasting the commands in versus running the file.
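
The attached script is not reproduced in this thread. As a rough idea of what "wrapping the xml.sax stuff" might look like, here is a sketch that reports which SAX parser would be chosen without raising when an attribute or driver is missing. `describe_sax_parser` is a hypothetical helper, and `default_parser_list` is treated as possibly absent because it was missing in the environment above.

```python
import xml.sax


def describe_sax_parser(parser_list=()):
    """Collect details about the SAX parser that make_parser() selects."""
    info = {
        # Guarded lookup: the attribute was absent in the reporter's setup.
        "default_parser_list": getattr(xml.sax, "default_parser_list", None),
    }
    try:
        # Unavailable drivers in parser_list are skipped by make_parser().
        parser = xml.sax.make_parser(list(parser_list))
    except xml.sax.SAXReaderNotAvailable as exc:
        info["error"] = str(exc)
        return info
    info["class"] = type(parser).__name__
    info["module"] = type(parser).__module__
    return info


# Same preference list that the diagnosis session uses for feedparser.
for key, value in sorted(describe_sax_parser(["drv_libxml2"]).items()):
    print("%s: %r" % (key, value))
```

Running this both as a file and pasted into the interactive console would show whether the two environments pick different parser modules.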

Original comment by kurtmckee on 7 Dec 2010 at 7:37

Attachments:

GoogleCodeExporter commented 9 years ago
Ok, btw I see no difference.

Original comment by nikolay....@gmail.com on 7 Dec 2010 at 7:57

Attachments:

GoogleCodeExporter commented 9 years ago
OK, I have found the difference.
In my .pythonstartup I have 'import lxml.html'
I have just added this line to help_diagnose_issue_197.py and got the following:

(python-env)[23:06 /tmp]$ python help_diagnose_issue_197.py    [0] (nik@laptop.niksite.ru)
sys.version:  '2.5.5 (r255:77872, Feb  1 2010, 19:53:42) \n[GCC 4.4.3]'
sys.path:
'/tmp'
'/home/niksite/python-env/src/python-twitter'
'/home/niksite/python-env/src/python-oauth2'
'/tmp'
'/home/niksite/webapps/django/datamining'
'/home/niksite/webapps/django/datamining/dist'
'/home/niksite/webapps/django/datamining/apps'
'/home/niksite/python-env/lib/python2.5'
'/home/niksite/python-env/lib/python2.5/plat-linux2'
'/home/niksite/python-env/lib/python2.5/lib-tk'
'/home/niksite/python-env/lib/python2.5/lib-dynload'
'/usr/lib/python2.5'
'/usr/lib/python2.5/plat-linux2'
'/usr/lib/python2.5/lib-tk'
'/home/niksite/python-env/lib/python2.5/site-packages'
'/home/niksite/webapps/django'
'/usr/local/lib/python2.5/site-packages'
'/usr/lib/python2.5/site-packages'
'/usr/lib/python2.5/site-packages/Numeric'
'/usr/lib/python2.5/site-packages/PIL'
'/usr/lib/pymodules/python2.5'
'/usr/lib/pymodules/python2.5/gtk-2.0'
'/usr/lib/python2.5/site-packages/wx-2.6-gtk2-unicode'
feedparser location:  '/tmp/feedparser.pyc'
feedparser version:  '4.2-pre--svn'
1 number of entries found:  15
entering _toUTF8, trying encoding iso-8859-1
successfully converted iso-8859-1 data to unicode
trying StrictFeedParser
initializing FeedParser
[1]    32282 segmentation fault  python help_diagnose_issue_197.py

Without this line, there is no fault.
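
Python reads the file named by the `PYTHONSTARTUP` environment variable only in interactive mode, which would explain why pasting into the console behaves differently from running the file. A small sketch for spotting this kind of difference follows. `startup_report` is a hypothetical helper, and the default module names are just the ones relevant to this report.

```python
import os
import sys


def startup_report(module_names=("lxml", "lxml.html", "lxml.etree")):
    """Summarize startup state that can differ between console and script."""
    return {
        # sys.ps1 is only defined when the interpreter is interactive.
        "interactive": hasattr(sys, "ps1") or bool(sys.flags.interactive),
        # Startup file that an interactive session would have executed.
        "pythonstartup": os.environ.get("PYTHONSTARTUP"),
        # Modules from the list that are already loaded in this process.
        "preloaded": [name for name in module_names if name in sys.modules],
    }


for key, value in sorted(startup_report().items()):
    print("%s: %r" % (key, value))
```

Comparing the "preloaded" entry between the two ways of running the code points directly at an import such as `lxml.html` that only happens in one of them.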

Original comment by nikolay....@gmail.com on 7 Dec 2010 at 8:09

GoogleCodeExporter commented 9 years ago
Okay, I'm now able to recreate the segmentation fault. This isn't a feedparser 
bug, as I've demonstrated in the attached sample script. If lxml.html is 
imported at any time, it causes a segfault in libxml2 when parsing dirson.xml.
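
The attached sample script is not shown here. When checking import-order crashes like this one, it can help to run each variant in a child interpreter so that a segfault does not take down the session. The sketch below assumes a POSIX system and Python 3.7 or later; `describe_exit` and `run_isolated` are hypothetical helpers, and the commented-out snippets are placeholders for the real reproduction.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a negative return code means "killed by signal N".
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "signal %d" % -returncode
        return "killed by " + name
    return "exited with status %d" % returncode


def run_isolated(code):
    """Run code in a fresh interpreter; return (returncode, description)."""
    proc = subprocess.run([sys.executable, "-c", code], capture_output=True)
    return proc.returncode, describe_exit(proc.returncode)


# Placeholder snippets; swap in the real reproduction to compare variants:
#   'import feedparser; feedparser.parse("/tmp/dirson.xml")'
#   'import lxml.html, feedparser; feedparser.parse("/tmp/dirson.xml")'
print(run_isolated("print('no crash here')"))
```

A shell reports the same event as status 128 plus the signal number, which is where the `[139]` in the prompt earlier in this thread comes from (128 + 11 for SIGSEGV).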

I recommend filing a bug with the lxml project; according to their FAQ [1] you 
will need to post a message to their mailing list, although their mailing list 
page says that the bug should be filed in the bug tracker [2], which may 
already have a bug listed about this [3].

It may be appropriate to link to this bug report either in the open bug report 
I linked to in [3], or when opening a new bug or posting to their mailing list.

[1]: http://codespeak.net/lxml/FAQ.html#bugs
[2]: https://bugs.launchpad.net/lxml
[3]: https://bugs.launchpad.net/lxml/+bug/502959

@Adewale: Please close this bug.

Original comment by kurtmckee on 7 Dec 2010 at 9:46

Attachments:

GoogleCodeExporter commented 9 years ago

Original comment by adewale on 13 Dec 2010 at 1:42