GoogleCodeExporter closed 9 years ago
The following patch fixes this issue for me.
Original comment by nikolay....@gmail.com
on 10 Jan 2010 at 10:54
Attachments:
This patch seems to break:
./tests/wellformed/rss/item_description_not_a_doctype2.xml
with 2.5.2 on Ubuntu.
Do all the tests pass for you?
Original comment by adewale
on 10 Jan 2010 at 7:47
@Nikolay: I installed lxml 2.2.6 and mechanize 0.1.11 on my system and wasn't
able to reproduce the segfault using the latest available feedparser code. The
segfault you posted is occurring in lxml, although having looked through their
site it's apparent that segfaults aren't always their fault (for instance, it's
possible to modify things while lxml is parsing that put it in a state leading
up to a segfault).
Would you download the latest version of feedparser in svn trunk [1] and see if
this has been fixed? If not, would you also try installing the latest version
of lxml to see if the issue has been fixed on their end?
[1]: https://feedparser.googlecode.com/svn/trunk/feedparser/feedparser.py
Original comment by kurtmckee
on 6 Dec 2010 at 6:52
Downloaded [1] and got the same issue:
Python 2.5.5 (r255:77872, Feb 1 2010, 19:53:42)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import feedparser
>>> feedparser.__version__
'4.2-pre--svn'
>>> len(feedparser.parse('/tmp/dirson.xml'))
[1] 20447 segmentation fault python
Same thing after upgrading lxml from 2.2.4 to 2.3.beta1. Same with mechanize-0.2.4.
Original comment by nikolay....@gmail.com
on 6 Dec 2010 at 7:13
BTW my patch from comment 1 fixes the issue for me (and breaks one feedparser
test).
Original comment by nikolay....@gmail.com
on 6 Dec 2010 at 7:15
I'm actually confused about how lxml and mechanize would be doing anything
here, so I'd like to make sure I know what we're dealing with.
Would you run the diagnosis script I wrote here, copy and paste its output into
a text document, and attach it as a file to this bug? If you paste it directly
into the comment box when replying, it will be more difficult to read. Please
paste it unmodified in its entirety into the text document and attach it as a
file.
Thanks in advance; hopefully the script gives me all the information I need to
start figuring out what's going on!
Original comment by kurtmckee
on 7 Dec 2010 at 6:39
The script fails on:
default_parser_list:
Traceback (most recent call last):
File "help_diagnose_issue_197.py", line 8, in <module>
print "default_parser_list: ", repr(xml.sax.default_parser_list)
AttributeError: 'module' object has no attribute 'default_parser_list'
But if I comment out the xml.sax-related section, it works.
I have attached the output.
Original comment by nikolay....@gmail.com
on 7 Dec 2010 at 8:13
Attachments:
@Nikolay: Thanks for running the script, I really appreciate it.
I don't know why there's no `default_parser_list` attribute in xml.sax, but
that's OK for me at the moment.
What I'm more interested in is that the test document that's attached to this
bug report parsed correctly (you can look for the text "number of entries
found" in the text file you just uploaded - it appears twice). It appears that
the file you referenced in your tmp directory differs from the sample document
that you originally uploaded to the bug report. Are you certain that the files
are identical?
(Meanwhile, tomorrow I'll look to see if feedparser can make validating parsers
ignore external DTDs, which may be what's causing a problem now that the DTD
URL redirects to netscape.aol.com.)
Original comment by kurtmckee
on 7 Dec 2010 at 10:14
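For reference, a minimal sketch of what "ignore external DTDs" could look like with the standard library SAX interface. This is not feedparser's code; the sample feed, the `ItemCounter` handler, the `count_items` helper, and the DTD URL are all hypothetical placeholders. It only shows the two standard SAX feature flags that ask a parser not to load external entities.

```python
import io
import xml.sax
from xml.sax.handler import (
    ContentHandler,
    feature_external_ges,
    feature_external_pes,
)

# Hypothetical stand-in feed; the DTD URL is a placeholder, not the real one.
SAMPLE = b"""<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE rss SYSTEM "http://example.invalid/rss.dtd">
<rss version="0.91">
  <channel>
    <title>demo</title>
    <item><title>one</title></item>
    <item><title>two</title></item>
  </channel>
</rss>
"""


class ItemCounter(ContentHandler):
    """Counts <item> start tags (illustrative handler only)."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.items = 0

    def startElement(self, name, attrs):
        if name == "item":
            self.items += 1


def count_items(data):
    parser = xml.sax.make_parser()
    # Ask the parser not to load external general or parameter entities,
    # so the DTD named in the DOCTYPE should not be fetched.
    parser.setFeature(feature_external_ges, False)
    parser.setFeature(feature_external_pes, False)
    handler = ItemCounter()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler.items
```

Calling `count_items(SAMPLE)` parses the document without needing the placeholder DTD host to exist. Whether a third-party driver such as drv_libxml2 honours the same flags is an open question here.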
Yes, it parses correctly when feedparser.PREFERRED_XML_PARSERS = []
And it works if I run help_diagnose_issue_197.py
But if I copy and paste it into the Python shell, it fails:
Python 2.5.5 (r255:77872, Feb 1 2010, 19:53:42)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import feedparser
>>> print "feedparser location: ", repr(feedparser.__file__)
feedparser location: 'feedparser.pyc'
>>> print "feedparser version: ", repr(feedparser.__version__)
feedparser version: '4.2-pre--svn'
>>> len(feedparser.parse("http://feedparser.googlecode.com/issues/attachment?aid=2294203429072301544&name=dirson.xml&token=2762e53f06d375d55fbd37d51e282f72"))
[1] 3228 segmentation fault python
Same for
http://feedparser.googlecode.com/issues/attachment?aid=2294203429072301544&name=dirson.xml&token=2762e53f06d375d55fbd37d51e282f72
and /tmp/dirson.xml
(python-env)[15:03 /tmp]$ python
[139] (nik@laptop.niksite.ru)
Python 2.5.5 (r255:77872, Feb 1 2010, 19:53:42)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print "sys.version: ", repr(sys.version)
sys.version: '2.5.5 (r255:77872, Feb 1 2010, 19:53:42) \n[GCC 4.4.3]'
>>> print "sys.path:"
sys.path:
>>> for i in sys.path:
...     print repr(i)
...
''
'/home/niksite/python-env/src/python-twitter'
'/home/niksite/python-env/src/python-oauth2'
'/tmp'
'/home/niksite/webapps/django/datamining'
'/home/niksite/webapps/django/datamining/dist'
'/home/niksite/webapps/django/datamining/apps'
'/home/niksite/python-env/lib/python2.5'
'/home/niksite/python-env/lib/python2.5/plat-linux2'
'/home/niksite/python-env/lib/python2.5/lib-tk'
'/home/niksite/python-env/lib/python2.5/lib-dynload'
'/usr/lib/python2.5'
'/usr/lib/python2.5/plat-linux2'
'/usr/lib/python2.5/lib-tk'
'/home/niksite/python-env/lib/python2.5/site-packages'
'/home/niksite/webapps/django'
'/usr/local/lib/python2.5/site-packages'
'/usr/lib/python2.5/site-packages'
'/usr/lib/python2.5/site-packages/Numeric'
'/usr/lib/python2.5/site-packages/PIL'
'/usr/lib/pymodules/python2.5'
'/usr/lib/pymodules/python2.5/gtk-2.0'
'/usr/lib/python2.5/site-packages/wx-2.6-gtk2-unicode'
>>> import xml.sax
>>> # print "default_parser_list: ", repr(xml.sax.default_parser_list)
... # p = xml.sax.make_parser([])
... # print "1__doc__: ", repr(getattr(p, '__doc__', None))
... # print "1__module__: ", repr(getattr(p, '__module__', None))
... # print "1__file__: ", repr(getattr(p, '__file__', None))
... # p = xml.sax.make_parser(['drv_libxml2'])
... # print "2__doc__: ", repr(getattr(p, '__doc__', None))
... # print "2__module__: ", repr(getattr(p, '__module__', None))
... # print "2__file__: ", repr(getattr(p, '__file__', None))
...
>>> import feedparser
>>> print "feedparser location: ", repr(feedparser.__file__)
feedparser location: 'feedparser.pyc'
>>> print "feedparser version: ", repr(feedparser.__version__)
feedparser version: '4.2-pre--svn'
>>>
>>> # Try using a default XML parser, listed above
... feedparser.PREFERRED_XML_PARSERS = []
>>> # Download and parse the xml file named dirson.xml attached to feedparser issue 197
... a = feedparser.parse("/tmp/dirson.xml")
>>> print "1 number of entries found: ", repr(len(a.entries))
1 number of entries found: 15
>>>
>>> # Change PREFERRED_XML_PARSERS back to its default
... feedparser.PREFERRED_XML_PARSERS = ["drv_libxml2"]
>>> # Turn on debugging
... feedparser._debug = 1
>>> # Download and parse the xml file named dirson.xml attached to feedparser issue 197
... a = feedparser.parse("/tmp/dirson.xml")
entering _toUTF8, trying encoding iso-8859-1
successfully converted iso-8859-1 data to unicode
trying StrictFeedParser
initializing FeedParser
[1] 20925 segmentation fault python
Original comment by nikolay....@gmail.com
on 7 Dec 2010 at 12:08
What.
I'm not going to give up on this, but that's not what I wanted to read, heh.
Would you be willing to run another diagnosis script? I've included a call to
get the current working directory, and have wrapped all of the xml.sax stuff so
that it *should* run without giving an error.
I'd like to have the output both from running the file, and from copying and
pasting the commands into the Python console. My goal is to figure out what XML
parser is being used by feedparser, and hopefully why there's a difference
between pasting the commands in versus running the file.
Original comment by kurtmckee
on 7 Dec 2010 at 7:37
Attachments:
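For readers without the attachment, here is a rough sketch of how the xml.sax checks might be wrapped so a missing attribute no longer aborts the run. It is an assumption about the approach, not the attached script; the `describe_sax_parsers` helper and its dictionary keys are made up for illustration.

```python
import xml.sax


def describe_sax_parsers(preferred=("drv_libxml2",)):
    """Collect parser details without failing on missing attributes."""
    info = {}
    # getattr with a default avoids the AttributeError from the traceback
    # when xml.sax has no default_parser_list attribute.
    info["default_parser_list"] = getattr(xml.sax, "default_parser_list", None)
    for label, names in (("default", []), ("preferred", list(preferred))):
        try:
            # make_parser tries the named drivers first, then the defaults.
            parser = xml.sax.make_parser(names)
            info[label] = type(parser).__module__
        except Exception as exc:  # record the failure and keep going
            info[label] = "error: %r" % (exc,)
    return info
```

Printing the returned dictionary shows which module actually supplied the parser in each case, which is the piece of information the diagnosis is after.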
Ok, btw I see no difference.
Original comment by nikolay....@gmail.com
on 7 Dec 2010 at 7:57
Attachments:
OK, I have found the difference.
In my .pythonstartup I have 'import lxml.html'.
I have just added this line to help_diagnose_issue_197.py and got the following:
(python-env)[23:06 /tmp]$ python help_diagnose_issue_197.py
[0] (nik@laptop.niksite.ru)
sys.version: '2.5.5 (r255:77872, Feb 1 2010, 19:53:42) \n[GCC 4.4.3]'
sys.path:
'/tmp'
'/home/niksite/python-env/src/python-twitter'
'/home/niksite/python-env/src/python-oauth2'
'/tmp'
'/home/niksite/webapps/django/datamining'
'/home/niksite/webapps/django/datamining/dist'
'/home/niksite/webapps/django/datamining/apps'
'/home/niksite/python-env/lib/python2.5'
'/home/niksite/python-env/lib/python2.5/plat-linux2'
'/home/niksite/python-env/lib/python2.5/lib-tk'
'/home/niksite/python-env/lib/python2.5/lib-dynload'
'/usr/lib/python2.5'
'/usr/lib/python2.5/plat-linux2'
'/usr/lib/python2.5/lib-tk'
'/home/niksite/python-env/lib/python2.5/site-packages'
'/home/niksite/webapps/django'
'/usr/local/lib/python2.5/site-packages'
'/usr/lib/python2.5/site-packages'
'/usr/lib/python2.5/site-packages/Numeric'
'/usr/lib/python2.5/site-packages/PIL'
'/usr/lib/pymodules/python2.5'
'/usr/lib/pymodules/python2.5/gtk-2.0'
'/usr/lib/python2.5/site-packages/wx-2.6-gtk2-unicode'
feedparser location: '/tmp/feedparser.pyc'
feedparser version: '4.2-pre--svn'
1 number of entries found: 15
entering _toUTF8, trying encoding iso-8859-1
successfully converted iso-8859-1 data to unicode
trying StrictFeedParser
initializing FeedParser
[1] 32282 segmentation fault python help_diagnose_issue_197.py
Without this line, there is no fault.
Original comment by nikolay....@gmail.com
on 7 Dec 2010 at 8:09
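A small sketch of how this kind of shell-versus-script difference could be spotted earlier. Python runs the file named by the PYTHONSTARTUP environment variable only in interactive sessions, so anything it imports is present in the shell but absent when running a script. The `startup_report` helper and its default list of suspect modules are hypothetical.

```python
import os
import sys


def startup_report(suspects=("lxml", "lxml.html", "libxml2")):
    """Report the startup file setting and which suspect modules are loaded."""
    return {
        # Path of the interactive startup file, or None if it is not set.
        "interactive_startup_file": os.environ.get("PYTHONSTARTUP"),
        # Modules from the suspect list that this process has already imported.
        "already_imported": [name for name in suspects if name in sys.modules],
    }
```

Calling `startup_report()` both in the interactive shell and at the top of the script, then comparing the two results, would show lxml.html loaded in one and not the other.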
Okay, I'm now able to recreate the segmentation fault. This isn't a feedparser
bug, as I've demonstrated in the attached sample script. If lxml.html is
imported at any time, it causes a segfault in libxml2 when parsing dirson.xml.
I recommend filing a bug with the lxml project; according to their FAQ [1] you
will need to post a message to their mailing list, although their mailing list
page says that the bug should be filed in the bug tracker [2], which may
already have a bug listed about this [3].
It may be appropriate to link to this bug report either in the open bug report
I linked to in [3], or when opening a new bug or posting to their mailing list.
[1]: http://codespeak.net/lxml/FAQ.html#bugs
[2]: https://bugs.launchpad.net/lxml
[3]: https://bugs.launchpad.net/lxml/+bug/502959
@Adewale: Please close this bug.
Original comment by kurtmckee
on 7 Dec 2010 at 9:46
Attachments:
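For anyone who wants to check this without losing their session, one option is to run each variant in a child interpreter and inspect its exit status. This is only a sketch, not the attached sample script; `run_isolated`, `describe`, and the two snippet constants are names made up here, and the snippets assume feedparser, lxml, and /tmp/dirson.xml are available.

```python
import subprocess
import sys


def run_isolated(code):
    """Run a snippet in a fresh interpreter and return its exit status."""
    # "-c" runs non-interactively, so no PYTHONSTARTUP file is executed.
    return subprocess.call([sys.executable, "-c", code])


def describe(status):
    """Turn an exit status into a short message."""
    # On POSIX, a negative status means the child was killed by that signal
    # (a segmentation fault is usually reported as -11).
    if status < 0:
        return "killed by signal %d" % -status
    return "exited with status %d" % status


# Hypothetical repro snippets based on the steps described in this thread.
WITH_LXML = "import lxml.html, feedparser; feedparser.parse('/tmp/dirson.xml')"
WITHOUT_LXML = "import feedparser; feedparser.parse('/tmp/dirson.xml')"
```

Comparing `describe(run_isolated(WITH_LXML))` against `describe(run_isolated(WITHOUT_LXML))` on an affected machine should show whether the lxml.html import alone makes the difference.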
Original comment by adewale
on 13 Dec 2010 at 1:42
Original issue reported on code.google.com by
nikolay....@gmail.com
on 5 Jan 2010 at 7:55
Attachments: