BLLIP / bllip-parser

BLLIP reranking parser (also known as Charniak-Johnson parser, Charniak parser, Brown reranking parser). See http://pypi.python.org/pypi/bllipparser/ for the Python module.
http://bllip.cs.brown.edu/

Segmentation fault in `InputTree::printproper` when using a comprehension or loop to collect the heads of parse trees #65

Open brady-ds opened 5 years ago

brady-ds commented 5 years ago

The three lines of code below reliably provoke a segfault for me:

Python 3.6.7 (default, Oct 22 2018, 11:32:17) 
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bllipparser import RerankingParser, Tree
>>> parser = RerankingParser.fetch_and_load('WSJ-PTB3')
>>> [Tree(str(parse.ptb_parse)).head() for parse in parser.parse('singing stars')]
Segmentation fault

(Apologies for the strange choice of text to parse; I had difficulty finding a text that would provoke a crash instead of a hang, at least on my machine.)

In case it is useful, a backtrace is given below:

GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...(no debugging symbols found)...done.
(gdb) r
Starting program: /usr/bin/python3 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Python 3.6.7 (default, Oct 22 2018, 11:32:17) 
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bllipparser import RerankingParser, Tree
>>> parser = RerankingParser.fetch_and_load('WSJ-PTB3')
>>> [Tree(str(parse.ptb_parse)).head() for parse in parser.parse('singing stars')]

Program received signal SIGSEGV, Segmentation fault.
__memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:423
423 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:423
#1  0x00007ffff421c578 in std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long) ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007ffff420ccb4 in std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long) ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff4536cd5 in std::operator<< <char, std::char_traits<char>, std::allocator<char> > (
    __str=<error: Cannot access memory at address 0x21>, __os=...) at /usr/include/c++/7/bits/basic_string.h:6289
#4  InputTree::printproper (this=0xd5bbbc0, os=...) at first-stage/PARSE/InputTree.C:318
#5  0x00007ffff4536c9b in InputTree::printproper (this=0xd5ea050, os=...) at first-stage/PARSE/InputTree.C:330
#6  0x00007ffff4536c9b in InputTree::printproper (this=0xd5e9d30, os=...) at first-stage/PARSE/InputTree.C:330
#7  0x00007ffff44d31e6 in InputTree_toString (self=0xd5e9d30) at first-stage/PARSE/swig/wrapper.C:6389
#8  _wrap_InputTree_toString (self=<optimized out>, args=<optimized out>)
    at first-stage/PARSE/swig/wrapper.C:22318
#9  0x0000000000502d6f in ?? ()
#10 0x0000000000506859 in _PyEval_EvalFrameDefault ()
#11 0x0000000000501945 in _PyFunction_FastCallDict ()
#12 0x0000000000591461 in ?? ()
#13 0x00000000005a337c in _PyObject_FastCallDict ()
#14 0x0000000000544f0a in ?? ()
#15 0x0000000000563e3e in PyObject_Str ()
#16 0x00000000005240e5 in ?? ()
#17 0x00000000005553b5 in ?? ()
#18 0x00000000005a730c in _PyObject_FastCallKeywords ()
#19 0x0000000000503073 in ?? ()
#20 0x0000000000506859 in _PyEval_EvalFrameDefault ()
#21 0x0000000000501945 in _PyFunction_FastCallDict ()
#22 0x0000000000591461 in ?? ()
#23 0x00000000005a337c in _PyObject_FastCallDict ()
#24 0x000000000061a398 in ?? ()
#25 0x0000000000563cc1 in PyObject_Repr ()
#26 0x0000000000585acd in ?? ()
#27 0x0000000000563cc1 in PyObject_Repr ()
#28 0x00000000006253ec in PyFile_WriteObject ()
#29 0x000000000063409c in ?? ()
#30 0x0000000000565bd1 in _PyCFunction_FastCallDict ()
#31 0x0000000000599680 in PyObject_CallFunctionObjArgs ()
#32 0x000000000050a4b1 in _PyEval_EvalFrameDefault ()
#33 0x0000000000504c28 in ?? ()
#34 0x0000000000506393 in PyEval_EvalCode ()
#35 0x0000000000634d52 in ?? ()
#36 0x00000000004a38c5 in ?? ()
#37 0x00000000004a5cd5 in PyRun_InteractiveLoopFlags ()
#38 0x00000000006387b3 in PyRun_AnyFileExFlags ()
#39 0x000000000063915a in Py_Main ()
#40 0x00000000004a6f10 in main ()

Incidentally, is there a reason why just

Python 3.6.7 (default, Oct 22 2018, 11:32:17) 
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bllipparser import RerankingParser
>>> parser = RerankingParser.fetch_and_load('WSJ-PTB3')
>>> [parse.ptb_parse.head() for parse in parser.parse('singing stars')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "/usr/local/lib/python3.6/dist-packages/bllipparser/RerankingParser.py", line 169, in head
    return self.__class__(self._tree.headTree())
  File "/usr/local/lib/python3.6/dist-packages/bllipparser/RerankingParser.py", line 39, in __init__
    input_tree_or_string)
TypeError: input_tree_or_string (None) must be an InputTree or string.
>>> 

is not supported?

dmcc commented 5 years ago

Thanks for the very thorough report! I'm afraid my answers to both are not great, since my knowledge of this code is quickly rotting. I'll have to leave this open and see if I can dig into it more later, but here's some information for the curious/brave:

For the first part, it looks like there's a memory error in my SWIG wrapper, unfortunately. I haven't been able to track it down, but you can likely get around it by unwrapping the list comprehension into a for-loop. A lightly tested (possible) fix is to add a disown() after the inputTreeFromString call in RerankingParser.py:

        input_tree_or_string = \
            parser.inputTreeFromString(input_tree_or_string)
        input_tree_or_string.this.disown() # <<< ADD THIS LINE

For the second part (parse.ptb_parse.head()) -- agreed, this should totally work. Internally, there are different ways of constructing the underlying InputTree objects, and it seems like the one straight from the parser hasn't run the headfinder for some reason. When you reread them from a string, that activates head percolation. I'm guessing the parser only cares about the headfinder for training, which is where it loads trees from strings instead of constructing them on its own.
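
Putting the two suggestions together, a minimal sketch of the for-loop workaround might look like the following. `collect_heads` is a hypothetical helper, not part of bllipparser, and the Tree class is passed in as `tree_cls` only so the snippet runs without the library installed. It round-trips each parse through its string form (so head percolation runs) and stores every parent Tree next to its head. Keeping the parent alive is an assumption: the backtrace crashes while printing the heads, which would be consistent with a head pointing into a parent tree that has already been freed, but that is unconfirmed.

```python
def collect_heads(parser, text, tree_cls):
    """Return (tree, head) pairs for every n-best parse of `text`.

    parser   -- a loaded RerankingParser (anything with a .parse(text) method)
    tree_cls -- bllipparser.Tree, injected by the caller (hypothetical
                parameter, used so this sketch has no import-time dependency)
    """
    pairs = []
    for parse in parser.parse(text):
        # Re-read the tree from its PTB string; per the thread, trees built
        # from strings have head information, unlike parse.ptb_parse itself.
        tree = tree_cls(str(parse.ptb_parse))
        # Store the parent together with its head so the parent is not
        # garbage-collected while the head is still in use (assumption).
        pairs.append((tree, tree.head()))
    return pairs
```

With the real library this would be called as `collect_heads(parser, 'singing stars', Tree)`, using the parser and Tree from the report above. Whether it actually avoids the segfault is untested.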