BLLIP / bllip-parser

BLLIP reranking parser (also known as Charniak-Johnson parser, Charniak parser, Brown reranking parser). See http://pypi.python.org/pypi/bllipparser/ for the Python module.
http://bllip.cs.brown.edu/

Segmentation fault when parsing non-clean sentence #49

Closed (halfak closed this issue 7 years ago)

halfak commented 8 years ago

I get a segmentation fault when parsing ["I", "am", "a", "little", "teapot", ".", " ", "What", "?"] using the WSJ-PTB3 model. See the REPL paste below.

$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26) 
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import bllipparser
>>> bllipparser.__version__
'2016.9.11'
>>> from bllipparser import RerankingParser
>>> rp = RerankingParser.fetch_and_load('WSJ-PTB3', verbose=True)
Model directory: /home/halfak/.local/share/bllipparser/WSJ-PTB3
Model directory already exists, not reinstalling
>>> rp.parse(["I", "am", "a", "little", "teapot"])[0]
ScoredParse('(S1 (S (NP (PRP I)) (VP (VBP am) (NP (DT a) (JJ little) (NN teapot)))))', parser_score=-64.30434900543281, reranker_score=-16.740114175058775)
>>> rp.parse(["I", "am", "a", "little", "teapot", ".", " ", "What", "?"])[0]
## preterms = ((PRP i) (VBP am) (DT a) (JJ little) (NN teapot) (. .) (VP vbz (NP (WP what))) (. ?))
## tp = (S1 (S (S (NP (PRP i)) (VP (VBP am) (NP (DT a) (JJ little) (NN teapot))) (. .)) (VP vbz (NP (WP what))) (. ?)))
Segmentation fault (core dumped)

I'm running Ubuntu 16.04 64-bit.

$ uname -a
Linux graphite 4.4.0-36-generic #55-Ubuntu SMP Thu Aug 11 18:01:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

I found this in /var/log/syslog:

Sep 13 20:24:13 graphite kernel: [625702.799534] python[9585]: segfault at 40 ip 00007f87e0ff21c0 sp 00007fff6cf55e20 error 4 in _JohnsonReranker.cpython-35m-x86_64-linux-gnu.so[7f87e0fa9000+92000]

dmcc commented 8 years ago

Thanks for the report!

My suspicion is that the bridge between the parser and reranker is not handling the space token correctly:

>>> rp.parse(["I", "am", "a", "little", "teapot", ".", " ", "What", "?"], rerank=False)[0]
ScoredParse('(S1 (S (S (NP (PRP I)) (VP (VBP am) (NP (DT a) (JJ little) (NN teapot))) (. .)) (VP (VBZ  ) (NP (WP What))) (. ?)))', parser_score=-128.58688297955965, reranker_score=None)
>>> rp.parse(["I", "am", "a", "little", "teapot", ".", "What", "?"])[0]
ScoredParse('(S1 (S (NP (PRP I)) (VP (VBP am) (FRAG (NP (DT a) (JJ little) (NN teapot)) (. .) (WHNP (WP What)))) (. ?)))', parser_score=-105.83222024880615, reranker_score=-30.588081364935757)
>>> rp.parse(["a", " ", "b"])
zsh: segmentation fault (core dumped)  python

I'll add an input validator to avoid future crashes, but as a workaround, I recommend removing any tokens that are purely whitespace when you're using the pre-tokenized mode (it's not clear what the part of speech for whitespace is, or the overall parse for that matter).
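
A minimal sketch of that workaround, assuming the tokens are plain Python strings. `drop_whitespace_tokens` is a hypothetical helper written for illustration, not part of bllipparser; the parser call is left as a comment because it needs a loaded model.

```python
def drop_whitespace_tokens(tokens):
    """Return the tokens with empty or whitespace-only entries removed.

    Hypothetical helper for illustration; not part of bllipparser.
    """
    # str.strip() returns '' for empty or all-whitespace strings, which is falsy.
    return [token for token in tokens if token.strip()]


tokens = ["I", "am", "a", "little", "teapot", ".", " ", "What", "?"]
clean_tokens = drop_whitespace_tokens(tokens)
print(clean_tokens)

# With a loaded RerankingParser `rp`, as in the sessions in this thread:
# rp.parse(clean_tokens)[0]
```

Filtering before the call keeps the whitespace token from ever reaching the reranker, which is where the crash was reported.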

halfak commented 8 years ago

Seems like an easy enough workaround. Thanks.

dmcc commented 7 years ago

(Finally) added an input validator: f01ade870c39054a116531bf07c35d78ae46cedd