Large output dump to stderr when parsing a "word" with a space in it.

halfak commented 7 years ago

If I parse a sentence in which a single token contains a space (e.g. "tea pot" below), I get a huge dump to stderr of lines starting with "## ", roughly 5k lines in total. I still end up with an acceptable parse, but I'd like to be able to turn off this output.

$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26) 
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bllipparser import RerankingParser
>>> rp = RerankingParser.fetch_and_load('WSJ-PTB3')
>>> rp.parse(["I", "am", "a", "teapot"])[0]
ScoredParse('(S1 (S (NP (PRP I)) (VP (VBP am) (NP (DT a) (NN teapot)))))', parser_score=-56.08248356757865, reranker_score=-14.256298669561028)
>>> rp.parse(["I", "am", "a", "tea pot"])[0]
## preterms = ((PRP i) (VBP am) (DT a) (NN tea pot))
## tp = (S1 (S (NP (PRP i)) (VP (VBP am) (NP (DT a) (NN tea pot)))))
## preterms = ((PRP i) (VBP am) (DT a) (JJ tea pot))

... <snip about 5k lines> ...

## tp = (S1 (S (NP (PRP i)) (VP (VBP am) (SBAR (S (NP (NP (DT a)) (NNS tea pot)))))))
## preterms = ((PRP i) (VBP am) (DT a) (JJ tea pot))
## tp = (S1 (S (NP (NP (PRP i)) (VP (VBP am) (NP (DT a) (JJ tea pot))))))
ScoredParse('(S1 (S (NP (PRP I)) (VP (VBP am) (NP (DT a) (NN tea pot)))))', parser_score=-56.08248356757865, reranker_score=-16.96028206416102)

I realize that I'm doing some things that the parser wasn't intended to do, but I'm working with non-clean data, so I'm checking how it behaves if we get something unexpected.

dmcc commented 7 years ago

As far as I know, the Penn Treebank "format" (inasmuch as it is an official format) assumes that no token contains whitespace. I don't think (JJ tea pot) is actually a valid PTB tree, despite being produced by the parser, and it may break other PTB-consuming tools. FWIW, the input validator I was planning to write to fix #49 would also reject tokens that contain whitespace or consist entirely of it.

Workarounds:

1. Your best bet (in terms of parsing accuracy) is to remove the spaces before they get to the parser, if that's at all possible (splitting each token with token.split() should be fine). Is this feasible for your application?
2. If you want to parse non-clean data in this form, I recommend at least escaping the space in some fashion (e.g., replacing it with three underscores). Note that the parser and reranker will treat the token "tea pot" as an unknown word even if they have seen the words "tea" and "pot" before, which will hurt accuracy.
3. The dump comes from the reranker, so you might be able to avoid seeing it by turning off the reranker. However, this will also hurt parse quality.
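
The first two workarounds can be sketched in plain Python, applied to the token list before it is handed to the parser. This is only an illustration: the helper names split_on_whitespace and escape_whitespace are made up for this example and are not part of bllipparser, and the "___" placeholder is just the three-underscore suggestion from above.

```python
def split_on_whitespace(tokens):
    """Workaround 1: break any token containing whitespace into several tokens.

    Hypothetical helper, not part of bllipparser.
    """
    flat = []
    for token in tokens:
        # str.split() with no arguments splits on runs of whitespace and
        # drops empty strings, so whitespace-only tokens disappear.
        flat.extend(token.split())
    return flat


def escape_whitespace(tokens, placeholder="___"):
    """Workaround 2: keep each token whole, replacing internal whitespace.

    Hypothetical helper, not part of bllipparser. Leading and trailing
    whitespace is stripped; whitespace-only tokens are dropped.
    """
    pieces_per_token = (token.split() for token in tokens)
    return [placeholder.join(pieces) for pieces in pieces_per_token if pieces]


tokens = ["I", "am", "a", "tea pot"]
print(split_on_whitespace(tokens))  # ['I', 'am', 'a', 'tea', 'pot']
print(escape_whitespace(tokens))    # ['I', 'am', 'a', 'tea___pot']
```

Either result can then be passed to rp.parse() as in the original report. Splitting changes the number of tokens, so any per-token alignment with the raw data has to be tracked separately; escaping preserves the token count, at the cost of producing a word the parser has never seen.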

dmcc commented 7 years ago

Added an input validator: f01ade870c39054a116531bf07c35d78ae46cedd, so tokens containing spaces will now be caught before there's a segfault. Marking this as closed for now, but please reopen if any of the three workarounds don't work for you.
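
For anyone who wants the same kind of check on their own side before calling the parser, a rough sketch follows. It is not the code from the commit above; validate_tokens is a hypothetical name, and the rule it enforces (no empty tokens, no whitespace inside a token) is an assumption based on the description in this thread.

```python
def validate_tokens(tokens):
    """Return the tokens as a list, or raise if any token is unusable.

    Hypothetical pre-check, not the validator shipped in bllipparser.
    """
    checked = []
    for index, token in enumerate(tokens):
        if not isinstance(token, str):
            raise TypeError(
                "token {} is {!r}, expected a string".format(index, token))
        # Reject empty tokens and tokens containing any whitespace character.
        if token == "" or any(ch.isspace() for ch in token):
            raise ValueError(
                "token {} ({!r}) is empty or contains whitespace".format(
                    index, token))
        checked.append(token)
    return checked


validate_tokens(["I", "am", "a", "teapot"])  # passes
try:
    validate_tokens(["I", "am", "a", "tea pot"])
except ValueError as error:
    print(error)
```

Raising early like this gives a Python-level error message that names the offending token.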

BLLIP / bllip-parser, issue #50