Closed halfak closed 7 years ago
As far as I know, the Penn Treebank "format" (inasmuch as it is an official format) assumes that no token contains whitespace. I don't think `(JJ tea pot)` is actually a valid PTB tree, despite being produced by the parser, and it may break other PTB-consuming tools. FWIW, the input validator I was planning to write to fix #49 would also prevent tokens from containing (or consisting entirely of) whitespace.
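A rough sketch of what such a whitespace check could look like on the caller's side. This is not the validator from the parser itself; `find_whitespace_tokens` and `validate_tokens` are hypothetical names used only for illustration.

```python
def find_whitespace_tokens(tokens):
    """Return (index, token) pairs for tokens that are empty,
    all whitespace, or contain whitespace anywhere."""
    bad = []
    for i, tok in enumerate(tokens):
        # str.split() with no arguments splits on any whitespace run,
        # so a clean token splits to exactly [tok].
        if tok.split() != [tok]:
            bad.append((i, tok))
    return bad


def validate_tokens(tokens):
    """Raise ValueError before parsing if any token is unsafe (hypothetical helper)."""
    bad = find_whitespace_tokens(tokens)
    if bad:
        raise ValueError(f"tokens must not contain whitespace: {bad!r}")
    return tokens
```

Running a check like this before handing tokens to the parser would reject `"tea pot"` up front instead of letting it reach the native code.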
Workarounds:
- ... `token.split()` should be fine). Is this feasible for your application?
- ... `tea pot` as an unknown word, even if it has seen the words `tea` and `pot` before, which will hurt accuracy.

Added an input validator: f01ade870c39054a116531bf07c35d78ae46cedd so passing tokens with spaces will now be caught before there's a segfault. Marking this as closed for now, but please reopen if any of the three workarounds don't work for you.
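For the `token.split()` workaround, a minimal sketch of pre-splitting tokens before they reach the parser might look like this. The function name `split_whitespace_tokens` is an assumption for illustration and is not part of the parser's API.

```python
def split_whitespace_tokens(tokens):
    """Split every token on whitespace and drop empty pieces,
    so no token handed to the parser contains a space."""
    clean = []
    for tok in tokens:
        # "tea pot".split() -> ["tea", "pot"]; " ".split() -> []
        clean.extend(tok.split())
    return clean


print(split_whitespace_tokens(["a", "tea pot", " ", "sits"]))
# ['a', 'tea', 'pot', 'sits']
```

Note that this changes the number of tokens, so any positions that refer to the original token list would need to be remapped by the caller.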
If I parse a sentence that has a space in the word (e.g. "tea pot" from below), I get a huge dump to stderr of the form "##". It looks like I'm getting roughly 5k lines. I still end up with an acceptable parse. I'd like to be able to turn off this output dump.
I realize that I'm doing some things that the parser wasn't intended to do, but I'm working with non-clean data, so I'm checking how it behaves if we get something unexpected.