Open jowagner opened 11 years ago
It's a problem related to the PTB-style bracketed format.
Round brackets should be escaped with special symbols before training on or parsing a file; usually -LRB- is used for ( and -RRB- for ). This is the same for every parser that uses this format.
By doing so, you prevent malformed trees and spurious ambiguities.
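To illustrate the escaping step, here is a minimal Python sketch of a pre-processing helper. The function name `escape_brackets` and the character-level replacement strategy are assumptions made for illustration; they are not part of Lorg or any other parser, and only the two round-bracket escapes mentioned above are covered.

```python
# Conventional PTB-style escapes for round brackets. Other bracket types
# are deliberately left out of this sketch.
PTB_ESCAPES = {"(": "-LRB-", ")": "-RRB-"}


def escape_brackets(sentence: str) -> str:
    """Replace every round bracket in a tokenised sentence with its escape.

    Hypothetical helper: it works character by character, so a bracket
    glued to a word (as in "I)") is escaped as well.
    """
    return "".join(PTB_ESCAPES.get(ch, ch) for ch in sentence)


if __name__ == "__main__":
    print(escape_brackets("He left ( quietly ) ."))
    print(escape_brackets("tree I)"))
```

Running the input through such a step before training or parsing means no literal round bracket can end up inside a token of the bracketed output.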
I see that section 5.2 of the Lorg readme specifies that input brackets should be escaped. However, since no error message is printed and the parser output is easy to read for the typical isolated brackets (the first character after the space after a preterminal must be part of the token), the issue is often overlooked by users.
The idea of this feature request is to eventually make this the default behavior (after first testing the output format with other software and then deprecating the old format, printing warning messages that the new format will soon be the default).
I got the idea from the Susanne treebank examples (they have spaces around all closing brackets). This would fix all output ambiguity, as tokens cannot have trailing whitespace. (PTB format allows spaces within tokens.)
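As a rough sketch of what such output could look like, the Python snippet below serialises a small tree with a space before every structural closing bracket. The tree representation (a `(label, children)` tuple with plain strings as tokens) and the name `to_spaced_sexpr` are assumptions made for this illustration; the exact layout Lorg would print is not specified here.

```python
def to_spaced_sexpr(node):
    """Serialise a tree, putting a space before each structural ')'.

    Hypothetical representation: a node is either a token (str) or a
    (label, children) tuple. Tokens are written unchanged, so a bracket
    that belongs to a token is never preceded by a space.
    """
    if isinstance(node, str):
        return node
    label, children = node
    inner = " ".join(to_spaced_sexpr(child) for child in children)
    return f"({label} {inner} )"


if __name__ == "__main__":
    tree = ("NP", [("NN", ["tree"]), ("PRP", ["I)"])])
    print(to_spaced_sexpr(tree))  # (NP (NN tree ) (PRP I) ) )
```

In the printed line, the `)` glued to `I` is part of the token, while every `)` preceded by a space closes a constituent.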
Extra space would help to recognise tokens correctly, for example for the parser output
it's difficult to build the tree if not using the "### sentence:" line as a reference. In the above example, the s-expression is even ambiguous: The sentence could also have been "tree I)", in which case the NP is attached one level higher (to illustrate, I replace the round bracket as part of a token with a square bracket):
vs.