CNGLDLab / LORG-Release

LORG is an accurate natural language parser developed in the NCLT at Dublin City University with support from Enterprise Ireland. The parser employs state-of-the-art statistical techniques and a flexible architecture to facilitate adaptation to new languages and new domains.

Feature request: option to add space between token and closing bracket #2

Open jowagner opened 11 years ago

jowagner commented 11 years ago

An extra space would help to recognise tokens correctly. For example, given the parser output

### sentence: tree) I
( (S (VP (VB tree)) (NP (PRP I)))))

it's difficult to build the tree without using the "### sentence:" line as a reference. In the above example, the s-expression is even ambiguous: the sentence could also have been "tree I)", in which case the NP is attached one level higher (to illustrate, I replace the round bracket that is part of a token with a square bracket):

( (S (VP (VB tree])
         (NP (PRP I))
)))

vs.

( (S (VP (VB tree))
     (NP (PRP I]))
))

Cocophotos commented 11 years ago

It's a problem related to the PTB-style bracketed format.

Round brackets should be escaped with a special symbol before training on or parsing a file; usually this is -LRB- for ( and -RRB- for ). The same applies to every parser that uses this format.

By doing so, you prevent malformed trees and spurious ambiguities.
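As a rough illustration of the escaping step, here is a minimal Python sketch. The function names `escape_brackets` and `unescape_brackets` are hypothetical and are not part of LORG; the sketch also assumes that brackets embedded inside a token are replaced character by character, which may differ from what a particular treebank or parser expects.

```python
# Hypothetical helpers, not part of LORG: escape round brackets in raw
# text before handing it to a parser that uses PTB-style bracketing.
BRACKET_ESCAPES = {"(": "-LRB-", ")": "-RRB-"}


def escape_brackets(text: str) -> str:
    """Replace every round bracket with its PTB-style escape symbol."""
    for raw, escaped in BRACKET_ESCAPES.items():
        text = text.replace(raw, escaped)
    return text


def unescape_brackets(token: str) -> str:
    """Restore round brackets in a single token taken from parser output.

    Apply this to individual tokens only, never to a whole tree string,
    otherwise the ambiguity comes straight back.
    """
    for raw, escaped in BRACKET_ESCAPES.items():
        token = token.replace(escaped, raw)
    return token


if __name__ == "__main__":
    print(escape_brackets("tree) I"))
    print(escape_brackets("tree I)"))
```

With this step in place, the two sentences from the example above no longer produce the same bracketed output, because the token-internal bracket is never written as a plain `)`.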

jowagner commented 11 years ago

I see that section 5.2 of the Lorg readme specifies that brackets in the input should be escaped. However, since no error message is printed, and since the parser output is easy to read for the typical isolated brackets (the first character after the space after a preterminal must be part of the token), users often overlook the issue.

The idea of this feature request is to eventually make this the default behavior (after first testing the output format with other software and then deprecating the old format, with warning messages announcing that the new format will soon be the default).

I got the idea from the Susanne treebank examples (they have spaces around all closing brackets). This would fix all output ambiguity, as tokens cannot have trailing whitespace. (PTB format allows spaces within tokens.)
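To make the proposal concrete, the following Python sketch serialises the two competing trees from the example in both formats. It is only an illustration under stated assumptions: the tuple-based tree representation, the function `to_bracketed`, and the `token_space` flag are hypothetical and do not reflect LORG's actual code or options.

```python
# Hypothetical sketch, not LORG code. A tree is a (label, content) pair;
# content is a token string for preterminals, else a list of subtrees.

def to_bracketed(tree, token_space=False):
    """Serialise a tree in PTB-style brackets.

    With token_space=True, a single space is written between each token
    and the closing bracket of its preterminal.
    """
    label, content = tree
    if isinstance(content, str):  # preterminal: content is the token
        gap = " " if token_space else ""
        return f"({label} {content}{gap})"
    children = " ".join(to_bracketed(child, token_space) for child in content)
    return f"({label} {children})"


# Sentence "tree) I": the NP ends up inside the VP.
tree_a = ("", [("S", [("VP", [("VB", "tree)"),
                              ("NP", [("PRP", "I")])])])])

# Sentence "tree I)": the NP is attached one level higher, to S.
tree_b = ("", [("S", [("VP", [("VB", "tree")]),
                      ("NP", [("PRP", "I)")])])])

if __name__ == "__main__":
    for name, tree in (("a", tree_a), ("b", tree_b)):
        print(name, "old:", to_bracketed(tree))
        print(name, "new:", to_bracketed(tree, token_space=True))
```

In the current format both trees serialise to the same string, `( (S (VP (VB tree)) (NP (PRP I)))))`. With the extra space they differ: `( (S (VP (VB tree) ) (NP (PRP I )))))` versus `( (S (VP (VB tree )) (NP (PRP I) ))))`.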