Rewrite TreebankFormatParser using JParsec

bethard commented 9 years ago

Original issue 315 created by ClearTK on 2012-07-25T00:58:05.000Z:

JParsec (http://jparsec.codehaus.org/) is a nice parser-combinator library for writing recursive-descent parsers. We should rewrite the whole TreebankFormatParser using this approach, which should get rid of a lot of the hackiness. Here's some sample code to get started on this:

// create tokenizer that splits into parentheses and words
Terminals parens = Terminals.operators("(", ")");
Parser<?> tagOrWordParser = Scanners.pattern(Patterns.regex("[^\\s()]+"), "tag-or-word").source();
Parser<_> commentParser = Scanners.lineComment("*x*");
Parser<_> whitespaceParser = Scanners.many(CharPredicates.IS_WHITESPACE);
Parser<_> nonTokensParser = Parsers.or(whitespaceParser, commentParser);
Parser<?> tokensParser = Parsers.or(parens.tokenizer(), tagOrWordParser);
Parser<List<Token>> tokenizer = tokensParser.lexer(nonTokensParser);

// create parser that converts "(tag word)" into leaf TreebankNodes
Parser<Token> openParenParser = parens.token("(");
Parser<Token> closeParenParser = parens.token(")");
Parser<String> tagParser = Parsers.tokenType(String.class, "tag");
Parser<String> wordParser = Parsers.tokenType(String.class, "word");
Parser<TreebankNode> leafNodeParser = Parsers.sequence(
    openParenParser.next(tagParser),
    wordParser.followedBy(closeParenParser),
    new Map2<String, String, TreebankNode>() {
      @Override
      public TreebankNode map(String tag, String word) {
        TreebankNode node = new TreebankNode(jCas);
        node.setNodeType(tag);
        node.setNodeValue(word);
        node.setLeaf(true);
        return node;
      }
    });

// create a parser that converts "(tag ...)" into branch TreebankNodes
Parser.Reference<TreebankNode> nodeRef = Parser.newReference();
Parser<TreebankNode> branchNodeParser = Parsers.sequence(
    openParenParser.next(tagParser),
    nodeRef.lazy().many1().followedBy(closeParenParser),
    new Map2<String, List<TreebankNode>, TreebankNode>() {
      @Override
      public TreebankNode map(String tag, List<TreebankNode> children) {
        System.err.printf("(%s ...)\n", tag);
        return null;
      }
    });

// create a parser for leaf nodes or branch nodes (and update its use above)
Parser<TreebankNode> nodeParser = Parsers.or(leafNodeParser, branchNodeParser);
nodeRef.set(nodeParser);

// create a parser that first tokenizes and then parses to TreebankNodes
Parser<TreebankNode> parser = nodeParser.from(tokenizer);

System.err.println(parser.parse("(ghi (abc def) (1 2))"));
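
For comparison, here is a rough plain-Java sketch of the grammar the combinators above are meant to encode: node := "(" tag ( word | node+ ) ")". It does not use JParsec or the real TreebankNode; SketchNode and TreebankSketch are hypothetical names made up for illustration, and it assumes there are no "*x*" comment lines in the input. It is only meant to pin down the expected tree shape, not to be the final implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for TreebankNode (not the ClearTK type).
final class SketchNode {
  final String tag;
  final String word; // non-null only for leaf nodes
  final List<SketchNode> children = new ArrayList<>();

  SketchNode(String tag, String word) {
    this.tag = tag;
    this.word = word;
  }

  boolean isLeaf() {
    return word != null;
  }

  @Override
  public String toString() {
    if (isLeaf()) {
      return "(" + tag + " " + word + ")";
    }
    StringBuilder sb = new StringBuilder("(").append(tag);
    for (SketchNode child : children) {
      sb.append(' ').append(child);
    }
    return sb.append(')').toString();
  }
}

// Hypothetical hand-written recursive-descent parser for the same grammar.
final class TreebankSketch {
  private final List<String> tokens;
  private int pos = 0;

  private TreebankSketch(List<String> tokens) {
    this.tokens = tokens;
  }

  private static boolean isParen(char c) {
    return c == '(' || c == ')';
  }

  // Split into "(", ")" and runs of characters that are neither space nor paren.
  static List<String> tokenize(String text) {
    List<String> out = new ArrayList<>();
    int i = 0;
    while (i < text.length()) {
      char c = text.charAt(i);
      if (Character.isWhitespace(c)) {
        i++;
      } else if (isParen(c)) {
        out.add(String.valueOf(c));
        i++;
      } else {
        int start = i;
        while (i < text.length()
            && !Character.isWhitespace(text.charAt(i))
            && !isParen(text.charAt(i))) {
          i++;
        }
        out.add(text.substring(start, i));
      }
    }
    return out;
  }

  // Parse exactly one tree; anything left over is an error.
  static SketchNode parse(String text) {
    TreebankSketch parser = new TreebankSketch(tokenize(text));
    SketchNode root = parser.node();
    if (parser.pos != parser.tokens.size()) {
      throw new IllegalArgumentException("trailing tokens at index " + parser.pos);
    }
    return root;
  }

  private String next() {
    if (pos >= tokens.size()) {
      throw new IllegalArgumentException("unexpected end of input");
    }
    return tokens.get(pos++);
  }

  private String nextNonParen(String what) {
    String token = next();
    if (token.equals("(") || token.equals(")")) {
      throw new IllegalArgumentException("expected " + what + " but found " + token);
    }
    return token;
  }

  private void expect(String expected) {
    String actual = next();
    if (!actual.equals(expected)) {
      throw new IllegalArgumentException("expected " + expected + " but found " + actual);
    }
  }

  // node := "(" tag ( word | node+ ) ")"
  private SketchNode node() {
    expect("(");
    String tag = nextNonParen("tag");
    SketchNode result;
    if (pos < tokens.size() && tokens.get(pos).equals("(")) {
      result = new SketchNode(tag, null);
      while (pos < tokens.size() && tokens.get(pos).equals("(")) {
        result.children.add(node());
      }
    } else {
      result = new SketchNode(tag, nextNonParen("word"));
    }
    expect(")");
    return result;
  }
}
```

Calling TreebankSketch.parse on the sample string above yields a branch node tagged "ghi" with two leaf children, which is the shape the JParsec branchNodeParser would need to build once its map() stops returning null.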

bethard commented 9 years ago

Comment #1 originally posted by ClearTK on 2012-07-25T00:58:51.000Z:

Issue 50 has been merged into this issue.

bethard commented 9 years ago

Comment #2 originally posted by ClearTK on 2013-02-17T18:00:28.000Z:

<empty>

bethard commented 9 years ago

Comment #3 originally posted by ClearTK on 2013-05-03T08:44:33.000Z:

<empty>

bethard commented 9 years ago

Comment #4 originally posted by ClearTK on 2013-05-03T08:50:11.000Z:

<empty>

bethard commented 9 years ago

Comment #5 originally posted by ClearTK on 2014-03-15T17:41:52.000Z:

<empty>

ClearTK / cleartk

Rewrite TreebankFormatParser using JParsec #313