ClearTK / cleartk

Machine learning components for Apache UIMA
http://cleartk.github.io/cleartk/

Rewrite TreebankFormatParser using JParsec #313

Open bethard opened 9 years ago

bethard commented 9 years ago

Original issue 315 created by ClearTK on 2012-07-25T00:58:05.000Z:

JParsec (http://jparsec.codehaus.org/) is a nice parser combinator library for writing recursive-descent parsers. We should rewrite the whole TreebankFormatParser with it, which should get rid of a lot of the hackiness. Here's some sample code to get started:

// create tokenizer that splits into parentheses and words
Terminals parens = Terminals.operators("(", ")");
Parser<?> tagOrWordParser = Scanners.pattern(Patterns.regex("[^\\s()]+"), "tag-or-word").source();
Parser<_> commentParser = Scanners.lineComment("*x*");
Parser<_> whitespaceParser = Scanners.many(CharPredicates.IS_WHITESPACE);
Parser<_> nonTokensParser = Parsers.or(whitespaceParser, commentParser);
Parser<?> tokensParser = Parsers.or(parens.tokenizer(), tagOrWordParser);
Parser<List<Token>> tokenizer = tokensParser.lexer(nonTokensParser);

// create parser that converts "(tag word)" into leaf TreebankNodes
Parser<Token> openParenParser = parens.token("(");
Parser<Token> closeParenParser = parens.token(")");
Parser<String> tagParser = Parsers.tokenType(String.class, "tag");
Parser<String> wordParser = Parsers.tokenType(String.class, "word");
Parser<TreebankNode> leafNodeParser = Parsers.sequence(
    openParenParser.next(tagParser),
    wordParser.followedBy(closeParenParser),
    new Map2<String, String, TreebankNode>() {
      @Override
      public TreebankNode map(String tag, String word) {
        TreebankNode node = new TreebankNode(jCas);
        node.setNodeType(tag);
        node.setNodeValue(word);
        node.setLeaf(true);
        return node;
      }
    });

// create a parser that converts "(tag ...)" into branch TreebankNodes
Parser.Reference<TreebankNode> nodeRef = Parser.newReference();
Parser<TreebankNode> branchNodeParser = Parsers.sequence(
    openParenParser.next(tagParser),
    nodeRef.lazy().many1().followedBy(closeParenParser),
    new Map2<String, List<TreebankNode>, TreebankNode>() {
      @Override
      public TreebankNode map(String tag, List<TreebankNode> children) {
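        // placeholder: only logs the tag for now; a real implementation
        // should build a branch TreebankNode from tag and children
        System.err.printf("(%s ...)\n", tag);
        return null;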
      }
    });

// create a parser for leaf nodes or branch nodes (and update its use above)
Parser<TreebankNode> nodeParser = Parsers.or(leafNodeParser, branchNodeParser);
nodeRef.set(nodeParser);

// create a parser that first tokenizes and then parses to TreebankNodes
Parser<TreebankNode> parser = nodeParser.from(tokenizer);

System.err.println(parser.parse("(ghi (abc def) (1 2))"));
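
For comparison, the grammar the sample above targets (a node is either "(tag word)" or "(tag node+)") can also be sketched as a hand-written recursive-descent parser in plain Java, without JParsec. This is only an illustration of the intended tree shape, not the proposed implementation: `TreebankSketch` and its `Node` class are hypothetical stand-ins for TreebankFormatParser and TreebankNode, no JCas is involved, and the `*x*` line-comment handling from the tokenizer above is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TreebankSketch {

  /** Hypothetical stand-in for ClearTK's TreebankNode (no UIMA JCas needed). */
  static final class Node {
    final String tag;
    final String word; // null for branch nodes
    final List<Node> children = new ArrayList<>();

    Node(String tag, String word) {
      this.tag = tag;
      this.word = word;
    }

    boolean isLeaf() {
      return word != null;
    }

    @Override
    public String toString() {
      if (isLeaf()) {
        return "(" + tag + " " + word + ")";
      }
      StringBuilder sb = new StringBuilder("(").append(tag);
      for (Node child : children) {
        sb.append(' ').append(child);
      }
      return sb.append(')').toString();
    }
  }

  // Same token classes as the sample: "(", ")", or a run of
  // characters that are neither whitespace nor parentheses.
  private static final Pattern TOKEN = Pattern.compile("[()]|[^\\s()]+");

  private final List<String> tokens = new ArrayList<>();
  private int pos = 0;

  private TreebankSketch(String text) {
    Matcher m = TOKEN.matcher(text);
    while (m.find()) {
      tokens.add(m.group());
    }
  }

  /** Parses exactly one bracketed tree and rejects trailing tokens. */
  static Node parse(String text) {
    TreebankSketch p = new TreebankSketch(text);
    Node root = p.node();
    if (p.pos != p.tokens.size()) {
      throw new IllegalArgumentException("unexpected token at index " + p.pos);
    }
    return root;
  }

  // node := "(" tag word ")" | "(" tag node+ ")"
  private Node node() {
    expect("(");
    String tag = tagOrWord();
    if (!"(".equals(peek())) {
      String word = tagOrWord();
      expect(")");
      return new Node(tag, word);
    }
    Node branch = new Node(tag, null);
    while ("(".equals(peek())) {
      branch.children.add(node()); // recursive descent into each child
    }
    expect(")");
    return branch;
  }

  private String peek() {
    return pos < tokens.size() ? tokens.get(pos) : null;
  }

  private String next() {
    if (pos >= tokens.size()) {
      throw new IllegalArgumentException("unexpected end of input");
    }
    return tokens.get(pos++);
  }

  private String tagOrWord() {
    String token = next();
    if (token.equals("(") || token.equals(")")) {
      throw new IllegalArgumentException("expected tag or word but found " + token);
    }
    return token;
  }

  private void expect(String expected) {
    String token = next();
    if (!token.equals(expected)) {
      throw new IllegalArgumentException("expected " + expected + " but found " + token);
    }
  }

  public static void main(String[] args) {
    System.out.println(parse("(ghi (abc def) (1 2))"));
  }
}
```

Unlike the JParsec version, which tries the leaf parser first and falls back to the branch parser, this sketch decides between leaf and branch by peeking at the token after the tag.
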
bethard commented 9 years ago

Comment #1 originally posted by ClearTK on 2012-07-25T00:58:51.000Z:

Issue 50 has been merged into this issue.

bethard commented 9 years ago

Comment #2 originally posted by ClearTK on 2013-02-17T18:00:28.000Z:

<empty>

bethard commented 9 years ago

Comment #3 originally posted by ClearTK on 2013-05-03T08:44:33.000Z:

<empty>

bethard commented 9 years ago

Comment #4 originally posted by ClearTK on 2013-05-03T08:50:11.000Z:

<empty>

bethard commented 9 years ago

Comment #5 originally posted by ClearTK on 2014-03-15T17:41:52.000Z:

<empty>