fangfangli / cleartk

Automatically exported from code.google.com/p/cleartk

Rewrite TreebankFormatParser using JParsec #315

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
JParsec (http://jparsec.codehaus.org/) is a nice library for writing
recursive-descent parsers. We should rewrite the whole TreebankFormatParser
with it, which should get rid of a lot of the hackiness. Here's some sample
code to get started:

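    // imports (assuming the JParsec 2.x package layout, org.codehaus.jparsec)
    import java.util.List;
    import org.codehaus.jparsec.*;
    import org.codehaus.jparsec.functors.Map2;
    import org.codehaus.jparsec.pattern.Patterns;

    // create tokenizer that splits into parentheses and words
    Terminals parens = Terminals.operators("(", ")");
    Parser<?> tagOrWordParser = Scanners.pattern(Patterns.regex("[^\\s()]+"), "tag-or-word").source();
    Parser<?> commentParser = Scanners.lineComment("*x*");
    // require at least one whitespace character; a zero-or-more scanner would
    // always succeed and Parsers.or would never reach the comment alternative
    Parser<?> whitespaceParser = Scanners.WHITESPACES;
    // skip any run of whitespace and comments between tokens
    Parser<?> nonTokensParser = Parsers.or(whitespaceParser, commentParser).skipMany();
    Parser<?> tokensParser = Parsers.or(parens.tokenizer(), tagOrWordParser);
    Parser<List<Token>> tokenizer = tokensParser.lexer(nonTokensParser);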

    // create parser that converts "(tag word)" into leaf TreebankNodes
    Parser<Token> openParenParser = parens.token("(");
    Parser<Token> closeParenParser = parens.token(")");
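    // both parsers accept any String token; "tag" and "word" are only the
    // names reported in error messages
    Parser<String> tagParser = Parsers.tokenType(String.class, "tag");
    Parser<String> wordParser = Parsers.tokenType(String.class, "word");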
    Parser<TreebankNode> leafNodeParser = Parsers.sequence(
        openParenParser.next(tagParser),
        wordParser.followedBy(closeParenParser),
        new Map2<String, String, TreebankNode>() {
          @Override
          public TreebankNode map(String tag, String word) {
            TreebankNode node = new TreebankNode(jCas);
            node.setNodeType(tag);
            node.setNodeValue(word);
            node.setLeaf(true);
            return node;
          }
        });

    // create a parser that converts "(tag ...)" into branch TreebankNodes
    Parser.Reference<TreebankNode> nodeRef = Parser.newReference();
    Parser<TreebankNode> branchNodeParser = Parsers.sequence(
        openParenParser.next(tagParser),
        nodeRef.lazy().many1().followedBy(closeParenParser),
        new Map2<String, List<TreebankNode>, TreebankNode>() {
          @Override
          public TreebankNode map(String tag, List<TreebankNode> children) {
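            // placeholder: only logs the tag for now; the real version should
            // build a branch TreebankNode from the tag and its children
            System.err.printf("(%s ...)\n", tag);
            return null;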
          }
        });

    // create a parser for leaf nodes or branch nodes (and update its use above)
    Parser<TreebankNode> nodeParser = Parsers.or(leafNodeParser, branchNodeParser);
    nodeRef.set(nodeParser);

    // create a parser that first tokenizes and then parses to TreebankNodes
    Parser<TreebankNode> parser = nodeParser.from(tokenizer);

    System.err.println(parser.parse("(ghi (abc def) (1 2))"));
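
For comparison, the grammar the sample code is aiming at, roughly `node := "(" tag (word | node+) ")"`, can also be sketched as a hand-written recursive-descent parser in plain Java with no JParsec dependency. This is only an illustrative sketch under stated assumptions: `TreebankSketch` and `Node` are hypothetical names (`Node` stands in for `TreebankNode` and is not the cleartk class), and `*x*` comment lines are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TreebankSketch {

  /** Hypothetical stand-in for TreebankNode: a tag plus either a word or children. */
  static final class Node {
    final String tag;
    final String word; // null for branch nodes
    final List<Node> children = new ArrayList<>();

    Node(String tag, String word) {
      this.tag = tag;
      this.word = word;
    }

    boolean isLeaf() {
      return word != null;
    }

    @Override
    public String toString() {
      if (isLeaf()) return "(" + tag + " " + word + ")";
      StringBuilder sb = new StringBuilder("(").append(tag);
      for (Node child : children) sb.append(' ').append(child);
      return sb.append(')').toString();
    }
  }

  // Tokens: a single parenthesis, or a run of characters that are neither
  // whitespace nor parentheses (same character class as the JParsec sample).
  private static final Pattern TOKEN = Pattern.compile("[()]|[^\\s()]+");

  private final List<String> tokens = new ArrayList<>();
  private int pos = 0;

  private TreebankSketch(String text) {
    Matcher m = TOKEN.matcher(text);
    while (m.find()) tokens.add(m.group());
  }

  /** Parses exactly one bracketed tree and rejects trailing tokens. */
  static Node parse(String text) {
    TreebankSketch p = new TreebankSketch(text);
    Node root = p.node();
    if (p.pos != p.tokens.size()) {
      throw new IllegalArgumentException("unexpected token after tree: " + p.tokens.get(p.pos));
    }
    return root;
  }

  // node := "(" tag ( word | node+ ) ")"
  private Node node() {
    expect("(");
    String tag = symbol();
    Node result;
    if ("(".equals(peek())) {
      result = new Node(tag, null); // branch: one or more child nodes
      while ("(".equals(peek())) result.children.add(node());
    } else {
      result = new Node(tag, symbol()); // leaf: a single word
    }
    expect(")");
    return result;
  }

  private String peek() {
    return pos < tokens.size() ? tokens.get(pos) : null;
  }

  private String next() {
    if (pos >= tokens.size()) throw new IllegalArgumentException("unexpected end of input");
    return tokens.get(pos++);
  }

  private void expect(String token) {
    String actual = next();
    if (!actual.equals(token)) {
      throw new IllegalArgumentException("expected " + token + " but found " + actual);
    }
  }

  // Reads a tag or word, i.e. any token that is not a parenthesis.
  private String symbol() {
    String token = next();
    if (token.equals("(") || token.equals(")")) {
      throw new IllegalArgumentException("expected tag or word but found " + token);
    }
    return token;
  }

  public static void main(String[] args) {
    System.out.println(parse("(ghi (abc def) (1 2))")); // prints (ghi (abc def) (1 2))
  }
}
```

The leaf/branch choice here is made by peeking one token ahead, whereas the JParsec version expresses the same choice as `Parsers.or(leafNodeParser, branchNodeParser)`.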

Original issue reported on code.google.com by steven.b...@gmail.com on 25 Jul 2012 at 12:58

GoogleCodeExporter commented 9 years ago
Issue 50 has been merged into this issue.

Original comment by steven.b...@gmail.com on 25 Jul 2012 at 12:58

GoogleCodeExporter commented 9 years ago

Original comment by lee.becker on 17 Feb 2013 at 6:00

GoogleCodeExporter commented 9 years ago

Original comment by steven.b...@gmail.com on 3 May 2013 at 8:44

GoogleCodeExporter commented 9 years ago

Original comment by steven.b...@gmail.com on 3 May 2013 at 8:50

GoogleCodeExporter commented 9 years ago

Original comment by phi...@ogren.info on 15 Mar 2014 at 5:41