DatabaseGroup / apted

APTED algorithm for the Tree Edit Distance
http://tree-edit-distance.dbresearch.uni-salzburg.at/
MIT License
115 stars 34 forks source link

Support escaping curly braces in the input #1

Closed cagdasgerede closed 5 years ago

cagdasgerede commented 7 years ago

Trees with node labels that contain '{' and '}' characters cannot be directly provided to this tool at the moment. It would nice to support a way to escape these characters.

One way would be to update LblTree.fromString method so that it updated the input by replacing, for instance "{" and "}' characters with "\opencurl" and "\closecurl". Also replace "\" characters, for instance, with "\slash". Then it can continue building the associated trees.

public static LblTree fromString(String s)
    {
        int treeID = FormatUtilities.getTreeID(s);
        // Implement escaping here
        s = s.substring(s.indexOf("{"), s.lastIndexOf("}") + 1);
mateuszpawlik commented 7 years ago

I'm in the process of major code refactoring. One of the items on my list is the input parser. I will work on the bracket notation parser and try to make it more flexible. I'm thinking about something like the example by Paul Hankin from Stackoverflow. This doesn't solve the escaping but I'd like to start with a solid basis and writing a grammar sounds like one. You're of course welcome to contribute.

cagdasgerede commented 7 years ago

What type of grammar do you have in mind? If you write down the pseudo context free grammar you have in mind, I can easily implement it in a more generic, flexible parser using antlr4.

By the way, I am not sure why the notation is called the bracket notation because bracket symbols are '[' and ']', not '{' and '}'

mateuszpawlik commented 7 years ago

What type of grammar do you have in mind? If you write down the pseudo context free grammar you have in mind, I can easily implement it in a more generic, flexible parser using antlr4.

I've looked at antlr4 and it looks like a powerful tool. I'm not sure if it's not an overkill. Currently I'd like to support only the bracket notation as an input with the possibility to specify which parentheses type is used. And maybe support also escaping the parentheses symbols, as you mentioned.

Though, I have to merge in my cost-model branch and document nicely the changes such that I can explain what should be the output of parsing.

By the way, I am not sure why the notation is called the bracket notation because bracket symbols are '[' and ']', not '{' and '}'

It's only an inherited name. According to Oxford Dictionary of English on my mac:

A bracket is each of a pair of marks ( ) [ ] { } < > used to enclose words or figures so as to separate them from the context: symbols are given in brackets.

However, it may be true that < > are not used so often.

cagdasgerede commented 7 years ago

IMHO, I disagree that it is an overkill.

Brackets are not the only way to represent a tree. Someone could, for instance, use YAML format to represent a tree. There is also, for instance, lisp-style representation for trees which also use "(" and ")" but it is different than the format the tool currently supports. The tree {a{b}{c}} in the current notation would be represented as (a (b c)) in that style, not as (a(b)(c)) - simply replacing "{" with "(" would not work.

The tool could work with a custom grammar to represent the user's tree notation and an associated parser generated from the grammar via antlr. Otherwise, almost always the users would need to write a converter to convert from their representation to the bracket notation the tool supports.

mateuszpawlik commented 7 years ago

I agree and I like the idea of a unified interface. My goal is to make the tool available for various input types. You gave a few examples. Others include XML, JSON, parse trees, abstract syntax trees of source code, RNA secondary structures, mathematical equations, and more.

In the cost-model branch you can see Node class and an example of node data (StringNodeData). Maybe supporting custom grammars is not enough, and it would be nicer to give some hints of defining a grammar which result maps to Node and CustomNodeData. Do you have any ideas?

mateuszpawlik commented 7 years ago

I've merged all changes I wanted. I was reading about antlr. It is a powerful tool. I even wrote a simple grammar for the bracket notation:

grammar BracketNotation;
node : '{' LABEL node* '}' | '{' LABEL '}';
LABEL : [a-z0-9]+ ;
WS : [ \t\r\n]+ -> skip;

But I still don't understand the way from my grammar to the Node in APTED. Any thought?

cagdasgerede commented 7 years ago

I am not sure what you are asking exactly.

Are you asking the mechanics of how an antlr grammar would be used to generate a parser? Antlr generates a listener or visitor class from your grammar automatically. Then, you override the base listener to put your actions inside those enter/exit (or visit) methods. (Typical visitor/listener design pattern)

class BracketVisitor extends BracketNotationBaseListener {

  void enterNode(NodeContext ctx) {
    // node = new Node(ctx.LABEL);
    // parentNode.add(node);
    // Stack.push(parentNode);
    // parentNode = node;
  }

  void exitNode(NodeContext ctx) {
    parentNode = Stack.pop();
  }
}

Is this what you are asking?

mateuszpawlik commented 6 years ago

Are the curly braces still an issue? If yes, I have a nice C++ parser that does the job but I would have to reimplement it in Java.

mateuszpawlik commented 5 years ago

We're starting to discontinue this repository. Please refer to the C++ implementations of our algorithms.