JoaoFelipe / apted

Python APTED algorithm for the Tree Edit Distance
MIT License
86 stars 13 forks source link

Negative edit distances #5

Closed yanamal closed 1 month ago

yanamal commented 4 years ago

I am trying to use this library to compare ASTs of python code (specifically, I'm comparing an incorrect solution to a problem with several correct solutions, to pick a closest one).

I created a custom Config to specify how to get children and node names out of an ast node(below). This seems to work, except that I sometimes get negative edit distances; and other times I get edit distances that are clearly too small.

Below is a complete example, including all my logic for dealing with python ASTs. It returns -8 for the edit distance. Please advise.

import ast
from apted import APTED, Config

# for the purpose of tree differencing, we consider both the ast.Node objects and various literals representing values, ids, etc. to be nodes in the AST.
# therefore, we assume that all nodes that are passed to functions in astConfig are either ast.Node objects, or literals of some kind
class AstConfig(Config):
    def get_node_name(node):
        if hasattr(node, '_fields'):
            # this is an ast.Node object - return its type
            return type(node).__name__
        else:
            # this must be a literal, e.g. variable name, const value, etc. 
            return str(node)

    def children(self, node):
        if hasattr(node, '_fields'):
            # this is a Node object. It has all its children listed in _fields.
            # a child could be:
            # a) a literal, e.g. the name of the variable
            # b) a list of nodes, e.g. list of expressions in the body of a function
            # c) a direct child node.
            # we combine these into one flat list. 
            children = []
            for childname in node._fields:
                child = getattr(node, childname)
                if isinstance(child, list):
                    children.extend(child)
                else:
                    children.append(child)
            return children
        else:
            # this node itself is a literal; it has no children.
            return []

    def rename(self, node1, node2):
        name1 = AstConfig.get_node_name(node1)
        name2 = AstConfig.get_node_name(node2)
        return 1 if name1 != name2 else 0

# test apted+ast:
p1 = ast.parse('''def helloWorld():
    print ('Hello World!')''')
p2 = ast.parse('''def helloWorld():
    return 'Hello World!'
''')
apted = APTED(p1, p2, AstConfig())
print(apted.compute_edit_distance())
print(apted.compute_edit_mapping())

The mappings do seem to be reasonable, and picking the smallest edit distance does seem to choose the closest correct solution; so whatever is going on, it at least seems to be consistent across different pairings.

JoaoFelipe commented 4 years ago

Hi @yanamal,

I executed here and the result was 5, with a mapping that seems reasonable for this result:

5
[
    (<_ast.Module object at 0x000002968C462D08>, <_ast.Module object at 0x000002968C4F6EC8>), 
    (<_ast.FunctionDef object at 0x000002968C503C48>, <_ast.FunctionDef object at 0x000002968C4F6D48>),
    (<_ast.Expr object at 0x000002968C503C08>, None),  # delete + 1 
    ('helloWorld', 'helloWorld'), 
    (<_ast.arguments object at 0x000002968C503248>, <_ast.arguments object at 0x000002968C25B3C8>),
    (None, None),
    (None, None),
    (<_ast.Call object at 0x000002968C503848>, <_ast.Return object at 0x000002968C33E808>), # rename + 1 
    (<_ast.Name object at 0x000002968C503648>, None), # delete + 1 
    (<_ast.Load object at 0x000002968A3D1C88>, None), # delete + 1
    ('print', None), # delete + 1 
    (<_ast.Str object at 0x000002968C503688>, <_ast.Str object at 0x000002968C33E5C8>), 
    ('Hello World!', 'Hello World!'),
    (None, None)
]

What is your Python and APTED versions? Can you post your log with the mapping?

Since the edit distance is calculated by the sum of the methods delete, insert, and rename in Config, I don't see how it is possible to return a negative value. Can you double check that your ASTConfig does not implement these methods returning negative values?

Another thing that may cause inconsistencies is the presence of singletons in AST nodes (if I remember it correctly, the expr_contexts are singletons), since a shared node may break the tree structure of the AST. I don't think it would cause problems for APTED, but you may want to create a copy of the children list, just in case.

yanamal commented 4 years ago

My Python version is 3.8.2 and my apted version is 1.0.3 (I just did a fresh pip install today).

See output below - ast_test.py is the same exact code in the original bug report. It looks like the edit script is the same as the one you got.

I did look over my custom AstConfig, but the only value I changed was for the rename, and it can only return 1 or 0. Just for sanity, I used the same exact expression as in your example (return 1 if name1 != name2 else 0). As a sanity check, I also tried copy-pasting the other two (delete/insert) out of the original Config class (https://github.com/JoaoFelipe/apted/blob/master/apted/config.py#L34), but that didn't change anything.

Thanks for the pointer about expr_context being singletons, I'll look into that. I do agree that it shouldn't affect APTED, since it doesn't seem to refer to the nodes directly, and my children() override doesn't seem to assume they are unique objects.

Yanas-MBP-2:python-predict yanamal$ pip show apted
Name: apted
Version: 1.0.3
Summary: APTED algorithm for the Tree Edit Distance
Home-page: https://github.com/JoaoFelipe/apted
Author: Joao Pimentel
Author-email: joaofelipenp@gmail.com
License: MIT
Location: /Users/yanamal/.pyenv/versions/3.8.2/lib/python3.8/site-packages
Requires: 
Required-by: 
Yanas-MBP-2:python-predict yanamal$ pyenv local
3.8.2
Yanas-MBP-2:python-predict yanamal$ python ast_test.py 
-8
[
(<_ast.Module object at 0x10f0dde20>, <_ast.Module object at 0x10f0b3130>), 
(<_ast.FunctionDef object at 0x10f0dde80>, <_ast.FunctionDef object at 0x10f0b3100>), 
(<_ast.Expr object at 0x10f0eb700>, None), 
('helloWorld', 'helloWorld'), 
(<_ast.arguments object at 0x10f0eb6d0>, <_ast.arguments object at 0x10f0b3910>), 
(None, None), 
(None, None), 
(<_ast.Call object at 0x10f0eb730>, <_ast.Return object at 0x10f0b3730>), 
(<_ast.Name object at 0x10f0eb760>, None), 
(<_ast.Load object at 0x10f0cfb20>, None), 
('print', None), 
(<_ast.Constant object at 0x10ebba9a0>, <_ast.Constant object at 0x10f0b3700>),
('Hello World!', 'Hello World!'), 
(None, None), 
(None, None), 
(None, None)
]
yanamal commented 4 years ago

Just tried switching to python 3.6.

That fixed the problem - the edit distance is now shown as 5!

Not sure whether something about the ast module changed between python versions, or something that's inside APTED. But it is indeed pretty alarming to see negative distances.

JoaoFelipe commented 4 years ago

I'm mostly sure that the problem in Python 3.8 is the presence of singletons/multitons in the AST that makes it not follow a proper Tree structure (as required by the APTED algorithm).

So, to workaround this limitation, I made a proxy class (Tree) that unfolds the AST "graph" into an actual Tree, and it is working fine now:

import ast
from apted import APTED, Config

# for the purpose of tree differencing, we consider both the ast.Node objects and various literals representing values, ids, etc. to be nodes in the AST.
# therefore, we assume that all nodes that are passed to functions in astConfig are either ast.Node objects, or literals of some kind

class Tree(object):

    def __init__(self, node):
        self.node = node

    @property
    def children(self):
        if hasattr(self.node, '_fields'):
            # this is a Node object. It has all its children listed in _fields.
            # a child could be:
            # a) a literal, e.g. the name of the variable
            # b) a list of nodes, e.g. list of expressions in the body of a function
            # c) a direct child node.
            # we combine these into one flat list. 
            children = []
            for childname in self.node._fields:
                child = getattr(self.node, childname)
                if isinstance(child, list):
                    children.extend(child)
                else:
                    children.append(child)
            return list(map(Tree, children))
        else:
            # this node itself is a literal; it has no children.
            return []

    def __repr__(self):
        return 'T ' + repr(self.node)

class AstConfig(Config):
    def get_node_name(node):
        if node.node is None:
            return "<None>"
        if hasattr(node.node, '_fields'):
            # this is an ast.Node object - return its type
            return type(node.node).__name__
        else:
            # this must be a literal, e.g. variable name, const value, etc. 
            return str(node.node)

    def rename(self, node1, node2):
        name1 = AstConfig.get_node_name(node1)
        name2 = AstConfig.get_node_name(node2)
        return 1 if name1 != name2 else 0

# test apted+ast:
p1 = ast.parse('''def helloWorld():
    print ('Hello World!')''')
p2 = ast.parse('''def helloWorld():
    return 'Hello World!'
''')
apted = APTED(Tree(p1), Tree(p2), AstConfig())
print(apted.compute_edit_distance())
print(apted.compute_edit_mapping())