JetBrains-Research / astminer

A library for mining of path-based representations of code (and more)
MIT License

different paths for same code content in python #205

Closed arghavanMor closed 2 years ago

arghavanMor commented 2 years ago

Hi, thank you for the tool :)

When I run astminer (in the code2vec setup) on two different files with the same content, it produces different path-contexts for each file (only a few paths are in common). However, the AST, and therefore the full set of paths, is the same for both files.

Here is the code:

def search(x, seq):
    seq = list(seq)
    if seq == []: return 0
    else:
        for i in range(len(seq)):
            if x <= seq[i]:
                return i
            else: continue
        return len(seq)

I have attached the path_contexts, node_types, paths and tokens files. I converted the path_contexts file to text so that I could upload it here.

node_types.csv paths.csv tokens.csv path_contexts.txt

SpirinEgor commented 2 years ago

Hi! Thank you for your feedback :)

The number of paths per tree grows quickly with tree size (roughly quadratically in the number of leaves). To handle this, we restrict the total number of paths with the maxPathContextsPerEntity parameter in the config. You can set it to null or remove it to save all paths from the tree.
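
To illustrate why a cap can make two identical files look different, here is a minimal Python sketch. It is not astminer's code: `select_path_contexts` is a hypothetical helper, and it assumes that the cap keeps a random subset of the path-contexts, which this thread does not confirm. It only shows that a tree with n leaves has n(n-1)/2 leaf pairs, that two capped random selections from the same tree overlap only partially, and that the uncapped selections are identical.

```python
import random
from itertools import combinations


def all_path_contexts(leaves):
    """Each unordered pair of leaves is joined by exactly one path in a tree."""
    return list(combinations(leaves, 2))


def select_path_contexts(leaves, max_per_entity=None, rng=None):
    """Hypothetical stand-in for a cap such as maxPathContextsPerEntity.

    Assumption: when the cap is set, a random subset is kept.
    With max_per_entity=None, every path-context is returned.
    """
    contexts = all_path_contexts(leaves)
    if max_per_entity is None or max_per_entity >= len(contexts):
        return contexts
    rng = rng or random.Random()
    return rng.sample(contexts, max_per_entity)


# Two "files" with the same content share the same leaves.
leaves = [f"tok{i}" for i in range(20)]
total = len(all_path_contexts(leaves))  # 20 * 19 / 2 pairs

# Capped: each file gets its own random subset of 50 path-contexts.
file_a = set(select_path_contexts(leaves, 50, random.Random(1)))
file_b = set(select_path_contexts(leaves, 50, random.Random(2)))

# Uncapped: both files yield the full, identical set.
uncapped_a = set(select_path_contexts(leaves))
uncapped_b = set(select_path_contexts(leaves))

print("total path-contexts:", total)
print("shared when capped:", len(file_a & file_b), "of", len(file_a))
print("identical when uncapped:", uncapped_a == uncapped_b)
```

Under this assumption, the capped outputs share only some path-contexts, while removing the cap makes the outputs for both files match.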

arghavanMor commented 2 years ago

Thank you for your quick response. I removed maxPathContextsPerEntity from the config, and now all paths are the same for both files :)

Thank you.