Instagram / LibCST

A concrete syntax tree parser and serializer library for Python that preserves many aspects of Python's abstract syntax tree
https://libcst.readthedocs.io/
Other
1.55k stars 191 forks source link

[Question] Similarity between CST #404

Closed StevanCakic closed 4 years ago

StevanCakic commented 4 years ago

Hi everyone. I want to ask a question, is there any function in libcst or combination with some other lib (for example diffflib) to check code similarity? CTS is exactly what I need, preserving all whitespaces, newlines, etc. with code context

import libcst as cst

code1 = ""
code2 = ""
with open("file_to_test.py") as f1:
    code1 = f1.read()
with open("file_to_test2.py") as f2:
    code2 = f2.read()
try:
    tree1 = cst.parse_module(code1)
    tree2 = cst.parse_module(code2)
    # warning, example of pseudo code for similary
    cst.similarity(tree1, tree2) # and get result for example 0.85, or 85%

Is it possible to use matchers for this purpose? Currently, my code comparing str(tree1) and str(str2) using difflib.SequenceMatcher but I think maybe a better solution exists. Problem is, for example when I have unused variables, similarity drops at a high rate. I hope this is the right place to ask this kind of question :)

thatch commented 4 years ago

This is a hard problem if you're trying to be insensitive to reordering/variable names (I'm assuming you are either doing this for plagiarism/open source origin detection, or against obfuscated code). The additional structure that LibCST provides may be distracting in this case.

Do you have a more limited subset of "code similarity" that you're trying to tackle? I would suggest that comparing a filtered (e.g. any string -> "x") tokenize stream with difflib is a simple solution that behaves reasonably, but tends toward false positives.

StevanCakic commented 4 years ago

@thatch Thank you for an answer. Yes, you are right. I'm planning to use something like a plagiarism detector, more precisely to cluster similar codes. I think you are right with tokenization. About limitation, most of the files are Python files and it's ok not to detect variables reordering. What is maybe interesting is that I also want to detect similarity for spaces, newlines, etc.

My initial plan was to upload, for example, 100 Python files, and to cluster them in x clusters. It can maybe work with K-nn but with a lot of features K-nn is not good, and I couldn't know how many clusters can be in advance.

thatch commented 4 years ago

I haven't seen any literature that focuses on coding-specific tree comparisons, and all of my personal experiments that were performant used streams. I believe that's what efficient ML techniques use as well.

Note that tokenize preserves e.g. INDENT as a psuedo-token, so changes in indentation do cause a (slight) change in the token stream. I've also used compile() and looked at the raw opcodes, which is a little less sensitive to renames (but still sensitive to reordering). Something like sanitizing strings and then using n-grams is the easiest to understand for a bulk metric.

You could certainly code up something to compare parsed trees, but this would get out of hand fast. Let me know if you come up with anything that works even moderately well (e.g. LSH adapted to trees). I'm going to close this out, as I don't think it's a feature LibCST is likely to get, but feel free to continue the conversation while closed.