Closed StevanCakic closed 4 years ago
This is a hard problem if you're trying to be insensitive to reordering/variable names (I'm assuming you are either doing this for plagiarism/open source origin detection, or against obfuscated code). The additional structure that LibCST provides may be distracting in this case.
Do you have a more limited subset of "code similarity" that you're trying to tackle? I would suggest that comparing a filtered (e.g. any string -> "x") tokenize
stream with difflib
is a simple solution that behaves reasonably, but tends toward false positives.
@thatch Thank you for an answer. Yes, you are right. I'm planning to use something like a plagiarism detector, more precisely to cluster similar codes. I think you are right with tokenization. About limitation, most of the files are Python files and it's ok not to detect variables reordering. What is maybe interesting is that I also want to detect similarity for spaces, newlines, etc.
My initial plan was to upload, for example, 100 Python files, and to cluster them in x clusters. It can maybe work with K-nn but with a lot of features K-nn is not good, and I couldn't know how many clusters can be in advance.
I haven't seen any literature that focuses on coding-specific tree comparisons, and all of my personal experiments that were performant used streams. I believe that's what efficient ML techniques use as well.
Note that tokenize preserves e.g. INDENT
as a psuedo-token, so changes in indentation do cause a (slight) change in the token stream. I've also used compile()
and looked at the raw opcodes, which is a little less sensitive to renames (but still sensitive to reordering). Something like sanitizing strings and then using n-grams is the easiest to understand for a bulk metric.
You could certainly code up something to compare parsed trees, but this would get out of hand fast. Let me know if you come up with anything that works even moderately well (e.g. LSH adapted to trees). I'm going to close this out, as I don't think it's a feature LibCST is likely to get, but feel free to continue the conversation while closed.
Hi everyone. I want to ask a question, is there any function in libcst or combination with some other lib (for example diffflib) to check code similarity? CTS is exactly what I need, preserving all whitespaces, newlines, etc. with code context
Is it possible to use matchers for this purpose? Currently, my code comparing str(tree1) and str(str2) using difflib.SequenceMatcher but I think maybe a better solution exists. Problem is, for example when I have unused variables, similarity drops at a high rate. I hope this is the right place to ask this kind of question :)