SourceCode-AI / aura

Python source code auditing and static analysis on a large scale
GNU General Public License v3.0

Migrate diff functionality from git to tlsh #4

Closed RootLUG closed 3 years ago

RootLUG commented 3 years ago

Aura's diff functionality currently uses git as the underlying mechanism: it creates a repository in a temporary directory and makes two commits, one with the "left-hand side" content and one with the "right-hand side" content. The diff is then produced by git's native diffing of those two commits, which detects the changes between the two input sources.

Although this is simple, it is neither performant nor resource-efficient, since it creates several copies of the input data. The diff functionality should be migrated from git to tlsh, which is already used elsewhere in Aura. Using tlsh, we can compute similarity scores between pairs of input files and thus detect, in the same manner, which files are unchanged, changed, renamed, removed, or added. A similar approach is also used by the diffoscope project to diff input data.

RootLUG commented 3 years ago

Implemented in the dev branch. Experimentation shows that on some inputs tlsh provides a very volatile similarity estimate, in extreme cases varying by as much as +/- 40%. I decided to use the LZJD algorithm to compute the similarity ratios instead of tlsh, as it performed better and was more accurate in tests; it also means that the diff functionality does not rely on the tlsh dependency. Reference implementation: https://github.com/EdwardRaff/pyLZJD. The actual implementation in Aura is pure Python, using the algorithm from the research paper without extra optimizations.
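For readers unfamiliar with LZJD, here is a minimal pure-Python sketch of the general idea: build the set of substrings found by a Lempel-Ziv style scan, hash them, keep the `k` smallest hashes as a digest, and compare digests with a Jaccard ratio. This is not Aura's actual code. The function names, the choice of `blake2b`, the default `k`, and the simplified final comparison are all assumptions made for illustration.

```python
import hashlib


def lz_set(data: bytes) -> set:
    """Collect the distinct substrings found by a Lempel-Ziv style scan."""
    seen = set()
    start = 0
    for end in range(1, len(data) + 1):
        chunk = data[start:end]
        if chunk not in seen:
            # New substring: remember it and start a fresh one after it.
            seen.add(chunk)
            start = end
    return seen


def lzjd_digest(data: bytes, k: int = 1024) -> frozenset:
    """Hash every substring and keep only the k smallest hash values."""
    hashes = sorted(
        int.from_bytes(hashlib.blake2b(chunk, digest_size=8).digest(), "big")
        for chunk in lz_set(data)
    )
    return frozenset(hashes[:k])


def lzjd_similarity(a: bytes, b: bytes, k: int = 1024) -> float:
    """Approximate Jaccard similarity of the two digests (simplified)."""
    da, db = lzjd_digest(a, k), lzjd_digest(b, k)
    if not da and not db:
        return 1.0  # two empty inputs are treated as identical
    return len(da & db) / len(da | db)
```

Identical inputs score 1.0 and inputs with no shared substrings score 0.0; everything in between is an estimate whose quality depends on the input size and on `k`.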

Aura now also contains FileMatcher, which computes closures on the input files, e.g. finds which files were modified, added, or removed, replacing the git commits previously used for that feature. The closure is computed using similarity ratios and edit distances on filenames via SequenceMatcher from Python's native difflib module.
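A rough sketch of how such a matching step could look is given below. It is not the FileMatcher implementation: `match_files`, the result keys, the greedy pairing strategy, and the default threshold are hypothetical, and `difflib.SequenceMatcher` stands in for the content similarity as well, only to keep the example self-contained.

```python
from difflib import SequenceMatcher


def ratio(a, b) -> float:
    """Similarity in [0, 1]; autojunk is disabled so equal inputs give 1.0."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def match_files(left: dict, right: dict, threshold: float = 0.5) -> dict:
    """Greedily pair files from `left` with files from `right`.

    Both arguments map a path to its content (bytes). The threshold is an
    illustrative value, not one taken from Aura.
    """
    result = {"unchanged": [], "modified": [], "renamed": [], "removed": []}
    unmatched = dict(right)
    for path, content in left.items():
        # Rank candidates by content similarity, then by filename similarity.
        candidates = [
            (ratio(content, other), ratio(path, other_path), other_path)
            for other_path, other in unmatched.items()
        ]
        best = max(candidates, default=None)
        if best is None or best[0] < threshold:
            result["removed"].append(path)
            continue
        other_path = best[2]
        other = unmatched.pop(other_path)
        if other_path != path:
            result["renamed"].append((path, other_path))
        elif other == content:
            result["unchanged"].append(path)
        else:
            result["modified"].append(path)
    # Whatever is left on the right-hand side has no counterpart on the left.
    result["added"] = sorted(unmatched)
    return result
```

A greedy pass like this can pair files suboptimally when several candidates are close, which is one reason configurable thresholds matter for this kind of matching.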

This is still a WIP: the functionality works right now, but it is not yet well tested or configurable, for example there is no way yet to specify similarity thresholds or depth limits for the file closures.