SourceCode-AI / aura

Python source code auditing and static analysis on a large scale
GNU General Public License v3.0
486 stars 31 forks

Don't export blobs in unchanged locations #5

Closed RootLUG closed 3 years ago

RootLUG commented 3 years ago

When scanning source code, Aura can "extract" blobs from the code and scan them separately. This is very useful, for example, for scanning the content of a string passed to the eval or exec function, because the blob is parsed and analyzed in the same way as the input Python source file. Blob extraction is triggered for strings longer than the threshold specified in the config file, and in some cases it can degrade performance if the source file contains many long string definitions.
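To make the idea concrete, here is a minimal sketch of threshold-based blob extraction using only the stdlib `ast` module. It is not Aura's actual implementation: the function name `extract_blobs` and the constant `MIN_BLOB_LENGTH` are hypothetical, and the real threshold comes from Aura's config file.

```python
import ast

# Hypothetical threshold; Aura reads its real value from the config file.
MIN_BLOB_LENGTH = 20


def extract_blobs(source, min_length=MIN_BLOB_LENGTH):
    """Yield (line_number, blob) for long string constants that parse as Python."""
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Constant) and isinstance(node.value, str)):
            continue
        if len(node.value) <= min_length:
            continue  # only strings longer than the threshold are considered
        try:
            ast.parse(node.value)  # can the blob be analyzed as Python source?
        except SyntaxError:
            continue  # not valid Python, nothing to scan
        yield node.lineno, node.value


sample = 'payload = "import os; os.system(\'id\')"\nexec(payload)\nname = "bob"\n'
blobs = list(extract_blobs(sample))
print(blobs)
```

Every long string costs an extra parse attempt, which is where the slowdown on files with many long string definitions would come from.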

This process can be further optimized by not extracting blobs at locations where the diff functionality has not detected any changes.

RootLUG commented 3 years ago

After diving into this problem, I concluded that this is not worth implementing (atm), given the possible problems versus the speedup it would give for aura diff. My reasoning:

The diff (difflib.unified_diff) would need to be replaced with a re-implementation that exposes the line numbers of changed locations. This would cause code duplication, since it would basically be a copy-paste of the stdlib implementation. The aggregated line numbers of changed lines would then need to be exposed somehow (ScanLocation.metadata?) so that the analyzer plugins (string finder/blob extractor) can read them and decide whether to extract based on the changes. This is doable, but it would break often due to the optimizations done by Aura.
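For illustration, the kind of information such a re-implementation would have to expose could look roughly like the sketch below. It uses `difflib.SequenceMatcher.get_opcodes()`, the stdlib class that `unified_diff` is built on. The helper name `changed_lines` is hypothetical, and how the result would be attached to `ScanLocation.metadata` is left open here, as in the comment above.

```python
import difflib


def changed_lines(old, new):
    """Return the 1-based line numbers in `new` that were added or modified.

    Pure deletions have no line in `new`, so they are not represented.
    """
    old_lines = old.splitlines()
    new_lines = new.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    changed = set()
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            # j1:j2 is a 0-based half-open range into new_lines
            changed.update(range(j1 + 1, j2 + 1))
    return changed


old_src = 'a = 1\nb = "x" * 10\nprint(a)\n'
new_src = 'a = 1\nb = "y" * 10\nprint(a)\nprint(b)\n'
result = changed_lines(old_src, new_src)
print(sorted(result))
```

An analyzer would then compare the line number of each candidate string against this set before deciding whether to extract it.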

Namely, this will break due to constant propagation. Consider a string definition that is later propagated by AST optimizations into some function call (eval(x), for example). The function call receiving the string has not changed, so the blob extraction would skip/ignore it, even though the code from which the constant originated has changed. This is just one of the potential scenarios where a difference can be (wrongly) ignored, which is far worse (as the user relies on aura reporting the differences) than the performance gain from skipping unchanged locations for blob extraction.
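The scenario can be reproduced with a small, self-contained sketch. It does not use Aura's own AST rewriting. Instead it assumes, for illustration only, that a propagated constant is reported at the line of the call that receives it, and the helpers `changed_lines` and `eval_call_lines` are hypothetical.

```python
import ast
import difflib


def changed_lines(old, new):
    """Return the 1-based line numbers in `new` that were added or modified."""
    matcher = difflib.SequenceMatcher(
        a=old.splitlines(), b=new.splitlines(), autojunk=False
    )
    changed = set()
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            changed.update(range(j1 + 1, j2 + 1))
    return changed


def eval_call_lines(source):
    """Return the line numbers of direct calls to eval or exec."""
    return [
        node.lineno
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id in ("eval", "exec")
    ]


old_src = 'x = "1 + 1"\nprint("start")\neval(x)\n'
new_src = 'x = "__import__(\'os\').system(\'id\')"\nprint("start")\neval(x)\n'

changed = changed_lines(old_src, new_src)  # only the assignment changed
calls = eval_call_lines(new_src)           # the eval(x) call itself
skipped = [line for line in calls if line not in changed]
print(changed, calls, skipped)
```

Only the assignment line is marked as changed, so a filter keyed on the call's location would skip `eval(x)` even though the string it now receives is different.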