chainguard-dev / malcontent

detect malicious program behaviors
Apache License 2.0

Improve diff performance #429

Closed egibs closed 1 month ago

egibs commented 1 month ago

Closes: #426

Diff performance has been lacking lately. The changes made in #356 were a little too greedy: they calculated the Levenshtein distance and created reports for files we don't care to diff.
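For context on why those extra calculations hurt, here is a minimal sketch of a Levenshtein distance function. It is not the `levenshtein.Calculate` implementation used in this repository, and the example paths in `main` are made up; it only illustrates that each call does work proportional to the product of the two input lengths, so every unnecessary call adds up quickly across hundreds of thousands of files.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// levenshtein returns the edit distance between a and b using the
// two-row dynamic-programming formulation. The nested loops make the
// cost of one call O(len(a) * len(b)).
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // distance to the empty prefix of b
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			// deletion, insertion, substitution
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

func main() {
	// Hypothetical paths, chosen only for illustration.
	fmt.Println(levenshtein("spark-3.5.0/bin/spark-shell", "spark-3.5.1/bin/spark-shell"))
}
```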

I was able to diff Spark and generate a profile for one of the earlier iterations of the code in this PR. The profile showed the levenshtein.Calculate method consuming most of the CPU time: [image: CPU profile]

Additionally, the concurrency wasn't implemented correctly, which caused a lot of performance issues.

This PR re-adds the filtering of paths and reworks the concurrency used when creating the flattened reports. It also adds concurrency for each of the scans: part of the long execution time was the serial creation of the file reports, which will now happen concurrently.
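The general shape of "filter first, then build reports concurrently" can be sketched as follows. This is not the code in this PR: `fileReport`, `buildReports`, the `keep` predicate, and the `scan` callback are all hypothetical names, and the semaphore-plus-mutex pattern is just one common way to bound concurrency with the standard library.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"sync"
)

// fileReport is a hypothetical stand-in for a per-file report.
type fileReport struct {
	Path string
	Size int
}

// buildReports drops paths rejected by keep, then runs scan on the
// remaining paths with at most maxWorkers goroutines in flight.
func buildReports(paths []string, maxWorkers int, keep func(string) bool, scan func(string) fileReport) map[string]fileReport {
	if maxWorkers < 1 {
		maxWorkers = 1
	}
	var (
		mu sync.Mutex // guards reports
		wg sync.WaitGroup
	)
	reports := make(map[string]fileReport)
	sem := make(chan struct{}, maxWorkers) // counting semaphore

	for _, p := range paths {
		if !keep(p) {
			continue // filtered out before any expensive work is done
		}
		wg.Add(1)
		sem <- struct{}{} // blocks while maxWorkers scans are running
		go func(p string) {
			defer wg.Done()
			defer func() { <-sem }()
			r := scan(p)
			mu.Lock()
			reports[p] = r
			mu.Unlock()
		}(p)
	}
	wg.Wait()
	return reports
}

func main() {
	// Hypothetical inputs, for illustration only.
	paths := []string{"a/bin/x", "a/.git/config", "a/lib/y"}
	keep := func(p string) bool { return !strings.Contains(p, "/.git/") }
	scan := func(p string) fileReport { return fileReport{Path: p, Size: len(p)} }

	reports := buildReports(paths, runtime.NumCPU(), keep, scan)
	fmt.Println(len(reports), "reports created")
}
```

Filtering before the goroutine is started keeps the skipped paths from consuming a worker slot at all, and the buffered channel caps the number of simultaneous scans so that a very large file list does not spawn an unbounded number of goroutines.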

Prior to this change, running a diff on the fully-extracted Spark files would not finish; now the same scan takes ~15 minutes (with two maps consisting of 759,999 and 759,993 elements, respectively):

________________________________________________________
Executed in  898.07 secs    fish           external
   usr time   39.35 mins    0.10 millis   39.35 mins
   sys time   39.94 mins    3.66 millis   39.94 mins