Investigate diff performance

egibs commented 1 month ago

The diff code path is separate from the usual scanning code path and is much heavier than the latter in the following ways:

1. Each provided path is scanned (this is twice as many file extractions, scans, reports, etc.)
2. The source and destination paths are interrogated to see which files or behaviors appear in one or the other (or both)
3. If files are modified (not added or removed), an O(n^2) loop is used to build a list of reports (though this is not guaranteed to run)
4. A final diff report is returned

Even within the diff code path, not every run performs all four steps: either steps one, two, and four run, or steps one through four run.
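
To illustrate why the modified-files step gets expensive, here is a minimal sketch of pairing source and destination reports by path. It is not the actual bincapz code: `fileReport`, `pairQuadratic`, and `pairIndexed` are hypothetical names, and the sketch assumes paths are unique within each side. The nested-loop version does O(n*m) comparisons, while indexing one side in a map first brings that down to O(n+m).

```go
package main

import "fmt"

// fileReport is a hypothetical stand-in for a per-file scan result.
type fileReport struct {
	Path      string
	Behaviors []string
}

// pairQuadratic matches each source report to the destination report with
// the same path using nested loops: O(n*m) path comparisons.
func pairQuadratic(src, dst []fileReport) [][2]fileReport {
	var pairs [][2]fileReport
	for _, s := range src {
		for _, d := range dst {
			if s.Path == d.Path {
				pairs = append(pairs, [2]fileReport{s, d})
				break
			}
		}
	}
	return pairs
}

// pairIndexed indexes the destination reports by path first, so each source
// report needs a single map lookup: O(n+m) overall.
func pairIndexed(src, dst []fileReport) [][2]fileReport {
	byPath := make(map[string]fileReport, len(dst))
	for _, d := range dst {
		byPath[d.Path] = d
	}
	var pairs [][2]fileReport
	for _, s := range src {
		if d, ok := byPath[s.Path]; ok {
			pairs = append(pairs, [2]fileReport{s, d})
		}
	}
	return pairs
}

func main() {
	src := []fileReport{{Path: "bin/a"}, {Path: "lib/b.so"}, {Path: "bin/removed"}}
	dst := []fileReport{{Path: "lib/b.so"}, {Path: "bin/a"}, {Path: "bin/added"}}
	// Both functions should find the same files present on both sides.
	fmt.Println(len(pairQuadratic(src, dst)), len(pairIndexed(src, dst)))
}
```

With a few dozen files the difference is negligible; with tens of thousands of modified files the nested loop performs hundreds of millions of comparisons.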

In cases where we're diffing packages with thousands of files, the doubling up can tack on a substantial amount of time (and increase the resource usage overhead). For smaller packages or single files, this is not an issue. We used to only perform diff checks on library files, but there's always the possibility that excluding other file types would miss critical findings.

egibs commented 1 month ago

I'm attempting to grab a profile of a diff between two Spark directories (just the aarch64 and x86_64 builds, so no version differences).

The memory usage is extremely high but we'll know more once the profiles are rendered.
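
For anyone who wants to capture a comparable memory profile, a minimal sketch using Go's standard `runtime/pprof` package follows. This is an assumption about tooling, not a description of how the profile above was taken; `writeHeapProfile` and the `heap.pprof` file name are hypothetical, and the diff itself is only marked by a comment.

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// writeHeapProfile forces a garbage collection so the profile reflects
// up-to-date allocation statistics, then writes a heap profile to path.
func writeHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	runtime.GC()
	return pprof.WriteHeapProfile(f)
}

func main() {
	// The diff between the two directories would run here (hypothetical).

	if err := writeHeapProfile("heap.pprof"); err != nil {
		log.Fatal(err)
	}
}
```

The resulting file can be inspected with `go tool pprof heap.pprof`.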

egibs commented 1 month ago

I was able to pare down the memory usage by revisiting the concurrency here: https://github.com/chainguard-dev/bincapz/blob/bdcb6400223228bdf10efa48f9f8c9ed99d8524a/pkg/action/diff.go#L177-L197

This isn't an issue with small numbers of files but, again, it becomes problematic with thousands or hundreds of thousands of files.
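
One common way to keep memory in check when scanning this many files is to cap the number of scans in flight. The sketch below shows that pattern with a buffered channel used as a semaphore. It is a generic illustration and may not match the change made to the linked code; `scanAll`, its `limit` parameter, and the `scan` callback are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// scanAll runs scan on every path with at most limit goroutines in flight.
// Results are returned in the same order as paths.
func scanAll(paths []string, limit int, scan func(string) string) []string {
	if limit < 1 {
		limit = 1 // an unbuffered semaphore would block forever
	}
	results := make([]string, len(paths))
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup

	for i, p := range paths {
		sem <- struct{}{} // blocks while limit scans are already running
		wg.Add(1)
		go func(i int, p string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when this scan ends
			results[i] = scan(p)     // each goroutine writes its own index
		}(i, p)
	}
	wg.Wait()
	return results
}

func main() {
	paths := []string{"bin/a", "lib/b.so", "share/c.jar"}
	out := scanAll(paths, 2, func(p string) string { return "scanned " + p })
	for _, r := range out {
		fmt.Println(r)
	}
}
```

Because the loop blocks before starting a new goroutine, at most `limit` scans (and their buffers) exist at once, regardless of how many files are queued.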

I'm still waiting to see how long a full Spark diff takes with my changes.

chainguard-dev / malcontent

Investigate diff performance #426