Closed egibs closed 1 month ago
I'm attempting to grab a profile of a diff between two Spark directories (just aarch64 and x86_64, so no version differences).
The memory usage is extremely high but we'll know more once the profiles are rendered.
I was able to pare down the memory usage by revisiting the concurrency here: https://github.com/chainguard-dev/bincapz/blob/bdcb6400223228bdf10efa48f9f8c9ed99d8524a/pkg/action/diff.go#L177-L197
When dealing with a small number of files this isn't an issue, but it becomes problematic when looking at thousands or hundreds of thousands of files.
I'm still waiting to see how long a full Spark diff takes with my changes.
The diff code path is separate from the usual scanning code path and is much heavier than the latter in the following ways:
Even within the `Diff` code path, either steps one, two, and four can run, or steps one through four can run. In cases where we're diffing packages with thousands of files, this doubling up can add a substantial amount of time (and increase the resource usage overhead). For smaller packages or single files, this is not an issue. We used to perform diff checks only on library files, but excluding other file types always carries the risk of ignoring critical findings.