chainguard-dev / malcontent

detect malicious program behaviors
Apache License 2.0
407 stars 26 forks source link

Make all map operations concurrency-safe; fix nested archive extraction #424

Closed egibs closed 1 month ago

egibs commented 1 month ago

Closes: https://github.com/chainguard-dev/bincapz/issues/423

We were still using map[any]any types for operations happening within Goroutines. This was causing non-deterministic numbers of paths to be scanned (sometimes correct, sometimes way too many, sometimes too few). The main culprit was extractedFiles := make(map[string]bool) which was used when handling nested archive extraction, so really, this problem was limited to scanning archives or nested archives.

This PR replaces all remaining map[any]any types used within scan.go with sync.Map types.

With the original testing, I would see 1/4-1/5 of the attempts produce an incorrect number of paths and now the number of scan paths is always correct (as checked against an older version of bincapz).

To validate this change, I'm running an endless loop and printing the number of files:

$ while true; go run . --min-file-risk any --format=json ~/Downloads/archive.zip | jq -r '.Files | length'; end
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9

Earlier, there were several instances of 4 instead of 9 which was the giveaway that something wasn't right.

egibs commented 1 month ago

Turns out this bug exists in the non-concurrent versions of bincapz as well. There's likely some sort of bug with the nested extraction process.

egibs commented 1 month ago

Turns out this bug exists in the non-concurrent versions of bincapz as well. There's likely some sort of bug with the nested extraction process.

Fixed in 3e1d4f2 (#424) and 9d7e2d8 (#424). I added a recursive portion to the nested archive extraction so that we extract all possible archives.