anchore / grype

A vulnerability scanner for container images and filesystems
Apache License 2.0
8.17k stars 528 forks source link

fix: Go main module pseudo version matching: turn off by default #1894

Closed luhring closed 1 month ago

luhring commented 1 month ago

About 6 weeks ago, I opened https://github.com/anchore/grype/pull/1797 in an effort to reduce false negatives for the main modules of Go binaries. Before that PR, Grype's previous behavior was to avoid vulnerability matching for the main module altogether if the detected module version was prefixed with v0.0.0-. Grype's reasoning for this had been that Syft will manufacture pseudo versions that start with v0.0.0- in the case where it's not able to find a real version in the Go binary itself.

By making the change in the PR linked above, we indeed saw some reduction in false negatives πŸŽ‰ , which was great. We weren't sure about the larger impact of false positives. I only speculated about the potential effects:

My reason for expecting them is that I can dream up edge cases, like when the GHSA fixed version is v3.0.1, but the computed pseudoversion would be v0.0.0-<some commit that's later than v3.0.1>, and so Grype would show the main module as vulnerable when it's not.

The bad news: After seeing this change released and used on a broader set of images, it looks like this net effect is that Grype's F1 score is worsened considerably more than we anticipated. We've been able to recover some recall, but at a painful hit to precision.

The good news: I saw that after my original PR was merged, @spiffcs opened https://github.com/anchore/grype/pull/1816, which made this new behavior configurable and defaulted to on. IMHO, that was brilliant, and in hindsight I probably should've added that in my original PR, too. 😞

At scale, this now-configurable behavior is working very much like CPE matching: it's great for scenarios where the user is willing to pay a high FP cost for the chance at not missing any matches, but on average it adds more noise than signal.

So with this PR, I suggest we adjust the noisy config option to work just like language package CPE matching: off by default but configurable to on, such that consumers can get more sane behavior by default, but they can crank up the sensitivity knob if they know what they're doing.

Curious for your thoughts!

wagoodman commented 1 month ago

Thanks for the insights here! Are there any specific modules/version/CVE data points that you have? I'd like to try and capture these as new data that we use continually in the quality gates for vunnel/grype-db/grype (happy to help).

luhring commented 1 month ago

Thanks for the insights here! Are there any specific modules/version/CVE data points that you have? I'd like to try and capture these as new data that we use continually in the quality gates for vunnel/grype-db/grype (happy to help).

Great idea. Here's one handful of examples: https://github.com/search?q=repo:wolfi-dev/advisories+%22which+means+the+installed+version+was+misidentified%22&type=code

kzantow commented 1 month ago

I could certainly be wrong here, but for what it's worth, I think the right behavior is to consider only comparing pseudo-versions with pseudo-versions, but maybe consider using the timestamp portion. What I mean by this is as follows: I've seen go module versions be one of 3 distinct things (although the require syntax docs only mention 2):

Vulnerabilities are reported against the best version someone can find, and people use modules from different sources including, say a main branch, which means we will potentially see any combination of these versions on both sides of the matching equation. If we have a pseudo-version with all zeroes, there's no direct relation to a real semver-like version and we really just can't compare them -- it's the proverbial apples and oranges. If we have a speculated pseudo-version, there's a relation to both and we could potentially compare to either but comparing to a full-zero pseudo-version again isn't going to be especially fruitful. So my proposal would be as follows:

if <both versions have semver-like numbering>
  compare semver section
else if <both versions have date-like numbering>
  compare date section
else
  don't compare

The <date> portion is supposed to be repository revision date, which admittedly is not a substitute for proper version lineage. But I believe it's the best we can do without having more context about the hash info. Or I could be all wrong!

luhring commented 1 month ago

Hm, I think it's a little simpler than that... In general, I think pseudo versions are fine to use in vulnerability matching. I don't believe we need to create a new kind of version comparison logic just for them.

The specific issue here is that there's a case where Syft is creating new pseudo versions on the fly, that is, without sourcing them from data found in the actual Go binary. It only does this for Go binary main modules, and only when there's no detected semver or pseudo version for the module already.

This logic is here:

https://github.com/anchore/syft/blob/ac34808b9c55bb274b1205f9b5d9cf495239577d/syft/pkg/cataloger/golang/parse_go_binary.go#L196-L207

So from Grype's perspective, the sign that a false positive is likely to happen is when a Go main module has a version prefixed by v0.0.0-. This doesn't guarantee a false positive, or even guarantee that Syft crafted this value on the fly, but it seems to produce false positives more often then not, due to the (seemingly common) case where Syft has gotten the version value completely wrong (e.g. v0.0.0... instead of v3.4.2...).

luhring commented 1 month ago

I think the change in this PR will help recover Grype's accuracy the fastest. But maybe we could treat this as an incremental improvement, and do more to help after? πŸ€”

For example, what if Syft was more explicit in its output data that it had invented a module version rather than detected it? This would let us adjust the logic in Grype to be more precise, where it could turn off matching (by default) only for the case where the v0.0.0- version was known to be made up rather than detected?

spiffcs commented 1 month ago

Throwing in a πŸ‘ to the @luhring comment of:

I think the change in this PR will help recover Grype's accuracy the fastest.

I'm pro changes that make our tool by default better for people to use. If we merge this PR I can file an issue summarizing the other problems we want to fix going forward surrounding syft's hallucinated Golang versions and it's ability to communicate to scanners whether the version in the package is a best guess or actually read from the metadata

luhring commented 1 month ago

Thanks @spiffcs!