anchore / grype

A vulnerability scanner for container images and filesystems
Apache License 2.0

grype image scan results non-deterministic #522

Open Dentrax opened 2 years ago

Dentrax commented 2 years ago

What happened:

grype generates different output content for the same image, which breaks reproducibility.

The motivation comes from https://github.com/in-toto/attestation/issues/58, which proposes putting the output result digest in the vuln spec. cc: @developer-guy

Not sure whether this is intentional or time/map object related.

What you expected to happen:

All output results for exactly the same IMAGE@sha256:digest should have the same digest.

How to reproduce it (as minimally and precisely as possible):

$ docker image pull golang:1.17
$ grype golang:1.17 --output cyclonedx --file result1
$ docker image rmi golang:1.17
$ grype golang:1.17 --output cyclonedx --file result2
$ sha256sum result1
5ce609c4b26876f6394d57182346bbf46bc863753010b990b862d49caa9874ee result1
$ sha256sum result2
d1929364ecb727d14615d1b813f3c9b499ada7ef4979ff1c933c73444fdff84d result2
$ grype golang:1.17 --output json --file json1
$ grype golang:1.17 --output json --file json2
$ sha256sum json1
c9883e5a2c64631448145beaf20af51a1d085f3d47bba3db6555d28839a82072  json1
$ sha256sum json2
c0e2090127d7ba33f3b6f09732a26789f5c65f9c7068063507813793b35d44a3  json2

Anything else we need to know?:

I tried the same commands with Trivy, and both the SARIF and JSON output formats produced the same digest:

$ trivy image --format template --template "@./sarif.tpl" -o report1.sarif golang:1.17
$ trivy image --format template --template "@./sarif.tpl" -o report2.sarif golang:1.17
$ sha256sum report1.sarif
2fcf23e25debe08f3b5490f077bd926d85afc67c94df28f173bd0aa766e4ce24  report1.sarif
$ sha256sum report2.sarif
2fcf23e25debe08f3b5490f077bd926d85afc67c94df28f173bd0aa766e4ce24  report2.sarif

Maybe we can get help from the trivy team so cc'ing @knqyf263.

trivy: 0.21.2

Environment:

luhring commented 2 years ago

Hi @Dentrax, thanks for the issue!

I saw you ran this command:

grype golang:1.17 --output cyclonedx --file result1

The CycloneDX output contains data that's known to be nondeterministic, like a timestamp. Because of this, there's no way to expect the digests of two scans to be identical.
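
To illustrate why a single timestamp is enough to change the digest, here is a minimal Python sketch. The two documents are made-up stand-ins, not real CycloneDX output, and the helper name `sha256_of` is hypothetical; only the `timestamp` value differs between them.

```python
import hashlib
import json


def sha256_of(document: dict) -> str:
    """Return the SHA-256 hex digest of a document serialized as JSON."""
    payload = json.dumps(document, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical scan reports: identical findings, different scan times.
scan_a = {"vulnerabilities": ["CVE-0000-0001"], "timestamp": "2021-12-01T10:00:00Z"}
scan_b = {"vulnerabilities": ["CVE-0000-0001"], "timestamp": "2021-12-01T10:00:14Z"}

# The findings match, yet the digests do not.
digests_differ = sha256_of(scan_a) != sha256_of(scan_b)
print(digests_differ)
```

Any field that changes from run to run has the same effect, which is why the whole-file digest is only stable if every such field is removed or fixed.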

I see you ran Trivy with a template specified. You can do the same thing with Grype, and this gives you enough control of Grype's output to ensure that results are reproducible (and that you'd get the same digest between multiple scans).

Does that make sense?

Dentrax commented 2 years ago

I tried passing the --output json flag, as you can see in the issue, but it produces non-deterministic digests too. I think it's related to what you said about cyclonedx (timestamps etc.).

@luhring Ability to pass custom templates would make sense!

luhring commented 2 years ago

Cool!

For how to use templates with Grype, see: https://github.com/anchore/grype#using-templates

For the JSON output format (and possibly others), I think it's worth a discussion on whether we want to modify the format to become deterministic. This would mean that we lose metadata like timestamps, but maybe that's okay. 🤔

luhring commented 2 years ago

Another thought... in the name of reproducible results, even with code changes to Grype's output formats, I think we should document the additional steps the user needs to perform to guarantee a reproducible result, such as:

* obtaining the vulnerability database ahead of time, and telling Grype not to update the database at execution time

* ensuring that the scan target itself is referenced in a deterministic way (e.g. an image **digest**)

Dentrax commented 2 years ago

Sounds so cool! Moreover, by performing these actions, maybe we can upload the deterministic scan result digest to ~fulcio~ Rekor. 🤔

So we can ensure any image foo@sha256:bar in this case, produces exactly baz scan result digest. Not so sure what we could do with it later, but it would be a cool idea. cc: @dlorenc

luhring commented 2 years ago

That's interesting. Would we want to upload the scan signature+digest to Rekor? I'm not familiar with how this would fit into Fulcio yet.


we can ensure any image foo@sha256:bar in this case, produces exactly baz scan result digest

There's another important point about reproducibility here: A given fixed image digest should be scanned frequently, and with the latest vulnerability data available at the time, because new vulnerabilities are discovered every day (and, even previously discovered vulnerabilities have their data in upstream data sources updated from time to time).

With this recommended approach of scanning repeatedly, with new vulnerability data, we wouldn't want to assert that all scan results have the same digest. We'd want to allow for new vulnerability matches to be discovered, reported, and used as input to policies wherever appropriate.

^ This point might be obvious, but I wanted to make it explicit just in case, since we're talking about having an image scan produce consistent results. 😃

Dentrax commented 2 years ago

I'm not familiar with how this would fit into Fulcio yet.

My bad, I meant Rekor. 🙈

we wouldn't want to assert that all scan results have the same digest.

Oh, now I clearly see the concern and why we should not assert the digests. But what if we are using the same vuln-db version? Let's assume we have the vuln-db versioned v1. And 2 same images with the same digests. In this case, would it make sense to assert that all scan results have the same digest?

So we can push a tlog to Rekor such as: _I scanned the image foo@sha256:bar against vuln-db v1 using grype v0.26.1 and I expect a JSON output that has digest qux._

But still not so sure whether it makes sense since we update the vuln-db every X hour. 🤷

luhring commented 2 years ago

But what if we are using the same vuln-db version? Let's assume we have the vuln-db versioned v1. And 2 same images with the same digests. In this case, would it make sense to assert that all scan results have the same digest?

Yup, exactly! We would be able to expect reproducible scan results in this particular scenario.
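
A rough sketch of that scenario in Python, under a few assumptions: `grype` is on the PATH, the vulnerability database was fetched beforehand (for example with `grype db update`), and the `GRYPE_DB_AUTO_UPDATE` environment variable is used to keep Grype from refreshing it during the scan. The helper name `pinned_scan_command` is made up for illustration.

```python
import os
import re

# An image reference pinned by digest, e.g. "golang@sha256:<64 hex chars>".
DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def pinned_scan_command(image_ref: str, report_path: str):
    """Build the argv and environment for a scan with fixed inputs.

    Rejects tag-only references such as "golang:1.17", because a tag can
    point at a different image later on.
    """
    if not DIGEST_REF.match(image_ref):
        raise ValueError(f"not a digest-pinned reference: {image_ref!r}")
    argv = ["grype", image_ref, "--output", "json", "--file", report_path]
    env = dict(os.environ)
    # Assumption: the database was downloaded ahead of time; keep Grype
    # from updating it while scanning.
    env["GRYPE_DB_AUTO_UPDATE"] = "false"
    return argv, env
```

The returned pair can be passed to `subprocess.run(argv, env=env, check=True)`. Building the command separately from running it makes it easy to log exactly which inputs a given scan used.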

So we can push a tlog to Rekor such as: I scanned the image foo@sha256:bar against vuln-db v1 using grype v0.26.1 and I expect a JSON output that has digest qux.

Yeah, I like this. And IMHO we should also provide more information about the vulnerability database, including its digest.

But still not so sure whether it makes sense since we update the vuln-db every X hour.

I think we should strive for reproducibility 💯 under the right circumstances. And we should think about how people will consume these kinds of vulnerability scan attestations and Rekor entries to make informed decisions about the security of their artifacts.
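
One possible shape for such a record, sketched in Python. The field names of the record itself are invented for illustration and do not follow any in-toto or Rekor schema. The values read from the report (`descriptor.name`, `descriptor.version`, `descriptor.db.checksum`, `descriptor.db.built`) are the ones that appear in the `descriptor` block of Grype's JSON output shown later in this thread.

```python
import hashlib
import json


def scan_record(image_ref: str, report: dict) -> dict:
    """Summarize one scan: what was scanned, with what, and the result digest."""
    descriptor = report["descriptor"]
    matches = json.dumps(report["matches"], sort_keys=True).encode("utf-8")
    return {
        "image": image_ref,
        "scanner": {"name": descriptor["name"], "version": descriptor["version"]},
        "db": {
            "checksum": descriptor["db"]["checksum"],
            "built": descriptor["db"]["built"],
        },
        # Digest of the matches only, so the scan time does not affect it.
        "matchesDigest": "sha256:" + hashlib.sha256(matches).hexdigest(),
    }
```

Two scans of the same image with the same Grype version and the same database would then yield equal records, while a newer database changes the `db` fields and possibly the digest, which keeps the two cases distinguishable.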

Dentrax commented 2 years ago

How should we proceed here? :)

wagoodman commented 10 months ago

Not all output formats are guaranteed to be reproducible. For instance, CycloneDX can never be reproducible given that IDs are recommended to be random.

That being said, there is a chance to make grype JSON documents reproducible:

❯ grype golang:1.17 --output json --file result1.json
 ✔ Vulnerability DB                [no update available]
 ✔ Loaded image                                                                                                                                golang:1.17
 ✔ Parsed image                                                                    sha256:8685b3216ef4a80742c4d5f29f547838997cc0c7cca68222cfdab7c6821ccf5b
 ✔ Scanned for vulnerabilities     [1130 vulnerability matches]
   ├── by severity: 36 critical, 288 high, 308 medium, 32 low, 448 negligible (18 unknown)
   └── by status:   443 fixed, 687 not-fixed, 0 ignored
A newer version of grype is available for download: 0.74.2 (installed version is 0.74.0)

❯ grype golang:1.17 --output json --file result2.json
 ✔ Vulnerability DB                [no update available]
 ✔ Loaded image                                                                                                                                golang:1.17
 ✔ Parsed image                                                                    sha256:8685b3216ef4a80742c4d5f29f547838997cc0c7cca68222cfdab7c6821ccf5b
 ✔ Scanned for vulnerabilities     [1130 vulnerability matches]
   ├── by severity: 36 critical, 288 high, 308 medium, 32 low, 448 negligible (18 unknown)
   └── by status:   443 fixed, 687 not-fixed, 0 ignored
A newer version of grype is available for download: 0.74.2 (installed version is 0.74.0)
❯ diff result1.json result2.json
134982c134982
<    "file": "result1.json",
---
>    "file": "result2.json",
135062c135062
<   "timestamp": "2024-01-25T16:31:22.174899-05:00"
---
>   "timestamp": "2024-01-25T16:31:36.511252-05:00"

Keeping a time element is critical to vulnerability scans, but there are two time elements in the json output:

❯ cat result2.json | jq '.descriptor'
{
  "name": "grype",
  "version": "0.74.0",
  "configuration": {
    ...
  },
  "db": {
    "built": "2024-01-25T01:27:56Z",
    "schemaVersion": 5,
    "location": ".../Library/Caches/grype/db/5",
    "checksum": "sha256:0e70dc967985e5a56678500b60aefb9442183c03301261252c7abd7dfae92784",
    "error": null
  },
  "timestamp": "2024-01-25T16:31:36.511252-05:00"
}
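
For reports this large, it can help to list which fields differ as paths rather than reading a line-based diff. The following is a minimal Python sketch of that idea; `differing_paths` is a hypothetical helper, not part of Grype, and it treats lists of unequal length as a single difference.

```python
def differing_paths(left, right, prefix=""):
    """Yield jq-style paths at which two parsed JSON values differ."""
    if isinstance(left, dict) and isinstance(right, dict):
        for key in sorted(set(left) | set(right)):
            path = f"{prefix}.{key}"
            if key not in left or key not in right:
                # Key present on one side only.
                yield path
            else:
                yield from differing_paths(left[key], right[key], path)
    elif isinstance(left, list) and isinstance(right, list) and len(left) == len(right):
        for index, (a, b) in enumerate(zip(left, right)):
            yield from differing_paths(a, b, f"{prefix}[{index}]")
    elif left != right:
        yield prefix or "."
```

After loading both files with `json.load`, `list(differing_paths(report1, report2))` gives the set of fields that would have to be dropped or fixed for the two documents to hash identically.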

Note:

We could add an option that removes .descriptor.timestamp from the grype output, which would make results reproducible when the same configuration and DB are used. When you are using different DBs or configurations, you would instead need to select the part of the grype document you care about and hash only that:

❯ cat result1.json | jq '.matches' | sha256sum
d149e542ee35687266abd6cef70b0038131ee854eb0750d98244acf2c3d760b6  -

❯ cat result2.json | jq '.matches' | sha256sum
d149e542ee35687266abd6cef70b0038131ee854eb0750d98244acf2c3d760b6  -

Such an option could be something like GRYPE_TIMESTAMP=false (env), but probably not a CLI flag.
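
As a post-processing alternative, the timestamp can be dropped after the scan and the remainder hashed. This is a minimal Python sketch, not a Grype feature: `stable_digest` and its `drop` parameter are hypothetical names. The digest it produces will not equal the output of `jq ... | sha256sum`, because the JSON is re-serialized in a canonical form first. The diff earlier in this comment also shows the output file name differing between runs, so that field would need to be added to `drop` as well; its exact key path is not shown in the thread, so it is left to the caller.

```python
import copy
import hashlib
import json


def stable_digest(report: dict, drop=(("descriptor", "timestamp"),)) -> str:
    """Hash a report after removing the given key paths.

    The default drops only `.descriptor.timestamp`; pass further paths for
    any other field that varies between runs in your setup.
    """
    trimmed = copy.deepcopy(report)  # leave the caller's report untouched
    for path in drop:
        node = trimmed
        for key in path[:-1]:
            node = node.get(key, {}) if isinstance(node, dict) else {}
        if isinstance(node, dict):
            node.pop(path[-1], None)
    # Sorted keys and fixed separators give a byte-stable serialization.
    canonical = json.dumps(trimmed, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Unlike hashing only `.matches`, this keeps the rest of the `descriptor` (Grype version, DB checksum) inside the digest, so a change of database or configuration still changes the result.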