Distro matchers should be guided by package type not detected distro #86

anchore / grype

Open wagoodman opened 4 years ago

wagoodman commented 4 years ago

Currently we use the detected distro to guide the rpm, deb, and apk matchers to find vulnerabilities. This is functional; however, it would be more accurate to use the package type (rpm, deb, apk) to select the vulnerability namespace rather than the detected distro (redhat:8, ubuntu:20, alpine:3.12).

Problem: we don't know the distro version from the package type, so it is not possible to select the "correct" vulnerability namespace. This is worth thinking about nonetheless.

zhill commented 1 year ago

Now that the PURLs contain distro information when generated by Syft, I think we could use the PURL info instead of the distro or types.

@wagoodman would an approach that uses the PURL info first, then falls back to other sources if not explicitly set in the PURL be reasonable?

I've also seen cases where folks pull in a package from another distro version (e.g. use a rhel 8 pkg in rhel 7) so this would allow us to properly handle that case if/when syft can detect it. But, assuming that a user has manually edited the syft output to be correct (PURLs etc), then Grype could consume it.

willmurphyscode commented 1 year ago

Agreed, this should definitely be possible now:

❯ syft -q alpine:latest -o json | jq -r '.artifacts[] | .purl'
pkg:apk/alpine/alpine-baselayout@3.4.3-r1?arch=aarch64&distro=alpine-3.18.3
pkg:apk/alpine/alpine-baselayout-data@3.4.3-r1?arch=aarch64&upstream=alpine-baselayout&distro=alpine-3.18.3
pkg:apk/alpine/alpine-keys@2.4-r1?arch=aarch64&distro=alpine-3.18.3
pkg:apk/alpine/apk-tools@2.14.0-r2?arch=aarch64&distro=alpine-3.18.3
pkg:apk/alpine/busybox@1.36.1-r2?arch=aarch64&distro=alpine-3.18.3
pkg:apk/alpine/busybox-binsh@1.36.1-r2?arch=aarch64&upstream=busybox&distro=alpine-3.18.3
pkg:apk/alpine/ca-certificates-bundle@20230506-r0?arch=aarch64&upstream=ca-certificates&distro=alpine-3.18.3
pkg:apk/alpine/libc-utils@0.7.2-r5?arch=aarch64&upstream=libc-dev&distro=alpine-3.18.3
pkg:apk/alpine/libcrypto3@3.1.2-r0?arch=aarch64&upstream=openssl&distro=alpine-3.18.3
pkg:apk/alpine/libssl3@3.1.2-r0?arch=aarch64&upstream=openssl&distro=alpine-3.18.3
pkg:apk/alpine/musl@1.2.4-r1?arch=aarch64&distro=alpine-3.18.3
pkg:apk/alpine/musl-utils@1.2.4-r1?arch=aarch64&upstream=musl&distro=alpine-3.18.3
pkg:apk/alpine/scanelf@1.3.7-r1?arch=aarch64&upstream=pax-utils&distro=alpine-3.18.3
pkg:apk/alpine/ssl_client@1.36.1-r2?arch=aarch64&upstream=busybox&distro=alpine-3.18.3
pkg:apk/alpine/zlib@1.2.13-r1?arch=aarch64&distro=alpine-3.18.3

We should be able to make a distro object from the package. I'll pick this up, since it seems like it will help match quality.
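As a rough illustration of building a distro object from a package, here is a minimal Python sketch that pulls the `distro` qualifier out of a PURL string like the ones above. The `PurlDistro` type and `distro_from_purl` function are hypothetical names, not Grype or Syft APIs, and the `<name>-<version>` split is only an assumption read off the sample output.

```python
from typing import NamedTuple, Optional
from urllib.parse import parse_qs


class PurlDistro(NamedTuple):
    """Hypothetical container for what the PURL says about the distro."""
    name: str
    version: Optional[str]  # None when the qualifier carries no version


def distro_from_purl(purl: str) -> Optional[PurlDistro]:
    """Return the distro named in the PURL's `distro` qualifier, if any."""
    # Qualifiers follow the first "?"; drop any "#subpath" suffix.
    _, _, rest = purl.partition("?")
    qualifiers = parse_qs(rest.partition("#")[0])
    values = qualifiers.get("distro")
    if not values:
        return None
    # Assumption: the value looks like "<name>-<version>", as in the
    # alpine output above. Distro names that contain a hyphen would
    # need smarter handling than splitting on the first "-".
    name, _, version = values[0].partition("-")
    return PurlDistro(name=name, version=version or None)


if __name__ == "__main__":
    print(distro_from_purl(
        "pkg:apk/alpine/musl@1.2.4-r1?arch=aarch64&distro=alpine-3.18.3"
    ))
```

A qualifier with no version part (for example `distro=debian`) yields a `PurlDistro` whose `version` is `None`, which is the partial-information case a caller would have to handle separately.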

willmurphyscode commented 1 year ago

I have a couple concerns here with the linked PR before I am ready to merge it:

1. There's a TODO: if we have an SBOM from a distro Grype doesn't support, but it contains packages from a distro we do support, we should log a warning but try to match against those packages. Do folks agree that this would be the right approach?
2. Constructing a Linux distro from the PURL is lossy (Debian example below). Do we consider this possibly blocked on Syft work to get more distro info into the PURL? Right now, I think the PR will never use the PURL distro on Debian because there's no version there, which probably covers the current behavior, but what if we start using other fields on the distro object to filter matches in the future?
3. How does this relate to https://github.com/anchore/grype/issues/827 - right now, I believe that syft, when scanning an image, will assume that all OS-package-manager packages in the image are from the detected distro, so the behavior of grype won't really change anyway on syft -o json | grype.

Syft commands showing information lost from distro node in SBOM vs distro key in PURL:

❯ syft -q -o json debian:unstable-slim | jq '.artifacts[] | .purl'
"pkg:deb/debian/apt@2.7.6?arch=arm64&distro=debian"

❯ syft -q -o json debian:unstable-slim | jq '.distro'
{
  "prettyName": "Debian GNU/Linux trixie/sid",
  "name": "Debian GNU/Linux",
  "id": "debian",
  "versionCodename": "trixie",
  "homeURL": "https://www.debian.org/",
  "supportURL": "https://www.debian.org/support",
  "bugReportURL": "https://bugs.debian.org/"
}

Brian-McM commented 1 year ago

> There's a TODO, that if we have an SBOM from a distro Grype doesn't support, but it contains packages from a distro we do support, we should log a warning but try to match against those packages. Do folks agree that this would be the right approach?

I would think that it's the right approach, and I would think you wouldn't need to even log a warning tbh. If a package was built for a specific distro, I would think that it would have the vulnerabilities for that package on that distro it was meant for, and you'd want to use that distro's CVE database to get the severity.

I would think that the biggest application of this is scanning images where you're copying packages from one layer to the next, likely with the final image being a scratch image. One other way of approaching this (if the PURL doesn't always have the correct information) is to fall back to the distribution that was detected on the layer where the package was found. I would think this would provide more accurate results than using the final layer's detected distro for all found packages.

wagoodman commented 1 year ago

In cases where there is a multi-stage build this could be useful:

FROM fedora:latest as fedora

# get some packages...
RUN dnf install ...

FROM ubuntu:latest as ubuntu

# get some packages ...
RUN apt install ...

FROM scratch

COPY --from=fedora ...
COPY --from=ubuntu ...

We don't have visibility in this case at all, since the image being analyzed only has one layer, the last scratch section. There isn't much we can do here. OS package managers don't really leave around information on a per-package basis to figure out which distro the package was sourced from (anyone: please correct me if I'm wrong and point out where to look for this!).

There is a pseudo-related issue https://github.com/anchore/syft/issues/435 which talks about trying to track all of the layers within an image that has a reference to this package (which is different). Syft and grype allow you to track all of the layer references for a package with --scope all-layers (by default syft uses --scope squashed). A typical image doesn't tend to change the evidence of what the OS is across layers (unless lots of deletions are occurring).

Brian-McM commented 12 months ago

Thanks for the response @wagoodman, and sorry for the late reply (I saw this when you posted it but forgot to respond).

> We don't have visibility in this case at all, since the image being analyzed only has one layer, the last scratch section

Right, that is a big "problem". I actually hadn't realised that was the case for multi-stage builds (which I actually use).

I suppose the only way to get distribution information from a multi-stage scratch build would be if the Dockerfile owner added a distribution hint to the Dockerfile explicitly. Would you agree @wagoodman?

This layer tracking approach could still be valuable for multi-layered images though, where the packages might not be installed through the package manager in the layer.

> There is a pseudo-related issue https://github.com/anchore/syft/issues/435 which talks about trying to track all of the layers within an image that has a reference to this package (which is different). Syft and grype allow you to track all of the layer references for a package with --scope all-layers (by default syft uses --scope squashed). A typical image doesn't tend to change the evidence of what the OS is across layers (unless lots of deletions are occurring).

~~So would you think that along with tracking what layer the packages were found in, syft could also record information about that layer (like that distribution)?~~

I'm wondering if what I was previously thinking even makes sense for images that aren't multi-stage (just regular multi-layered). You wouldn't really need to track the distribution of each layer and associate it with the packages on that layer, since I don't think it's easy (or even possible...) to have a "multi distro" build without using multi-stage builds. It's not like the distro is going to change across these layers (I could be wrong though...), so all we'd really need is for the distro detection logic to look for a distribution hint in the previous layers if it can't find one in the current layer.

zhill commented 10 months ago

👋 Great discussion so far! I've got a couple of use-cases for this capability that may help guide decisions here. Apologies if these duplicate some discussion above (I think there is some overlap with the "local build" discussion) :

1. Synthetic SBOMs crafted to represent an application or app-stack rather than one that is created from the analysis of a specific artifact. I've seen this in cases where SBOMs are either hand-crafted or composited from any existing sources but the user wants to be able to get a Grype scan on the whole application as a shippable unit. This is fairly niche but could be supported by this work in addition to the other more common use-cases.
2. Pulling packages from a different version of the same distro to accomplish things like back-porting beyond EOL boundaries (e.g. pull a CentOS 9 package into a CentOS 8-based image in a case where it is a newer major version or has fixes that are not planned to be back-ported).
3. Use of the package manager to install non-distro software (e.g. gcloud sdk, etc). Distro detection isn't applicable in this case, but using the PURL could clearly identify a package of a specific type as not being from the distro (e.g. empty distro) and thus enable Grype to invoke fall-back behavior such as using NVD or other sources. This case isn't clearly the target of this issue itself, but moving the detection of what namespace to match a package against from an SBOM-level construct to a per-artifact one gives the process what it needs to start differentiating those things. Though I grant that there are also other ways to solve that problem.

For the "what to do with partial distro information" case, there are some interesting options IMO depending on which use-case the user is trying to achieve since that may impact the reason for the partial information. Because of that, it seems like configurable behavior is best.

The options I can identify thus far (open to suggestions!) are:

1. (default) Fall back to the SBOM-level distro if present. If not, treat the package as it would have been if no PURL were present and use the existing default behavior for an SBOM without distro info.
2. Use as much distro information as possible and scan against all namespaces that match what info is present. For example, if only "distro=debian", then scan against all debian namespaces and return all results. The user can decide what to do but will see in each result which namespaces caused matches.
3. Fall back to using NVD-based matches since it's not determined that the package came from a distro at all. This seems like the right behavior for cases where the package is of the distro type, but cannot be confirmed to be from the actual distro vendor (the logic for detecting this is TBD, but the matching behavior would be ready). It would also allow users to manually modify their SBOMs to properly indicate such situations and have the vuln matches respect those edits and return expected results.
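The three options above could be expressed as a single configurable policy. The following Python sketch is only an illustration under stated assumptions: `PartialDistroPolicy`, `namespaces_for_partial`, the `"nvd"` label, and the `name:version` namespace strings are all hypothetical and do not reflect Grype's actual configuration or namespace format.

```python
from enum import Enum
from typing import List, Optional


class PartialDistroPolicy(Enum):
    """Hypothetical setting for PURLs that name a distro but no version."""
    SBOM_DISTRO = "sbom-distro"        # first option (proposed default)
    ALL_NAMESPACES = "all-namespaces"  # second option
    NVD = "nvd"                        # third option


def namespaces_for_partial(
    distro_name: str,
    policy: PartialDistroPolicy,
    sbom_distro: Optional[str],
    known_namespaces: List[str],
) -> List[str]:
    """Choose namespaces to search for a package with partial distro info.

    Namespaces are written as "name:version" purely for illustration.
    """
    if policy is PartialDistroPolicy.SBOM_DISTRO:
        # An empty list means "no SBOM distro either"; the caller would
        # then apply the existing behavior for SBOMs without distro info.
        return [sbom_distro] if sbom_distro else []
    if policy is PartialDistroPolicy.ALL_NAMESPACES:
        # Every known namespace whose distro name matches.
        return [ns for ns in known_namespaces
                if ns.split(":")[0] == distro_name]
    # Otherwise treat the package as not confirmed to be from the distro.
    return ["nvd"]


if __name__ == "__main__":
    known = ["debian:11", "debian:12", "alpine:3.18"]
    print(namespaces_for_partial(
        "debian", PartialDistroPolicy.ALL_NAMESPACES, None, known))
```

Keeping the choice in one enum means the matching code has a single place to branch on, and the second option's wider result set is only produced when the user asks for it.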

What do you all think?

willmurphyscode commented 10 months ago

> Use as much distro information as possible and scan against all namespaces that match what info is present. For example, if only "distro=debian", then scan against all debian namespaces and return all results. The user can decide what to do but will see in each result which namespaces caused matches.

I think it would be preferable to fall back to the SBOM distro in these cases; otherwise we might get a lot of false positives, for example from assuming that the package is the system perl that shipped with super old Debian or something.

Also, as part of this work, we should update Syft to include the Debian version number in the PURLs for debian packages it finds; that would be a better fix, but we still need to handle the case where there's a partial distro in the PURL.

> It would also allow users to manually modify their sboms to properly indicate such situations and have the vuln matches respect those edits and return expected results.

This is an interesting idea.

One other concern here is: Do other SBOM tools put distros in PURLs at all? We don't want grype to assume that its SBOM came from Syft too often.

zhill commented 10 months ago

> > Use as much distro information as possible and scan against all namespaces that match what info is present. For example, if only "distro=debian", then scan against all debian namespaces and return all results. The user can decide what to do but will see in each result which namespaces caused matches.
>
> I think it would be preferable to fall back to the SBOM distro in these cases; otherwise we might get a lot of false positives, for example from assuming that the package is the system perl that shipped with super old Debian or something.

I agree that should be the default, but there are use cases where the FPs are a tradeoff that a user may want to make (e.g. they don't know which version of Debian a package came from, so showing all lets them see the full surface). It's not a common case, but one that I think the tool could handle with explicit configuration from the user indicating they want to make that tradeoff. This is for the case where a user gets an SBOM they didn't create and/or that wasn't created from a single tool.

> Also, as part of this work, we should update Syft to include the Debian version number in the PURLs for debian packages it finds; that would be a better fix, but we still need to handle the case where there's a partial distro in the PURL.

Agreed fully.

> > It would also allow users to manually modify their sboms to properly indicate such situations and have the vuln matches respect those edits and return expected results.
>
> This is an interesting idea.
>
> One other concern here is: Do other SBOM tools put distros in PURLs at all? We don't want grype to assume that its SBOM came from Syft too often.

Agree that we shouldn't assume the Syft semantics for a field specifically, but we should be able to make it clear to a user which fields in the SBOM are used, how, and enable them to get the matching behavior they want if they craft an SBOM in a specific way to match the security process or scope they want to achieve. That's why I'm ok with reducing code complexity by pushing these decisions to configuration so the user can tell Grype how they want it to behave in ambiguous cases.

willmurphyscode commented 10 months ago

I did some more experimenting, and it looks like syft includes the Debian version except for trixie/sid/unstable:

❯ syft -q -o json debian:bookworm-slim | jq '.artifacts[] | .purl'
"pkg:deb/debian/adduser@3.134?arch=all&distro=debian-12"
...
❯ syft -q -o json debian:trixie-slim | jq '.artifacts[] | .purl'
"pkg:deb/debian/apt@2.7.6?arch=arm64&distro=debian"

I met with @wagoodman and I think we can move forward with the grype work, if we change syft to include the distro codename in the PURL if there's no version ID.

That leaves us with the following changes:

1. Change syft to include the distro codename if the version ID is not available.
2. Change grype to use the distro from the PURL if distro name and version (or name and codename) are present, unless the --distro flag is passed, in which case that should be used. If distro name and version are not both present on a package, fall back to the SBOM distro field.
3. As a configurable option, if distro name but not version is present, match against every namespace for that distro.
4. We'll make a follow-up issue to capture additional edge cases, such as exposing a grype config to use partial distro info from PURLs.

Item 2 will be implemented by fixing up https://github.com/anchore/grype/pull/1530. Item 1 needs a separate change to Syft.
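The precedence described for the grype change can be sketched as follows. This is a minimal Python illustration, not the code in the linked PR: the `Distro` dataclass and `resolve_distro` function are hypothetical names, and the sketch assumes each source has already been parsed into a name plus optional version and codename.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Distro:
    """Hypothetical distro record; not Grype's actual type."""
    name: str
    version: Optional[str] = None
    codename: Optional[str] = None

    def is_complete(self) -> bool:
        # "Complete" here means name plus either a version or a codename.
        return bool(self.name) and bool(self.version or self.codename)


def resolve_distro(
    cli_distro: Optional[Distro],
    purl_distro: Optional[Distro],
    sbom_distro: Optional[Distro],
) -> Optional[Distro]:
    """Pick the distro to match a single package against.

    Order: an explicit --distro value, then a complete PURL distro,
    then the SBOM-level distro (which may itself be missing).
    """
    if cli_distro is not None:
        return cli_distro
    if purl_distro is not None and purl_distro.is_complete():
        return purl_distro
    return sbom_distro


if __name__ == "__main__":
    sbom = Distro("debian", version="12")
    # A PURL with only "distro=debian" is partial, so the SBOM distro wins.
    print(resolve_distro(None, Distro("debian"), sbom))
```

Because the decision is made per package, an image that mixes packages from several distros could resolve each one differently while still honoring a user-supplied override.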

Does that sound good to everyone, @zhill and @wagoodman?

zhill commented 5 months ago

Thanks @willmurphyscode. I got distracted by other stuff and didn't get back to this for a while. The plan sounds good, and opens the door to allowing the PURL to describe other package sources even for the same type, such as an RPM from Google installed into a CentOS image. That could be detected and an accurate PURL created, which gives us some future-proofing here as the SBOM side evolves. Thanks!

willmurphyscode commented 1 month ago

This will be much easier to do as part of #2128, and requires that Syft puts some distro version info (even if only a codename) in the PURL even if the version is not available. I'm putting this back in the backlog and adding it to the schema v6 milestone.