anchore / syft

CLI tool and library for generating a Software Bill of Materials from container images and filesystems
Apache License 2.0

Docker base images should be included in the BOM #1199

Open captn3m0 opened 2 years ago

captn3m0 commented 2 years ago

What would you like to be added: A simple docker image with the following Dockerfile:

FROM php:7.4-cli

COPY scan.php /

should result in a SBOM that includes the base image as a component:

pkg:docker/library/php@7.4-cli

Why is this needed: A container image base image is also a "dependency". For popular base-images, this carries a lot of information, and this can be used to recursively look up other dependencies (that might have been included in the build process, but might not be part of the final image).

I'm not sure how feasible this is, considering docker doesn't seem to store the base image names, but this would be a great addition.
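To make the requested output concrete, here is a minimal sketch of how a Docker Hub reference could be turned into the PURL shown above. The function name `docker_purl` is hypothetical and is not part of syft. The sketch assumes a plain `name:tag` reference on Docker Hub, with no registry host, port, or digest.

```python
def docker_purl(image_ref: str) -> str:
    """Turn a Docker Hub reference such as 'php:7.4-cli' into a pkg:docker PURL.

    Assumption: image_ref is 'name' or 'name:tag' with no registry host or digest.
    """
    name, _, tag = image_ref.partition(":")
    # Official images have no namespace; Docker Hub stores them under 'library'.
    if "/" not in name:
        name = f"library/{name}"
    purl = f"pkg:docker/{name}"
    return f"{purl}@{tag}" if tag else purl


print(docker_purl("php:7.4-cli"))  # pkg:docker/library/php@7.4-cli
```

The hard part of this request is not formatting the PURL but recovering `php:7.4-cli` from the built image in the first place.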

kzantow commented 1 year ago

Hi @captn3m0 -- are you really looking to get php:7.4 properly cataloged with this request, and it's a duplicate of https://github.com/anchore/syft/issues/1197? Or is this actually a request to get the base image container added as a component?

captn3m0 commented 1 year ago

PHP here is just an example - this is a request for the latter (base images are ingredients, and should be included in a SBOM).

captn3m0 commented 1 year ago

Investigated this a bit. Docker does not return the base image ID, just the relevant statements from the Docker base image. For example, the amazoncorretto:8u342-alpine3.16-jre image includes the following information about the upstream:

ADD file:2a949686d9886ac7c10582a6c29116fd29d3077d02755e87e111870d63607725 in /

The corresponding dockerfile has:

FROM alpine:3.16

And the hash can actually be found in the alpine:3.16 image: https://github.com/docker-library/repo-info/blob/master/repos/alpine/remote/3.16.md#alpine316---linux-amd64

I'm thinking about generating such common hashes, and publishing them on Rekor so this would get picked up via https://github.com/anchore/syft/issues/1159.

The intended mapping here would be

2a949686d9886ac7c10582a6c29116fd29d3077d02755e87e111870d63607725 ->
  pkg:docker/library/alpine@3.16

Which files should be looked up could be left to syft, or perhaps I can publish a bloom filter that helps evaluate this quickly and locally (that is, answering "is this a relevant base image file?").
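A rough sketch of the lookup described above, assuming the image history is available as a list of `created_by` strings. The table `KNOWN_BASE_FILES` and the function `base_image_purl` are hypothetical. In the proposal the mapping would come from an externally published source rather than a hard-coded dictionary.

```python
import re
from typing import List, Optional

# Hypothetical lookup table; the single entry is the mapping given in this comment.
KNOWN_BASE_FILES = {
    "2a949686d9886ac7c10582a6c29116fd29d3077d02755e87e111870d63607725":
        "pkg:docker/library/alpine@3.16",
}

# Matches history entries such as "ADD file:<64 hex chars> in /".
ADD_FILE = re.compile(r"ADD file:([0-9a-f]{64}) in /")


def base_image_purl(history: List[str]) -> Optional[str]:
    """Return the PURL of the first history entry whose file hash is known."""
    for created_by in history:
        match = ADD_FILE.search(created_by)
        if match and match.group(1) in KNOWN_BASE_FILES:
            return KNOWN_BASE_FILES[match.group(1)]
    return None


history = [
    "ADD file:2a949686d9886ac7c10582a6c29116fd29d3077d02755e87e111870d63607725 in /",
    'CMD ["/bin/sh"]',
]
print(base_image_purl(history))  # pkg:docker/library/alpine@3.16
```

A bloom filter would only replace the `in KNOWN_BASE_FILES` membership check with a cheaper local pre-filter. The actual hash-to-PURL resolution would still need the full mapping.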

khan-a1 commented 1 year ago

Hi team, any update on this feature request? It would be great if Docker images could be added to the SBOM.

tgerla commented 1 year ago

Hi @khan-a1 and @captn3m0, sorry for the very long delay replying. We would like to understand a bit better your use case for including a reference to a docker image in the SBOM itself. Are you familiar with the different scoping options you can specify, with --scope?

We also have an open issue discussing ideas to expand the different scoping selections: https://github.com/anchore/syft/issues/15

Happy to re-engage on this issue and figure out how to move forward. Would you be able to join our community meeting at some point? It might be easier to talk things over live. https://github.com/anchore/syft/#join-our-community-meetings

captn3m0 commented 1 year ago

Will reply soon with a detailed proposal for why I think this is important.

I haven’t checked the scoping options yet.

I see there’s no meeting on the 21st Thursday, but I will try to join the one on 28th to explain this better.

captn3m0 commented 11 months ago

I've looked at the scoping options, and the various feature requests for that, and that doesn't fit this use-case.

An SBOM should be an actual artifact of all the components that went in building the final image. Docker base images are a relevant artifact imo.

The primary use case for this comes from current limitations around Syft's binary matching capabilities, which result in not everything in base images being detected. If anything is installed in the base image outside a "package" (this is very common behavior for official base images), Syft cannot detect it easily.

In such cases, the name of the base image itself is a huge helper in the SBOM. At endoflife.date, we provide EOL information for various products alongside their PURLs. These include PURLs for docker images. See these search results. For example, for composer, we provide the following PURLs:

-   purl: pkg:composer/composer/composer
-   repology: php:composer # this expands to various packages listed at https://repology.org/project/php:composer/versions
-   purl: pkg:docker/library/composer
-   purl: pkg:github/composer/composer

Of these, the pkg:docker one is the relevant one. Say I have a PHP application that uses the official composer base image:

FROM composer:2.6.2
ADD . /src

If you were to build such a dockerfile, Syft would not include the version of composer in the SBOM, because Syft currently does not detect composer. The official composer dockerfile relies on a bash installer for composer, which drops a few binaries in the image. I've reported such issues in the past, but I believe the binary classifier can only get us so far.

In such a scenario, since the SBOM doesn't include it, the usage (potentially EOL) goes unnoticed and undetected.

However, if Syft were to report the base image used here (pkg:docker/library/composer@2.6.2), it would provide a secondary means of such detection.

tl;dr: Providing base images in the SBOM acts as a decent fallback, and includes important information (such as repository names, organization name, image version/tag) that is relevant to security teams.

noqcks commented 10 months ago

@captn3m0 can this issue be closed now after https://github.com/anchore/syft/issues/2267 has been merged, or did you have more in mind for this issue?

spiffcs commented 8 months ago

@captn3m0 What else do you have in mind? Now that we have the annotations do we want to try and build the base image "package" into the other formats? What's your end Ideal state for syft in how it surfaces base images now that #2267 has been merged?

For best results, so that consumers of the document can find the base image via relationships, we should use:

SPDX: https://spdx.github.io/spdx-spec/v2.3/relationships-between-SPDX-elements/

CycloneDX: https://cyclonedx.org/docs/1.5/json/#metadata_component

The other outstanding question is whether the annotations are the best source of truth for discovering this information. Can there be multiple images that would build the full chain from image:primary -> image:base1 -> image:base2 -> scratch?

The properties of the annotations also need more information to properly identify the image. ubuntu:xx.xx today can be different from ubuntu:xx.xx one month ago. We need both the digest and the version to pin down the exact image used.

captn3m0 commented 8 months ago

> What's your end Ideal state for syft in how it surfaces base images

A PURL that points to the correct base image. While #2294 is great, those are not components. Anything that is outside of the "components" part of the BOM will not get picked up by any other tooling.

Ideally, this would use the OCI PURL type, with the optional tag attribute (https://github.com/package-url/purl-spec/blob/master/PURL-TYPES.rst#oci).

I like the idea of using relationships to document this better, but I'm not sure which of the available relationships will work best here. Base images can be counted as build dependencies, composition primitives, or even ancestors. Hard to pick something that works best for all cases.

> Can there be multiple images that would build the full chain from image:primary -> image:base1 -> image:base2 -> scratch

Yes, this is another reason I'd prefer using components as well, since there the BOM could list all of the known base images (although finding them is a much harder problem).

> We need both the digest and the version to pin down the exact image used.

This should be solvable with oci PURLs. Sample PURL from the spec, that includes both digest and tag: pkg:oci/static@sha256%3A244fd47e07d10?repository_url=gcr.io/distroless/static&tag=latest
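As an illustration, the sample PURL above can be assembled from its parts as follows. This is a simplified sketch, not a full PURL implementation: `oci_purl` is a hypothetical helper, and it only percent-encodes the digest, leaving the qualifier values as they appear in the spec's sample.

```python
from typing import Optional
from urllib.parse import quote


def oci_purl(name: str, digest: str, repository_url: str,
             tag: Optional[str] = None) -> str:
    """Build a pkg:oci PURL carrying both the digest and (optionally) the tag."""
    # The digest is the PURL version; ':' must be percent-encoded as %3A.
    version = quote(digest, safe="")
    purl = f"pkg:oci/{name}@{version}?repository_url={repository_url}"
    if tag:
        # Qualifiers are kept in alphabetical order: repository_url, then tag.
        purl += f"&tag={tag}"
    return purl


print(oci_purl("static", "sha256:244fd47e07d10",
               "gcr.io/distroless/static", tag="latest"))
# pkg:oci/static@sha256%3A244fd47e07d10?repository_url=gcr.io/distroless/static&tag=latest
```

The digest pins the exact image bytes, while the tag qualifier preserves the human-readable version that the image was published under.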

spiffcs commented 1 month ago

I've added the blocked label to this. There is still no agreed-upon trusted source from which the base image's SBOM or package information can be accessed.

Annotations are not where the syft project wants to pull this data from, as that relies too heavily on user input when it comes to "trusting" what the contents of a given base are.

I've added needs discussion to this for our livestream this week so that the team can discuss the future of this:

https://youtube.com/live/T9OkSGu23j4?feature=share

wagoodman commented 1 month ago

Note for later: are there any OCI attestations for base images in Docker Hub that we could leverage here?

willmurphyscode commented 1 month ago

What we need is a mechanism for going from the layer digest of an image to the tag or tags that point at it. If such a data source existed, we'd be open to making Syft query it at runtime, similar to querying Maven Central to identify a JAR by its digest. However, right now, we don't know of such a data source.

The needs-investigation label means someone should go and look for a mechanism to, in effect, reverse the lookup that Docker does when it sees FROM node:lts-alpine3.19 and decides which bytes to download. It might be possible that this dataset exists somewhere, or that we can compute it.
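If such a dataset could be computed, the reverse lookup might look roughly like the sketch below. Everything here is an assumption for illustration: the `known` table, the function names, and the placeholder digests are hypothetical, and no existing data source or Syft API is implied. The idea is that a base image is any known tag whose complete, ordered layer list is a proper prefix of the scanned image's layers.

```python
from typing import Dict, List, Tuple


def index_by_layers(known: Dict[str, List[str]]) -> Dict[Tuple[str, ...], List[str]]:
    """Invert {tag: ordered layer digests} into {layer tuple: [tags]}."""
    index: Dict[Tuple[str, ...], List[str]] = {}
    for tag, layers in known.items():
        index.setdefault(tuple(layers), []).append(tag)
    return index


def find_base_images(image_layers: List[str],
                     index: Dict[Tuple[str, ...], List[str]]) -> List[str]:
    """Return tags whose full layer list is a proper prefix of image_layers,
    ordered from the closest base to the most distant one."""
    matches: List[str] = []
    for end in range(len(image_layers) - 1, 0, -1):
        matches.extend(index.get(tuple(image_layers[:end]), []))
    return matches


# Placeholder digests, not real ones.
known = {
    "alpine:3.16": ["sha256:aaa"],
    "amazoncorretto:8u342-alpine3.16-jre": ["sha256:aaa", "sha256:bbb"],
}
index = index_by_layers(known)
print(find_base_images(["sha256:aaa", "sha256:bbb", "sha256:ccc"], index))
# ['amazoncorretto:8u342-alpine3.16-jre', 'alpine:3.16']
```

Because tags are mutable, a real index would also need to record which digest each tag pointed at over time. This sketch ignores that and treats the table as a single snapshot.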

bureado commented 1 month ago

In case it helps the ongoing research:

  1. https://stackoverflow.com/a/67927907
  2. https://docs.docker.com/build/attestations/