anchore / syft

CLI tool and library for generating a Software Bill of Materials from container images and filesystems
Apache License 2.0

Provide a way to get the LayerID the package was first found in #435

Open jonathongardner opened 3 years ago

jonathongardner commented 3 years ago

A flag that would include the layerID the package first showed up in.

When tracking down a package (maybe because it has vulnerabilities, or because I'm not sure why it's in my SBOM), it would be helpful to know which layer it first showed up in so I can look at the commands run to generate that layer.

Currently it looks like the layerID returned under locations is the last layerID the path was touched in. So for example if I did something like:

# layer1
FROM alpine
# layer2
RUN apk add --update nodejs-current=15.10.0-r0

The package "busybox" version "1.32.1-r6" has a layerID of layer2, even though it comes from the base image in layer1. I know there is the "--scope all-layers" option, but that would also return any packages that were removed from the final image.

wagoodman commented 3 years ago

The work proposed in #32 aims to deduplicate packages so that the same package found in multiple locations would be listed as a single package with multiple entries in the .locations array on the artifact (in the json format). If this were implemented, the first item in the .locations array would answer the question of "where was the first instance of the particular package found".
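As a rough illustration of the proposal above, reading the first-seen layer off a deduplicated artifact might look like the following Python sketch. It assumes each artifact carries a locations array of path/layerID objects, and it assumes that array is ordered from earliest to latest layer, which this thread does not confirm. The helper name first_layer_id and the sample data are hypothetical.

```python
from typing import Optional


def first_layer_id(artifact: dict) -> Optional[str]:
    """Return the layerID of the first entry in the artifact's locations.

    Assumption: locations are ordered earliest layer first.
    """
    locations = artifact.get("locations") or []
    if not locations:
        return None
    return locations[0].get("layerID")


# Hypothetical artifact, shaped like one entry of a JSON "artifacts" list.
busybox = {
    "name": "busybox",
    "version": "1.32.1-r6",
    "locations": [
        {"path": "/lib/apk/db/installed", "layerID": "layer1"},
        {"path": "/lib/apk/db/installed", "layerID": "layer2"},
    ],
}

print(first_layer_id(busybox))
```

If the ordering assumption does not hold, the caller would need the image's layer order from elsewhere to pick the earliest entry.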

jonathongardner commented 2 years ago

@wagoodman I see https://github.com/anchore/syft/issues/32 was closed. I checked it out and it looks good, but I don't think it really solves this problem (though it's a little more helpful). The issue still exists: if I run with --scope all-layers, I can see the layer the package first shows up in (and now, because of the deduplication, it's in one package, which is somewhat helpful), but I still get packages that might not be in the final image (I can provide an example of this if needed). If I run without --scope all-layers, it still only returns the last layer the component was touched in. For deb/alpine packages that's confusing, because the package manager DB is touched whenever I do an install, so it's always the last layer in which I did a package manager install.

Right now, what I'm having to do to get around this is run syft with --scope squashed, then create an array of package IDs (so I know which packages are in the final image), then run syft with --scope all-layers and filter out packages not in the package IDs array.
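For anyone wanting to reproduce that workaround, here is a minimal Python sketch of the filtering step. It works on two already-parsed JSON reports (one from a --scope squashed run, one from a --scope all-layers run) and assumes each has an artifacts list whose entries have an id field. It also assumes package ids match across the two scopes, as the workaround implies; the key parameter lets you match on something else if they do not. The function name and sample data are made up for illustration.

```python
def packages_in_final_image(squashed: dict, all_layers: dict,
                            key=lambda artifact: artifact["id"]) -> list:
    """Keep only all-layers artifacts that also appear in the squashed report."""
    keep = {key(a) for a in squashed.get("artifacts", [])}
    return [a for a in all_layers.get("artifacts", []) if key(a) in keep]


# Hypothetical, heavily trimmed reports.
squashed_report = {"artifacts": [
    {"id": "p1", "name": "busybox", "locations": [{"layerID": "layer2"}]},
    {"id": "p2", "name": "nodejs-current", "locations": [{"layerID": "layer2"}]},
]}
all_layers_report = {"artifacts": [
    {"id": "p1", "name": "busybox",
     "locations": [{"layerID": "layer1"}, {"layerID": "layer2"}]},
    {"id": "p2", "name": "nodejs-current", "locations": [{"layerID": "layer2"}]},
    # Present in an earlier layer only, so absent from the squashed report.
    {"id": "p3", "name": "removed-pkg", "locations": [{"layerID": "layer1"}]},
]}

for artifact in packages_in_final_image(squashed_report, all_layers_report):
    layers = [loc["layerID"] for loc in artifact["locations"]]
    print(artifact["name"], layers)
```

The result keeps the all-layers location lists (so the earliest layer is visible) while dropping packages that did not survive into the final image.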

wagoodman commented 1 year ago

I think there is a path forward on this one. We would need to create a new image-based FileResolver that would act a little like both the squashed resolver and the all-layers resolver. The squashed resolver returns a location for all paths in the squashed representation. The all-layers resolver returns one or more locations for all paths in all layers.

We really want something that would return all locations from all layers for all paths in the squashed representation. In this way the catalogers would have visibility into all places where the file was introduced/changed and the existing downstream package merging logic would account for packages that are the same and found in the same path across multiple layers.

This could be selectable by a new scope like --scope squashed-with-all-layers (a terrible name, but just as an example).

From an implementation point of view, this would look an awful lot like the existing all-layers resolver today with an additional filtering step based on a query to the squashed representation. The catalogers would catalog all location instances, raising up duplicates, and the set of duplicates would be merged. The single merged package would have pkg.Locations populated with all layers in which the package definitions were found.
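To make the "all-layers plus a squashed filter" idea concrete, here is a toy Python model. It is not syft's resolver code: layers are plain dictionaries mapping paths to contents, a value of None stands in for a whiteout (deletion), and every name below is hypothetical.

```python
WHITEOUT = None  # stand-in for a file deleted in a layer


def squashed_paths(layers):
    """Paths that exist after applying every layer in order."""
    present = set()
    for _layer_id, files in layers:
        for path, content in files.items():
            if content is WHITEOUT:
                present.discard(path)
            else:
                present.add(path)
    return present


def squashed_with_all_layers(layers):
    """All (path, layer_id) locations from all layers, limited to
    paths that are present in the squashed representation."""
    keep = squashed_paths(layers)
    return [
        (path, layer_id)
        for layer_id, files in layers
        for path, content in files.items()
        if content is not WHITEOUT and path in keep
    ]


layers = [
    ("layer1", {"/lib/apk/db/installed": "busybox", "/tmp/build.lock": "x"}),
    ("layer2", {"/lib/apk/db/installed": "busybox nodejs", "/tmp/build.lock": WHITEOUT}),
]

for location in squashed_with_all_layers(layers):
    print(location)
```

In this model the package database is reported once per layer that wrote it, while /tmp/build.lock is dropped entirely because it is gone from the final image.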

This means that for a dpkg package that was added in layer 1, with other packages installed in later layers, the shared database would cause a location to be added to the package for every layer in which the database file was modified, from the starting layer (when the package was installed) onward. This case is a little awkward, but it is accurate relative to what syft understands about the package, and seems like a good first step.
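The shared-database case can be seen in a small Python sketch. It assumes each layer that modifies the package database yields a full snapshot of installed packages, and that identical packages are merged by name and version; the package names, versions, and the merge_packages helper are invented for the example and are not syft's merging logic.

```python
def merge_packages(db_snapshots):
    """Merge identical packages found across layers.

    db_snapshots: list of (layer_id, [(name, version), ...]) in layer order.
    Returns a dict mapping (name, version) to the layers it was found in.
    """
    merged = {}
    for layer_id, packages in db_snapshots:
        for pkg in packages:
            merged.setdefault(pkg, []).append(layer_id)
    return merged


# Hypothetical snapshots of a shared package database, one per modifying layer.
snapshots = [
    ("layer1", [("libc6", "2.31")]),
    ("layer2", [("libc6", "2.31"), ("curl", "7.74")]),
    ("layer3", [("libc6", "2.31"), ("curl", "7.74"), ("git", "2.30")]),
]

for pkg, found_in in merge_packages(snapshots).items():
    print(pkg, found_in)
```

Here libc6 picks up a location for every layer that rewrote the database, but because the snapshots are processed in layer order, the first entry in each list is still the layer where the package first appeared.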

tomerse-sg commented 6 months ago

What is the status of this request? It could be very useful :)

tomersein commented 1 month ago

Please look at this PR: https://github.com/anchore/syft/pull/3138