Open jonathongardner opened 3 years ago
The work proposed in #32 aims to deduplicate packages so that the same package found in multiple locations is listed as a single package with multiple entries in the .locations array on the artifact (in the JSON format). If this were implemented, the first item in the .locations array would answer the question "where was the first instance of this particular package found?"
@wagoodman I see https://github.com/anchore/syft/issues/32 was closed. I checked it out and it looks good, but I don't think it really solves this problem (though it is a little more helpful). The issue still exists: if I run with --scope all-layers, I can see the layer the package first shows up in (and now, because of the deduplication, it is all in one package, which is somewhat helpful), but I still get packages that might not be in the final image (I can provide an example of this if needed). If I run without --scope all-layers, it still only returns the last layer the component was touched in. For deb/alpine packages that is confusing, because the package manager DB is touched whenever I do an install, so it is always the last layer in which I ran a package manager install.
Right now, to get around this, I run syft with --scope squashed and build an array of package IDs (so I know which packages are in the final image), then run syft with --scope all-layers and filter out any packages whose IDs are not in that array.
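The workaround above can be sketched as a small post-processing step. This is only an illustration: it assumes both scans were saved as syft JSON documents (for example with `-o json`), that each document has a top-level `artifacts` array whose entries carry an `id` field, and that package IDs match across the two scopes, as the comment describes. The function name `filter_to_final_image` is hypothetical.

```python
from typing import Any, Dict, List


def filter_to_final_image(
    squashed_doc: Dict[str, Any], all_layers_doc: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Keep only the all-layers packages whose id also appears in the squashed scan.

    Assumes both arguments are parsed syft JSON documents with an "artifacts" list.
    """
    # IDs of packages that exist in the final (squashed) image.
    final_ids = {pkg["id"] for pkg in squashed_doc.get("artifacts", [])}
    # Drop all-layers packages that were removed before the final image.
    return [
        pkg for pkg in all_layers_doc.get("artifacts", []) if pkg["id"] in final_ids
    ]
```

In practice the two documents would be loaded with `json.load` from the output of each scan, and the filtered list keeps the all-layers location data for packages that survive into the final image.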
I think there is a path forward on this one. We would need to create a new image-based FileResolver that would act a little like both the squashed resolver and the all-layers resolver. The squashed resolver returns a single location for each path in the squashed representation. The all-layers resolver returns one or more locations for each path across all layers.
We really want something that would return all locations from all layers for all paths in the squashed representation. In this way the catalogers would have visibility into all places where the file was introduced/changed and the existing downstream package merging logic would account for packages that are the same and found in the same path across multiple layers.
This could be selectable by a new scope like --scope squashed-with-all-layers (a terrible name, but just as an example).
From an implementation point of view, this would look an awful lot like the existing all-layers resolver today, with an additional filtering step based on a query to the squashed representation. The catalogers would catalog all location instances, raising up duplicates, and the set of duplicates would be merged. The single merged package would have pkg.Locations populated with every layer in which the package definition was found.
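A minimal sketch of that filtering idea, using a toy model rather than syft's actual resolver types: each layer is a dictionary from path to content, with `None` marking a deletion, and a location is a `(layer index, path)` pair. All names here are hypothetical and only illustrate "all-layers locations, filtered by the squashed view".

```python
from typing import Dict, List, Optional, Set, Tuple

Layer = Dict[str, Optional[str]]  # path -> content; None marks a deletion
Location = Tuple[int, str]        # (layer index, path)


def squashed_paths(layers: List[Layer]) -> Set[str]:
    """Paths that exist after applying every layer in order."""
    present: Set[str] = set()
    for layer in layers:
        for path, content in layer.items():
            if content is None:
                present.discard(path)
            else:
                present.add(path)
    return present


def all_layer_locations(layers: List[Layer]) -> List[Location]:
    """Every (layer, path) where a file was added or changed."""
    return [
        (index, path)
        for index, layer in enumerate(layers)
        for path, content in layer.items()
        if content is not None
    ]


def squashed_with_all_layers(layers: List[Layer]) -> List[Location]:
    """All-layers locations, restricted to paths present in the squashed view."""
    final = squashed_paths(layers)
    return [loc for loc in all_layer_locations(layers) if loc[1] in final]


layers = [
    {"/var/lib/dpkg/status": "pkgA", "/tmp/build": "x"},
    {"/var/lib/dpkg/status": "pkgA pkgB"},
    {"/tmp/build": None},
]
print(squashed_with_all_layers(layers))
# [(0, '/var/lib/dpkg/status'), (1, '/var/lib/dpkg/status')]
```

The deleted `/tmp/build` never reaches the catalogers, while the database file is reported once for each layer that touched it.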
This means that for a dpkg package that was added in layer 1, with other packages installed in later layers, the shared database would cause a location to be added to the package for every layer in which the database file was modified, from the layer where the package was installed onward. This case is a little awkward, but it is accurate relative to what syft understands about the package, and seems like a good first step.
What is the status of this request? It would be very useful :)
Please look at this PR: https://github.com/anchore/syft/pull/3138
A flag that would include the layerID the package first showed up in.
When tracking down a package (maybe because it has vulnerabilities, or because I'm not sure why it's in my SBOM), it would be helpful to know which layer it first showed up in so I can look at the commands run to generate that layer.
Currently it looks like the layerID returned under locations is the last layerID the path was touched in. So for example if I did something like:
The package "busybox" version "1.32.1-r6" has a layerID of layer2. I know there is the "--scope all-layers" option, but that would also return any packages that were removed from the final image.