anchore / syft

CLI tool and library for generating a Software Bill of Materials from container images and filesystems
Apache License 2.0
6.12k stars 563 forks source link

Supply "depth" information when including relationships #3010

Open kzantow opened 3 months ago

kzantow commented 3 months ago

What would you like to be added: Relationship depth information, when Syft is unable to provide a full transitive dependency graph.

Why is this needed: One of the data elements mentioned in the NTIA minimum requirements is the depth of relationships. If Syft is able to build an accurate SBOM with a full transitive-dependency graph, that would be ideal, but different scenarios prevent this information from being included or accurately depicting the transitive graph. Some examples are Python requirements.txt and Go binary mod information, which only provide a flat list of dependencies. Or binaries which are only directly identified without dependent component information.

One solution is to provide an "unknown" indicator that Syft was unable to determine a full transitive dependency graph, or Syft stopped after 5-levels deep resolving online parent references. These can be returned as "unknowns" from catalogers where appropriate to be associated with the file(s) where package graph information originated.

Additional context: This is likely to be dependent the PR for known unknowns getting merged.

This is a part of #632

wagoodman commented 3 weeks ago

Related to #572

We want to be able to describe the topology and limitations of any dependency graph that an SBOM is producing. This isn't based on the SBOM as a whole, a language or packaging ecosystem, but really a package at a time based on the evidence we found and what we know about the kind of files that make up that evidence (e.g. package.json vs package-lock.json provide different answers here, which also differ when there is the existence of a populated node_modules dir from a previously run npm install command).

I feel that on a per-package bases we're looking for the following description:

So how should we start expressing these topologies? I have an early/incomplete thought about a new field onto the pkg.Package called dependency with the following subfields:

One question that comes to mind: what about cases where we can partition nodes into direct/indirect dependencies but it is still a flat list (like go.mod)? We can only say all/flat but it's still valuable to know which of these nodes are indirect. Does this mean we should add additional dependency information onto the edge itself? (in which case this is a non-point)

While I'm not sold on the specifics of the field, I think I'm becoming more convinced that describing the node and edge qualities separately is more valuable then attempting to combine them into a single enum field.

Another consideration is that there are nodes in the graph that cross ecosystems, combining nodes making up dependency graphs in one ecosystem with another dependency graph for another ecosystem. One example of this is with binary packages: these may relate to any number of other ecosystems based on file ownership overlap and dynamic imports (and soon dlopen descriptions) from that binary. So it may not be as simple as having an ecosystem cataloger make a claim on a package about it's node/edge/capability conclusion... this may additionally be a post-cataloging analysis that further annotates these qualities based on the final graph captured.

Thoughts to be continued in another post soon...

wagoodman commented 3 weeks ago

From a discussion with the team on this one, we nudged this into a different direction. The conclusive point of discussion was: when asking a single package node information about dependencies it shouldn't attempt to answer anything outside it's immediate dependencies. That is, asking a node to describe the graph isn't really correct. We should instead limit the answer to only the immediate part of the graph that the node is privy to.

This somewhat eliminates the need to describe edges in such depth. The current suggestion from the team is to have a single dependencies field with the following possible enum values:

Furthermore, to open back up a conversation from #572, we should be qualifying edges that are known direct dependencies vs are known transitive (indirect) dependencies. In the common case of direct dependencies, using the dependency-of relationship type is what we should continue to use. However, we should not use this relationship type when describing dependencies that are NOT direct dependencies --another type should be created for this purpose.

kzantow commented 3 weeks ago

I'm not sure why I hadn't looked this up before, but I should also note the related SPDX 3 field: https://spdx.github.io/spdx-spec/v3.0.1/model/Core/Vocabularies/RelationshipCompleteness/. This is defined on a one-to-many relationship element and isn't exactly the same thing as we were talking about but is very closely related, I think.