Supply "depth" information when including relationships

kzantow commented 3 months ago

What would you like to be added: Relationship depth information, when Syft is unable to provide a full transitive dependency graph.

Why is this needed: One of the data elements mentioned in the NTIA minimum requirements is the depth of relationships. If Syft is able to build an accurate SBOM with a full transitive-dependency graph, that would be ideal, but different scenarios prevent this information from being included or from accurately depicting the transitive graph. Some examples are Python requirements.txt and Go binary mod information, which only provide a flat list of dependencies, and binaries that are only directly identified, without dependent component information.

One solution is to provide an "unknown" indicator that Syft was unable to determine a full transitive dependency graph, or that Syft stopped resolving online parent references after going 5 levels deep. These can be returned as "unknowns" from catalogers where appropriate, to be associated with the file(s) where the package graph information originated.

Additional context: This is likely to be dependent on the PR for known unknowns getting merged.

This is a part of #632

wagoodman commented 3 weeks ago

Related to #572

We want to be able to describe the topology and limitations of any dependency graph that an SBOM is producing. This isn't based on the SBOM as a whole, a language or packaging ecosystem, but really a package at a time based on the evidence we found and what we know about the kind of files that make up that evidence (e.g. package.json vs package-lock.json provide different answers here, which also differ when there is the existence of a populated node_modules dir from a previously run npm install command).

I feel that on a per-package basis we're looking for the following description:

- From a capability perspective: Could we capture dependency information or not?
- From a node-quality perspective: If we could capture dependencies, to what extent depth-wise do we have the node information? That is, maybe we only have direct dependencies captured (partial), or we have all indirect dependencies listed as well (full).
- From an edge-quality perspective: If we have full dependencies captured for a package, what is the quality of the relationships between these nodes? In some cases we only have a simple listing of dependencies with no real relationships (e.g. we know A and B are direct deps, and C and D are indirect deps, but we don't know by which means C and D were included [was it A? B? or both?]). Sometimes we can partition direct dependencies from transitive/indirect ones, other times we can't. Sometimes we have all direct dependency information for all nodes in the graph, and thus can clearly describe all the ways your application depends on any dependency (e.g. there are 13 paths in the dependency graph that reach dep node D).

So how should we start expressing these topologies? I have an early/incomplete thought about a new field onto the pkg.Package called dependency with the following subfields:

nodes: with possible values...
- unknown: no distinction is made about if we're able to find any package dependencies
- direct-only: partial set of nodes, only describing direct dependencies
- all: all direct and indirect nodes are described
edges: with possible values...
- unknown: no distinction is made about if we're able to find any information about how dependencies are related to one another
- flat: nodes have relationships between both direct and indirect dependencies; cannot distinguish between direct and indirect dependencies
- all: nodes have relationships between themselves and only direct dependencies
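
As a rough sketch only, the two subfields above could be modeled as a pair of string enums. Every type, constant, and function name below is hypothetical and is not existing Syft API; the example value is the all/flat combination discussed in this thread.

```go
package main

import "fmt"

// NodeCompleteness is a hypothetical enum for the proposed "nodes" subfield.
type NodeCompleteness string

const (
	NodesUnknown    NodeCompleteness = "unknown"     // no claim about whether dependencies were found
	NodesDirectOnly NodeCompleteness = "direct-only" // partial set: direct dependencies only
	NodesAll        NodeCompleteness = "all"         // all direct and indirect nodes are described
)

// EdgeCompleteness is a hypothetical enum for the proposed "edges" subfield.
type EdgeCompleteness string

const (
	EdgesUnknown EdgeCompleteness = "unknown" // no claim about how dependencies relate
	EdgesFlat    EdgeCompleteness = "flat"    // direct and indirect edges are mixed together
	EdgesAll     EdgeCompleteness = "all"     // each node relates only to its direct dependencies
)

// DependencyInfo is an assumed shape for the proposed "dependency" field.
// It is illustrative and does not exist on pkg.Package.
type DependencyInfo struct {
	Nodes NodeCompleteness `json:"nodes"`
	Edges EdgeCompleteness `json:"edges"`
}

// describe renders the two qualities side by side.
func describe(d DependencyInfo) string {
	return fmt.Sprintf("nodes=%s edges=%s", d.Nodes, d.Edges)
}

func main() {
	// All nodes are listed, but the relationships are a flat list.
	flat := DependencyInfo{Nodes: NodesAll, Edges: EdgesFlat}
	fmt.Println(describe(flat))
}
```

Keeping the two enums separate means a cataloger can state what it knows about nodes without also having to make a claim about edges.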

One question that comes to mind: what about cases where we can partition nodes into direct/indirect dependencies but it is still a flat list (like go.mod)? We can only say all/flat but it's still valuable to know which of these nodes are indirect. Does this mean we should add additional dependency information onto the edge itself? (in which case this is a non-point)

While I'm not sold on the specifics of the field, I think I'm becoming more convinced that describing the node and edge qualities separately is more valuable than attempting to combine them into a single enum field.

Another consideration is that there are nodes in the graph that cross ecosystems, combining nodes making up dependency graphs in one ecosystem with another dependency graph for another ecosystem. One example of this is with binary packages: these may relate to any number of other ecosystems based on file ownership overlap and dynamic imports (and soon dlopen descriptions) from that binary. So it may not be as simple as having an ecosystem cataloger make a claim on a package about its node/edge/capability conclusion... this may additionally be a post-cataloging analysis that further annotates these qualities based on the final graph captured.

Thoughts to be continued in another post soon...

wagoodman commented 3 weeks ago

From a discussion with the team on this one, we nudged this into a different direction. The conclusive point of discussion was: when asking a single package node for information about dependencies, it shouldn't attempt to answer anything outside its immediate dependencies. That is, asking a node to describe the graph isn't really correct. We should instead limit the answer to only the immediate part of the graph that the node is privy to.

This somewhat eliminates the need to describe edges in such depth. The current suggestion from the team is to have a single dependencies field with the following possible enum values:

- unknown: no distinction is made about if we're able to find any package dependencies
- complete-direct-only: the full set of direct dependencies is enumerated
- complete-transitive: the full set of direct and indirect dependencies (mixed) is enumerated
- incomplete: a partial set of dependencies is enumerated (with no distinction about if they are direct or indirect)
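
A minimal sketch of how these four values might look as a Go enum follows. The type name, constant names, and parse helper are all hypothetical, and falling back to "unknown" for unrecognized input is an assumption of this sketch rather than something the team decided.

```go
package main

import "fmt"

// DependencyCompleteness is a hypothetical type for the proposed single
// "dependencies" field.
type DependencyCompleteness string

const (
	DependenciesUnknown    DependencyCompleteness = "unknown"
	CompleteDirectOnly     DependencyCompleteness = "complete-direct-only"
	CompleteTransitive     DependencyCompleteness = "complete-transitive"
	DependenciesIncomplete DependencyCompleteness = "incomplete"
)

// ParseDependencyCompleteness maps a raw string to one of the four values.
// The boolean reports whether the input was recognized; unrecognized input
// falls back to "unknown" (an assumption made for this sketch).
func ParseDependencyCompleteness(s string) (DependencyCompleteness, bool) {
	switch v := DependencyCompleteness(s); v {
	case DependenciesUnknown, CompleteDirectOnly, CompleteTransitive, DependenciesIncomplete:
		return v, true
	default:
		return DependenciesUnknown, false
	}
}

func main() {
	for _, raw := range []string{"complete-direct-only", "partial"} {
		v, ok := ParseDependencyCompleteness(raw)
		fmt.Printf("%q -> %s (recognized: %t)\n", raw, v, ok)
	}
}
```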

Furthermore, to open back up a conversation from #572, we should be distinguishing edges that are known direct dependencies from those that are known transitive (indirect) dependencies. In the common case of direct dependencies, we should continue to use the dependency-of relationship type. However, we should not use this relationship type when describing dependencies that are NOT direct dependencies; another type should be created for this purpose.
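
To illustrate the split between direct and indirect edges, here is a self-contained sketch. The dependency-of value comes from this thread; the second relationship type has not been named yet, so "indirect-dependency-of" is purely a placeholder, as are the struct and helper.

```go
package main

import "fmt"

// RelationshipType is a local stand-in type for this sketch.
type RelationshipType string

const (
	// DependencyOf is the existing type, reserved for direct dependencies.
	DependencyOf RelationshipType = "dependency-of"
	// IndirectDependencyOf is a hypothetical placeholder name; no name for
	// the new type has been chosen in this thread.
	IndirectDependencyOf RelationshipType = "indirect-dependency-of"
)

// Relationship is a simplified edge that reads "From is a <Type> To".
type Relationship struct {
	From string
	To   string
	Type RelationshipType
}

// relate builds an edge from a dependency to its dependent, choosing the
// relationship type based on whether the dependency is known to be direct.
func relate(dep, dependent string, direct bool) Relationship {
	t := IndirectDependencyOf
	if direct {
		t = DependencyOf
	}
	return Relationship{From: dep, To: dependent, Type: t}
}

func main() {
	edges := []Relationship{
		relate("A", "app", true),  // A is a known direct dependency
		relate("C", "app", false), // C is known to be indirect
	}
	for _, r := range edges {
		fmt.Printf("%s -[%s]-> %s\n", r.From, r.Type, r.To)
	}
}
```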

kzantow commented 3 weeks ago

I'm not sure why I hadn't looked this up before, but I should also note the related SPDX 3 field: https://spdx.github.io/spdx-spec/v3.0.1/model/Core/Vocabularies/RelationshipCompleteness/. This is defined on a one-to-many relationship element and isn't exactly the same thing as we were talking about but is very closely related, I think.

anchore / syft

Supply "depth" information when including relationships #3010