wagoodman opened this issue 3 years ago
@wagoodman I started to play with the package dependencies for CycloneDX: https://github.com/hectorj2f/syft/tree/hectorj2f/add_dependencies_to_cyclonedx. I am only generating the dependencies for the components, as the CycloneDX format recommends. Let me know if you'd prefer me to open a PR for that.
Some detail here regarding which ecosystems this will be feasible for in a static-analysis sense (i.e., without reaching out to external data sources such as Maven Central).
SPDX 2.2 relationships are used to describe what will be added to the artifact package in terms of new relationship types. The relationships used in the breakdown below are:

- RUNTIME_DEPENDENCY_OF
- DEV_DEPENDENCY_OF
- BUILD_DEPENDENCY_OF
- DEPENDENCY_OF

Question: we might not be able to accurately determine build-vs-runtime dependencies given a lack of context (e.g. Python). Should we just use DEPENDENCY_OF instead in these cases? (final answer: yes)
**APK (Alpine)**
Summary: direct runtime dependencies.
The `D:` field lists pull dependencies, which is a space-delimited list of package names and categorized dependencies, e.g. `D:scanelf so:libc.musl-x86_64.so.1`.
Relationships:
- RUNTIME_DEPENDENCY_OF for any packages listed as pull dependencies

Question: what about `so:` dependencies that are found by the resolver?

**dpkg (Debian)**
Summary: direct runtime dependencies.
The `Depends` and `Pre-Depends` fields hold information about dependencies: "Both depends and pre-depends mention the dependencies a package needs before installing, but pre-depends forces the installation and configuration of the dependency packages before even starting with the package that needs the dependencies" (source).
Relationships:
- RUNTIME_DEPENDENCY_OF for any packages listed as dependencies

**go.mod**
Summary: flat subset of transitive build dependencies.
Relationships:
- BUILD_DEPENDENCY_OF for any package listed in the go.mod

**Go binary (buildinfo)**
Summary: flat transitive build dependencies.
Relationships:
- BUILD_DEPENDENCY_OF for any package listed in the binary buildinfo section

**pom.xml (Java)**
Summary: flat direct build dependencies.
The `<dependencies>` section describes direct build dependencies.
Relationships:
- BUILD_DEPENDENCY_OF for any package listed in the dependencies section

Does not contain dependency information.

**yarn.lock**
Summary: flat pseudo-transitive runtime dependency pins.
Relationships:
- RUNTIME_DEPENDENCY_OF for any packages listed in the yarn.lock

**package-lock.json**
Summary: transitive dependency pins with a full dependency-to-dependency graph.
Relationships:
- RUNTIME_DEPENDENCY_OF for any packages listed as dependencies

**package.json**
Summary: flat direct runtime and dev dependency version ranges.
Relationships:
- RUNTIME_DEPENDENCY_OF for any packages listed as dependencies
- DEV_DEPENDENCY_OF for any packages listed in devDependencies

Summary: direct dependency version pins.
Relationships:
- RUNTIME_DEPENDENCY_OF for any packages listed as dependencies
- DEV_DEPENDENCY_OF for any packages listed in devDependencies

Summary: no relationships possible.
Relationships: none.

**poetry.lock**
Summary: flat-transitive dev and runtime dependency relationships (not a full transitive dependency graph).
Relationships:
- DEV_DEPENDENCY_OF for packages with a "dev" category
- BUILD_DEPENDENCY_OF for packages with a "main" category (question: should this be RUNTIME_DEPENDENCY_OF, since these dependencies are not required in the Python compile step for generating .pyc files?) We cannot really distinguish these in all cases, so it is safer to use DEPENDENCY_OF.

**Pipfile.lock**
Summary: flat-transitive dev and runtime dependency relationships.
Relationships:
- DEV_DEPENDENCY_OF for packages with a "develop" category
- BUILD_DEPENDENCY_OF for packages with a "default" category (question: should this be RUNTIME_DEPENDENCY_OF, since these dependencies are not required in the Python compile step for generating .pyc files?) We cannot really distinguish these in all cases, so it is safer to use DEPENDENCY_OF.

Does not describe any relationships.
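As a rough illustration of how the APK `D:` field described above could be split into plain package names versus categorized entries such as `so:` shared objects, here is a hedged sketch (the `Dep` type and `parsePullDeps` helper are hypothetical examples, not syft code):

```go
package main

import (
	"fmt"
	"strings"
)

// Dep distinguishes plain package names from categorized dependencies
// such as so: (shared object) entries in an APK "D:" field.
type Dep struct {
	Category string // "" for a plain package name, otherwise e.g. "so"
	Name     string
}

// parsePullDeps splits an APK "D:" value, which is a space-delimited
// list of package names and categorized dependencies.
func parsePullDeps(d string) []Dep {
	var deps []Dep
	for _, field := range strings.Fields(d) {
		if cat, name, ok := strings.Cut(field, ":"); ok {
			deps = append(deps, Dep{Category: cat, Name: name})
		} else {
			deps = append(deps, Dep{Name: field})
		}
	}
	return deps
}

func main() {
	// The example value from the breakdown above.
	for _, dep := range parsePullDeps("scanelf so:libc.musl-x86_64.so.1") {
		fmt.Printf("%+v\n", dep)
	}
}
```

Under this sketch, plain names like `scanelf` would map to RUNTIME_DEPENDENCY_OF edges, while the open question above is what to do with the `so:` entries.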
Can you elaborate more on the "why" for this feature? From my reading, what you are trying to do is determine why a package foo of version bar made it into the thing you are scanning with syft.
If so, I fear dumping the dependency tree might not answer that question. Parsing the package manager operations log can approximate an answer, but the only deterministic way I’m aware of to do this is to perform actual process introspection during the image build to know exactly what ended up calling e.g. dpkg -i over a file on disk.
Conversely, with a purl that is differentiated enough you can augment the syft output with dependencies and much more metadata that is publicly known and available. If syft says nano version 1.2 is in this Ubuntu container of release foo, anyone can readily obtain the dependencies of that package from public sources.
Don’t get me wrong, I’m a fan of taking as much primary-source data from the package manager in the scanned instance as possible. And I think flat SBOMs can be limited in many scenarios (log4j being just the latest widely covered one). But I think how the feature surfaces, what it tries to solve, and how it changes the syft experience for people who are expecting a flat output might be worth additional consideration.
Thank you for working on syft and for helping syft users and the industry realize better outcomes through a thoughtful approach to the existing package manager metadata.
Adding one thing to my comment above. It’s possible that the “why” for this feature is not “what other binary depended on this binary/made this binary materialize” but more of a transitive “what other software was needed to make this binary that then went into my image”, and that’s where Build-Depends and Built-Using (in the case of dpkg) would be more useful; the in-artifact package manager metadata might not contain that information. Ideally, that information would travel in each package's own SBOM, but in practice the trend seems to be that this metadata will live in publicly queryable services. Meaning that maybe this augmentation of syft output could be a post-analyze stage?
@bureado thanks for your thoughts on this; we chatted a lot about it at a recent community meeting and internally as well. I wanted to surface some of those conversations here in the issue.
Why do this feature? That's a fair question, and one that we've been exploring before trying to take it on. Squarely put, a list of packages without how they relate cannot answer questions about what introduced a package into the artifact.
Take log4j for example: knowing that you have it installed is very useful, but if your intent is to remove it you need to know how it got introduced. Is it a direct dependency of your application? Did another package bring it in? Maybe both? It happens that for Java packages the syft `pkg.metadata.virtualPath` is a good indicator for some of this, but it's heavily encoded... and an equivalent field isn't present for all ecosystems. Bringing in relationships to raise up common descriptions of what is in the underlying data makes sense in this case.
The same can be said for vulnerability analysis. I may see that I'm vulnerable to CVE-X-Y for a package; when combining this with VEX information in the future (which can indicate the applicability of a CVE from the publisher's perspective), knowing through which path in the dependency tree the vulnerability was matched starts to matter... and this is only achievable by knowing the relationships between packages.
External data has richer relationship information. This is generally (nearly universally) true. Many ecosystems don't express full connectivity information between packages; however, their public repositories (e.g. PyPI, Maven Central, rubygems.org, etc.) have this information, and with some external querying you can get a better understanding of package-to-package relationships.
Sometime in the near future we want to add features that allow syft to leverage external data in an opt-in capacity. However, we do have enough raw information from the underlying artifact to convey package-to-package connectivity in most ecosystems (and we're trying to be upfront about the limitations for each ecosystem in https://github.com/anchore/syft/issues/572#issuecomment-1000412163).
Does the existence of better connectivity data externally indicate that we should not express package-to-package relationships? Or that we should hold off until we do have this ability to query external sources? My take is that we can introduce this feature but allow for configurability of it (be able to change behavior or the source of this connectivity information, or turn it off altogether).
> But I think how the feature surfaces, what it tries to solve and how it changes the syft experience for people that are expecting a flat output might be worth additional consideration.
I 100% agree with this. We still want to provide a flat list of packages, so no change there. This would add additional elements in the `relationships` section of the SBOM. If it's the sheer number of additional relationships that would be the problem, then that further points to having configuration to turn off or augment this functionality.
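For illustration, a single entry in the `relationships` array of the syft JSON output looks roughly like this (a sketch; the IDs are placeholders, and the exact field set may vary by syft version):

```json
{
  "relationships": [
    {
      "parent": "a1b2c3d4e5f60718",
      "child": "f6e5d4c3b2a10817",
      "type": "dependency-of"
    }
  ]
}
```

Consumers who expect a flat package list can simply ignore this section; the `artifacts` list itself is unchanged.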
Sorry for the radio silence on this @bureado , but happy to continue chatting about this.
from refinement:
We'd love for this to be supported! How far along is this on the roadmap? At the moment, I cannot make it work.
+1, would love to see this feature in syft. Is it on the roadmap?
@wagoodman

> External data has richer relationship information

This is something that is easily available now for public use (https://deps.dev). Are there any plans to incorporate it?
Ditto: I find that this feature would be incredibly helpful, particularly when using tools like DependencyTrack to visualize the dependency graph. Trivy already has support for maintaining dependency relationships.
@wagoodman so I was looking at the parsing of Java archives, in the context of an effort to think about VEX document hierarchies and CycloneDX over a particular dataset of containers.
As far as I can tell, syft currently doesn't provide any package-to-package "Relationship" information from Java archive parsing. The archive parser recursively takes a known Java archive object and checks what's inside based on the manifest files; anecdotally, the archive parser seems to be what's most commonly invoked when handed a production container running Java. But there certainly IS a relationship if you are only reporting on the presence of one library because it was shipped inside the archive of another.
While opinions vary, generally from an SBOM perspective when we talk about a "dependency" we mean "if there's a problem with this, there may be a problem with the thing depending on it", or, for use cases about how it was brought in, as discussed elsewhere. And in THAT sense, the hierarchical information derived from archive parsing seems like valid dependencies, even if you don't go to the next level of sorting out the pom files. That doesn't mean that the extra compile-scope issues in the pom couldn't be relevant. But knowing, when processing an SBOM, that the issue reported in jc-core is there because that's a library inside the netty-common uberjar... is actually pretty valuable.
Changing syft to output the hierarchy when extracting from Java archives isn't that hard. I could maybe PR it (I built a POC after I found issue #1972, because I needed an example of Maven for my purposes). Then you get into the different TYPES of relationships: should this be dependencyOf or Contains...
One thing I do think about is that from a CycloneDX perspective, I would be inclined to say that any package-to-package relationship counts as a "dependency" for its purposes. Although anecdotally, in terms of current syft output, this seems to mostly arise in OS packages containing library package types such as Python, etc. Anything that makes SBOMs less flat is good for a variety of use cases.
As a note, processing npm seems a bit harder within the current code framework. Right now the standard behavior of the npm cataloger is to parse a package.json to retrieve a single package, so, as I understand the code architecture, to get relationships to correctly reference one bom-ref from another you'd need to collect the list of all npm packages at the end of the run and then process the dependencies?
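To sketch what such an end-of-run pass could consume, here is a hedged example (the `packageJSON` struct and `dependencyEdges` helper are hypothetical, not syft's actual cataloger code): it parses one package.json into (parent, child) name pairs that a later pass could resolve to bom-refs once all packages are known:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// packageJSON captures just the package.json fields needed for edges.
type packageJSON struct {
	Name            string            `json:"name"`
	Dependencies    map[string]string `json:"dependencies"`
	DevDependencies map[string]string `json:"devDependencies"`
}

// dependencyEdges returns (parent, child) name pairs; resolving names to
// bom-refs would happen in a post-cataloging pass at the end of the run.
func dependencyEdges(raw []byte) ([][2]string, error) {
	var pj packageJSON
	if err := json.Unmarshal(raw, &pj); err != nil {
		return nil, err
	}
	var edges [][2]string
	for dep := range pj.Dependencies {
		edges = append(edges, [2]string{pj.Name, dep})
	}
	for dep := range pj.DevDependencies {
		edges = append(edges, [2]string{pj.Name, dep})
	}
	return edges, nil
}

func main() {
	raw := []byte(`{"name":"app","dependencies":{"left-pad":"^1.3.0"},"devDependencies":{"jest":"^29.0.0"}}`)
	edges, err := dependencyEdges(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(edges)
}
```

The point of the sketch is only the two-phase shape: collect name-level edges per file, then map names to the final package identifiers after cataloging completes.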
I want to revisit this statement for a bit:
> SPDX 2.2 relationships are used to describe what will be added to the artifact package in terms of new relationship types. The used relationships in the breakdown below are:
>
> - RUNTIME_DEPENDENCY_OF
> - DEV_DEPENDENCY_OF
> - BUILD_DEPENDENCY_OF
> - DEPENDENCY_OF
>
> Question: we might not be able to accurately determine build-vs-runtime dependency depending on the lack of context (e.g. python). Should we just use DEPENDENCY_OF instead in these cases? ...final answer: yes
I think there could be a compromise here to get the best of both worlds. The main problem with using all four relationship types is that it makes things a little harder for consumers (they need to know about all of the types and union the graphs together). The problem with using only DEPENDENCY_OF is that it's lossy, which isn't ideal when you're trying to discern nuance.
The compromise I propose is this: in syft JSON, use DEPENDENCY_OF, but annotate the `Data` field of the relationship with additional dependency qualities (such as whether it is a dev, runtime, or build dependency):
Even if the struct were something as simple as:
```go
// DependencyKind annotates a DEPENDENCY_OF relationship with the
// kind(s) of dependency it represents; more than one may be true.
type DependencyKind struct {
	Runtime     bool
	Development bool
	BuildTime   bool
}
```
would be a step forward, since it would allow for multiple options to be true without muddling the graph with more edges than necessary.
I feel that this would make a good trade-off in terms of making graph traversal easier to grok without losing information.
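A hedged sketch of what such an annotated edge could look like (the `relationship` struct here is a simplified stand-in for illustration, not syft's actual type, and the purl IDs are made up):

```go
package main

import "fmt"

// DependencyKind is the proposed annotation carried in a relationship's
// Data field; multiple flags may be true for the same edge.
type DependencyKind struct {
	Runtime     bool
	Development bool
	BuildTime   bool
}

// relationship is a simplified stand-in for syft's relationship type.
type relationship struct {
	Parent string // ID of the dependent package
	Child  string // ID of the dependency
	Type   string // always DEPENDENCY_OF under this proposal
	Data   any    // additional qualities of the edge
}

func main() {
	// One edge, qualified, rather than duplicated as separate
	// RUNTIME_DEPENDENCY_OF and BUILD_DEPENDENCY_OF edges.
	rel := relationship{
		Parent: "pkg:pypi/mypkg@1.0.0",
		Child:  "pkg:pypi/requests@2.31.0",
		Type:   "DEPENDENCY_OF",
		Data:   DependencyKind{Runtime: true, BuildTime: true},
	}
	fmt.Printf("%s -> %s (%+v)\n", rel.Parent, rel.Child, rel.Data.(DependencyKind))
}
```

Consumers that only care about the graph shape can ignore `Data` entirely, which is what makes this lossless without adding extra edges.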
Linking the latest and greatest SPDX 3.0 relationship types as a dev note for those picking this up on a per ecosystem basis: https://spdx.github.io/spdx-spec/v3.0/model/Core/Vocabularies/RelationshipType/#
Team consensus from our weekly gardening meeting is not to tackle https://github.com/anchore/syft/issues/572#issuecomment-1932781666, meaning we will only have DEPENDENCY_OF. Note: this means that if something is a dev, build, or runtime dependency, it will still be captured as DEPENDENCY_OF. In the future we might still try to add edge qualifications or more edge types of various kinds... but not in the first pass.
What would you like to be added: Support tracking the full dependency graph for packages in the form of relationships, for the ecosystems that support extracting this information.
Why is this needed: An SBOM is useful for at least listing what makes up a software artifact. However, it is more useful to know how a dependency is related to the artifact (is it a direct dependency? or a transitive dependency? is this dependency used by several other packages, or just one?).
Below is a list of each ecosystem we could implement this for (really, it's a list of all of the parsers for all catalogers). That doesn't mean we should implement the entire list; some ecosystems just don't raise up enough information to make adding relationships useful. This will have to be taken on a case-by-case basis.
These are catalogers that have been deemed not possible or practical to raise up relationships for at this time:
Notes: This assumes that https://github.com/anchore/syft/issues/556 is implemented, allowing for package catalogers to return relationships as first class evidence.