anchore / syft

CLI tool and library for generating a Software Bill of Materials from container images and filesystems
Apache License 2.0

Add support for package dependency relationships #572

Open wagoodman opened 3 years ago

wagoodman commented 3 years ago

What would you like to be added: Support tracking the full dependency graph for packages in the form of relationships, for the ecosystems that support extracting this information.

Why is this needed: An SBOM is useful for at least listing what makes up a software artifact. However, it is more useful to know how a dependency is related to the artifact (is it a direct dependency? or a transitive dependency? is this dependency used by several other packages, or just one?).

Below is a list of each ecosystem that we could implement this for (really it's a list of all of the parsers for all catalogers). That doesn't mean we should implement the entire list: some ecosystems just don't raise up enough information to make adding relationships useful. This will have to be decided on a case-by-case basis.

These are catalogers for which raising up relationships has been deemed not possible or practical at this time:

Notes: This assumes that https://github.com/anchore/syft/issues/556 is implemented, allowing for package catalogers to return relationships as first class evidence.

hectorj2f commented 2 years ago

@wagoodman I started to play with the package dependencies for CycloneDX: https://github.com/hectorj2f/syft/tree/hectorj2f/add_dependencies_to_cyclonedx. I am only generating the dependencies for the components, as the CycloneDX format recommends. Let me know if you prefer that I open a PR for that.

wagoodman commented 2 years ago

Some detail here regarding which ecosystems this will be feasible for in a static-analysis sense (not reaching out to external data sources, such as maven central).

SPDX 2.2 relationships are used to describe what will be added to the artifact package in terms of new relationship types. The relationships used in the breakdown below are:

  • RUNTIME_DEPENDENCY_OF
  • DEV_DEPENDENCY_OF
  • BUILD_DEPENDENCY_OF
  • DEPENDENCY_OF

Question: we might not be able to accurately determine build-vs-runtime dependency depending on the lack of context (e.g. python). Should we just use DEPENDENCY_OF instead in these cases? (final answer: yes)
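The fallback described in the question above can be sketched in Go. This is only an illustration, not syft code: the `Scope` type and the `relationshipFor` helper are hypothetical names, and the constants simply reuse the four SPDX 2.2 relationship names mentioned in this thread.

```go
package main

import "fmt"

// RelationshipType holds one of the SPDX 2.2 relationship names from this issue.
type RelationshipType string

const (
	RuntimeDependencyOf RelationshipType = "RUNTIME_DEPENDENCY_OF"
	DevDependencyOf     RelationshipType = "DEV_DEPENDENCY_OF"
	BuildDependencyOf   RelationshipType = "BUILD_DEPENDENCY_OF"
	DependencyOf        RelationshipType = "DEPENDENCY_OF"
)

// Scope is a hypothetical description of what a parser managed to learn
// about a dependency from the files it read.
type Scope int

const (
	ScopeUnknown Scope = iota
	ScopeRuntime
	ScopeDev
	ScopeBuild
)

// relationshipFor returns the most specific type the evidence supports and
// falls back to DEPENDENCY_OF when the scope cannot be determined.
func relationshipFor(s Scope) RelationshipType {
	switch s {
	case ScopeRuntime:
		return RuntimeDependencyOf
	case ScopeDev:
		return DevDependencyOf
	case ScopeBuild:
		return BuildDependencyOf
	default:
		return DependencyOf
	}
}

func main() {
	fmt.Println(relationshipFor(ScopeRuntime))
	fmt.Println(relationshipFor(ScopeUnknown))
}
```

A parser that cannot tell build-time from runtime usage (the python case mentioned above) would pass `ScopeUnknown` and still produce a usable edge.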

apk

Summary: direct runtime dependencies

dpkg

Summary: direct runtime dependencies

Relationships:

golang

go.mod

Summary: flat-subset of transitive build dependencies.

Relationships:

Summary: flat-transitive build dependencies.

Relationships:

java

pom.xml

Summary: flat-direct build dependencies.

Relationships:

manifest

Does not contain dependency information

javascript

yarn.lock

Summary: flat pseudo-transitive runtime dependency pins

Relationships:

package.lock

Summary: transitive dependency pins with full dependency-to-dependency graph

Relationships:

package.json

Summary: flat-direct runtime and dev dependency version ranges.

Relationships:

php

composer.lock

Summary: direct dependency version pins

Relationships:

installed.json

Summary: No relationships possible

Relationships:

python

poetry

Summary: flat-transitive dependency dev and runtime relationships

pipfile

Summary: flat-transitive dependency dev and runtime relationships

egg / dist metadata

Does not describe any relationships
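To make the breakdown above more concrete, here is a rough Go sketch of the one case described as carrying a full dependency-to-dependency graph, the npm lock file. It assumes the simplified `lockfileVersion` 1 layout, where each top-level entry under `dependencies` has a `version` and an optional `requires` map; nested entries and newer lock file versions are ignored. The `Edge` type, `edgesFromLock`, and the sample data are hypothetical and are not part of syft.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// lockEntry is one pinned package in a lockfileVersion 1 document.
type lockEntry struct {
	Version  string            `json:"version"`
	Requires map[string]string `json:"requires"`
}

type lockFile struct {
	Dependencies map[string]lockEntry `json:"dependencies"`
}

// Edge reads as "From is a DEPENDENCY_OF To".
type Edge struct{ From, To string }

// edgesFromLock turns every "requires" entry into an edge between two pins.
func edgesFromLock(data []byte) ([]Edge, error) {
	var lf lockFile
	if err := json.Unmarshal(data, &lf); err != nil {
		return nil, err
	}
	var edges []Edge
	for name, entry := range lf.Dependencies {
		for dep := range entry.Requires {
			pinned, ok := lf.Dependencies[dep]
			if !ok {
				continue // not pinned at the top level; nested pins are out of scope here
			}
			edges = append(edges, Edge{
				From: dep + "@" + pinned.Version,
				To:   name + "@" + entry.Version,
			})
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Slice(edges, func(i, j int) bool {
		if edges[i].To != edges[j].To {
			return edges[i].To < edges[j].To
		}
		return edges[i].From < edges[j].From
	})
	return edges, nil
}

// sampleLock is made-up data for illustration.
const sampleLock = `{
  "lockfileVersion": 1,
  "dependencies": {
    "a": {"version": "1.0.0", "requires": {"b": "^2.0.0", "c": "^3.0.0"}},
    "b": {"version": "2.1.0", "requires": {"c": "^3.0.0"}},
    "c": {"version": "3.0.0"}
  }
}`

func main() {
	edges, err := edgesFromLock([]byte(sampleLock))
	if err != nil {
		panic(err)
	}
	for _, e := range edges {
		fmt.Printf("%s DEPENDENCY_OF %s\n", e.From, e.To)
	}
}
```

The flat ecosystems in the list would only yield the first level of such edges, which is why each one is summarized separately.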

bureado commented 2 years ago

Can you elaborate more on the why for this feature? From my reading of it, what you are trying to do is to determine why a package foo of version bar made it into the thing you are scanning with syft.

If so, I fear dumping the dependency tree might not answer that question. Parsing the package manager operations log can approximate an answer, but the only deterministic way I’m aware of to do this is to perform actual process introspection during the image build to know exactly what ended up calling e.g. dpkg -i over a file on disk.

Conversely, with a purl that is differentiated enough you can augment the syft output with dependencies and much more metadata that is publicly known and available. If syft says nano version 1.2 is in this Ubuntu container of release foo, anyone can readily obtain the dependencies of that package from public sources.

Don’t get me wrong, I’m a fan of taking as much primary source data from the package manager in the scanned instance as possible. And I think flat SBOMs can be limited in many scenarios (log4j being just the latest widely covered scenario). But I think how the feature surfaces, what it tries to solve and how it changes the syft experience for people that are expecting a flat output might be worth additional consideration.

Thank you for working on syft and for helping syft users and the industry realize better outcomes through a thoughtful approach to the existing package manager metadata.

bureado commented 2 years ago

Adding one thing to my comment above. It’s possible that the “why” for this feature is not “what other binary depended on this binary/made this binary materialize” but more of a transitive “what other software was needed to make this binary that then went into my image”. That’s where Build-Depends and Built-Using (in the case of dpkg) would be more useful, but the in-artifact package manager metadata might not contain that information. Ideally, that information would be carried in each package’s own SBOM, but in practice the trend seems to be that metadata will live in publicly queryable services. Meaning that maybe this augmentation of syft output could be a post-analyze stage?

wagoodman commented 2 years ago

@bureado thanks for your thoughts on this. We chatted a lot about this at a recent community meeting and internally as well, and I wanted to expose some of these conversations here in the issue.

Why do this feature? That's a fair question, and one that we've been exploring before trying to take it on. Squarely put, a list of packages without how they relate won't be able to answer questions about what could have introduced a package into the artifact.

Take log4j for example: knowing that you have it installed is very useful, but if your intent is to remove it you need to know how it got introduced. Is it a direct dependency of your application? Did another package bring it in? Maybe both? It happens that for Java packages the syft pkg.metadata.virtualPath is a good indicator for some of this, but it's heavily encoded, and an equivalent field isn't present for all ecosystems. Bringing in relationships to raise up common descriptions of what is in the underlying data makes sense in this case.
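The "how did it get introduced" question can be sketched as a walk over DEPENDENCY_OF edges. The Go snippet below is an illustration under assumed names (`Edge`, `pathsToRoots`, and the package names are all hypothetical): starting from the package of interest, it follows every dependent upward until it reaches a package that nothing else depends on.

```go
package main

import (
	"fmt"
	"strings"
)

// Edge reads as "From is a DEPENDENCY_OF To".
type Edge struct{ From, To string }

// pathsToRoots returns every chain from target up to a package that has no
// dependents, i.e. every route by which target could have been introduced.
func pathsToRoots(edges []Edge, target string) [][]string {
	dependents := map[string][]string{}
	for _, e := range edges {
		dependents[e.From] = append(dependents[e.From], e.To)
	}

	var out [][]string
	var walk func(node string, path []string, onPath map[string]bool)
	walk = func(node string, path []string, onPath map[string]bool) {
		path = append(path, node)
		parents := dependents[node]
		if len(parents) == 0 {
			// Copy the path, since the backing array is reused by sibling branches.
			out = append(out, append([]string(nil), path...))
			return
		}
		onPath[node] = true
		for _, p := range parents {
			if !onPath[p] { // skip cycles
				walk(p, path, onPath)
			}
		}
		onPath[node] = false
	}
	walk(target, nil, map[string]bool{})
	return out
}

func main() {
	edges := []Edge{
		{From: "log4j-core", To: "my-app"},   // direct dependency
		{From: "log4j-core", To: "some-lib"}, // also pulled in by a library
		{From: "some-lib", To: "my-app"},
	}
	for _, p := range pathsToRoots(edges, "log4j-core") {
		fmt.Println(strings.Join(p, " -> "))
	}
}
```

With a flat package list, neither of the two routes in this example would be visible; with edges, both the direct and the transitive introduction show up.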

The same can be said for vulnerability analysis. Say I see that I'm vulnerable to CVE-X-Y for this package. When combining this with VEX information in the future, which can indicate the applicability of a CVE from the publisher's perspective, knowing which path in the dependency tree the vulnerability match applies to starts to matter. This is only achievable by knowing the relationships between packages.

External data has richer relationship information. This is generally (nearly universally) true. Many ecosystems don't express full connectivity information between packages; however, their public repositories (e.g. PyPI, Maven Central, rubygems.org, etc.) have this information, and with some external querying you can get a better understanding of package-to-package relationships.

Sometime in the near future we want to add features that allow syft to leverage external data in an opt-in capacity. However, we do have enough raw information from the underlying artifact to convey package-to-package connectivity in most ecosystems (and we're trying to be up front about the limitations for each ecosystem in https://github.com/anchore/syft/issues/572#issuecomment-1000412163).

Does the existence of better connectivity data externally indicate that we should not express package-to-package relationships? Or that we should hold off until we do have this ability to query external sources? My take is that we can introduce this feature but allow for configurability of it (be able to change behavior or the source of this connectivity information, or turn it off altogether).

But I think how the feature surfaces, what it tries to solve and how it changes the syft experience for people that are expecting a flat output might be worth additional consideration.

I 100% agree with this. We still want to provide a flat list of packages, so no change there. This would add additional elements in the relationships section of the SBOM. If the sheer number of additional relationships would be the problem, then that points to having configuration in the future to turn off or augment this functionality.

Sorry for the radio silence on this @bureado, but happy to continue chatting about this.

wagoodman commented 2 years ago

from refinement:

fproulx-boostsecurity commented 2 years ago

We'd love for this to be supported! How far out is this on the roadmap? At least, I cannot make it work now.

VijayKumarMidde commented 2 years ago

+1. would love to see this feature on Syft. Is this feature on the roadmap?

Hritik14 commented 1 year ago

@wagoodman

External data has richer relationship information

This is something that is now easily available for public use (https://deps.dev). Are there any plans for incorporating it?

setchy commented 1 year ago

Ditto - I find that this feature would be incredibly helpful, particularly when using tools like DependencyTrack to visualize the dependency graph. Trivy has support for maintaining dependency relationships.

markgalpin commented 1 year ago

@wagoodman so I was looking at the parsing of Java archives, in the context of an effort to think about VEX document hierarchies and CycloneDX over a particular dataset of containers.

As far as I can tell, Syft currently doesn't provide any package-to-package "Relationship" information with Java archive parsing. The archive parser recursively takes a known Java archive object and checks what's inside based on the manifest files. Anecdotally, the archive parser seems to be what's most commonly invoked when handed a production container running Java. But there certainly IS a relationship if you are only reporting on the presence of one library because it was shipped inside the archive for another.

While opinions vary, generally from an SBOM perspective when we talk about a "dependency" we mean "if there's a problem with this, there may be a problem with the thing depending on it", or, for other use cases, what brought it in, as discussed elsewhere. And in THAT sense, the hierarchical information derived from the archive parsing seems to describe valid dependencies, even if you don't go to the next level of sorting out the pom files. That doesn't mean that the extra compile-scope issues in the pom couldn't be relevant. But knowing, when processing an SBOM, that the issue reported in jc-core is because that's a library inside the netty-common uberjar is actually pretty valuable.

Changing syft to output the hierarchy when extracting from Java archives isn't that hard. I could maybe PR it (I built a POC of it after I found issue #1972 because I needed an example of maven for my purposes). Then you get into the different TYPES of relationships: should this be dependencyOf or Contains?

One thing I do think about is that from a CycloneDX perspective, I would be inclined to say that any package-to-package relationship counts as a "Dependency" for its purpose. Although anecdotally, in terms of current syft output, this seems to mostly just arise in OS packages containing library package types such as python. Anything that makes SBOMs less flat is good for a variety of use cases.

As a note, processing NPM seems a bit harder within the current code framework. Right now the standard behavior of the NPM cataloger is to parse a package.json to retrieve a single package. So, as I understand the code architecture, for relationships to correctly point from one bom-ref to another you'd need the list of all npm packages, which means waiting until the end of the run and then processing the dependencies?
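A minimal Go sketch of the idea, assuming the recursive archive parse yields a simple tree: each nested archive produces one edge from its parent. This is not the POC mentioned above or syft's parser; `Archive`, `Relationship`, `nestedRelationships`, and the file names are hypothetical, and the relationship type is left as a parameter because the dependencyOf-versus-Contains question is open.

```go
package main

import "fmt"

// Archive is a hypothetical stand-in for what a recursive archive parse finds.
type Archive struct {
	Name   string
	Nested []Archive
}

// Relationship points from the outer archive to the archive shipped inside it.
// A DEPENDENCY_OF-style edge would point the other way (inner to outer).
type Relationship struct{ From, To, Type string }

// nestedRelationships emits one edge per parent/child pair, depth first.
func nestedRelationships(a Archive, relType string) []Relationship {
	var rels []Relationship
	for _, child := range a.Nested {
		rels = append(rels, Relationship{From: a.Name, To: child.Name, Type: relType})
		rels = append(rels, nestedRelationships(child, relType)...)
	}
	return rels
}

func main() {
	tree := Archive{
		Name: "app.war",
		Nested: []Archive{
			{Name: "lib-a.jar", Nested: []Archive{{Name: "lib-b.jar"}}},
			{Name: "lib-c.jar"},
		},
	}
	for _, r := range nestedRelationships(tree, "contains") {
		fmt.Printf("%s %s %s\n", r.From, r.Type, r.To)
	}
}
```

Whichever type is chosen, the hierarchy itself is preserved, which is the part that makes the SBOM less flat.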
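The end-of-run step described here amounts to a two-pass approach, sketched below in Go. It is an assumption about how such a step could look, not a description of syft's cataloger architecture: `Package`, `resolve`, and the IDs are hypothetical, and the sketch resolves by name only, so multiple installed versions of the same package are not handled.

```go
package main

import "fmt"

// Package is a hypothetical, already-cataloged package. ID stands in for
// whatever identifier the output format uses to refer to it.
type Package struct {
	ID   string
	Name string
	Deps []string // dependency names as declared in its package.json
}

// Relationship reads as "From is a DEPENDENCY_OF To".
type Relationship struct{ From, To string }

// resolve runs after every package has been collected: pass one indexes
// packages by name, pass two turns declared names into ID-to-ID edges.
func resolve(pkgs []Package) ([]Relationship, []string) {
	byName := make(map[string]string, len(pkgs))
	for _, p := range pkgs {
		byName[p.Name] = p.ID
	}

	var rels []Relationship
	var unresolved []string
	for _, p := range pkgs {
		for _, dep := range p.Deps {
			id, ok := byName[dep]
			if !ok {
				unresolved = append(unresolved, dep) // declared but never cataloged
				continue
			}
			rels = append(rels, Relationship{From: id, To: p.ID})
		}
	}
	return rels, unresolved
}

func main() {
	pkgs := []Package{
		{ID: "pkg-1", Name: "express", Deps: []string{"body-parser", "missing-dep"}},
		{ID: "pkg-2", Name: "body-parser"},
	}
	rels, unresolved := resolve(pkgs)
	fmt.Println(rels)
	fmt.Println(unresolved)
}
```

Resolving per file would fail for `express` here, because `body-parser` is only known once its own package.json has been parsed.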

wagoodman commented 9 months ago

I want to revisit this statement for a bit:

SPDX 2.2 relationships are used to describe what will be added to the artifact package in terms of new relationship types. The used relationships in the breakdown below are:

  • RUNTIME_DEPENDENCY_OF
  • DEV_DEPENDENCY_OF
  • BUILD_DEPENDENCY_OF
  • DEPENDENCY_OF

Question: we might not be able to accurately determine build-vs-runtime dependency depending on the lack of context (e.g. python). Should we just use DEPENDENCY_OF instead in these cases? ...final answer: yes

I think there could be a compromise here to get the best of both worlds. The main problem with using all 4 relationship types is that it makes it a little harder for consumers to use (they need to know about all types and union the graph together). The problem with using only DEPENDENCY_OF is that it's lossy, which isn't ideal when you're trying to discern nuance.

The compromise I propose is this: in syft JSON use DEPENDENCY_OF, but annotate the Data field of the relationship with additional dependency qualities (such as whether it is a dev, runtime, or build dependency):

https://github.com/anchore/syft/blob/da31eed6374de15a4b684c34fa7e63c770878190/syft/artifact/relationship.go#L37-L42

Even if the struct was something simple like:

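// DependencyKind qualifies a single DEPENDENCY_OF edge.
// More than one field may be true for the same edge.
type DependencyKind struct {
  Runtime bool     // needed when the software runs
  Development bool // needed only during development
  BuildTime bool   // needed only to build the software
}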

would be a step forward, since it would allow for multiple options to be true without muddling the graph with more edges than necessary.

I feel that this would make a good trade-off in terms of making graph traversal easier to grok without losing information.
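To illustrate the compromise, the Go sketch below folds typed edges into a single DEPENDENCY_OF edge per package pair and records the qualities on a `Data` field. It reuses the `DependencyKind` struct proposed above; the `TypedEdge` and `Relationship` types and the `collapse` helper are simplified, hypothetical stand-ins rather than syft's actual relationship type.

```go
package main

import "fmt"

// DependencyKind is the struct proposed above.
type DependencyKind struct {
	Runtime     bool
	Development bool
	BuildTime   bool
}

// TypedEdge is an edge as a parser might first see it, one per dependency kind.
type TypedEdge struct{ From, To, Type string }

// Relationship is a simplified stand-in carrying the annotation in Data.
type Relationship struct {
	From, To string
	Type     string
	Data     DependencyKind
}

// collapse keeps one DEPENDENCY_OF edge per (From, To) pair and sets a flag
// for each kind that was observed, so no information is dropped.
func collapse(edges []TypedEdge) []Relationship {
	index := map[[2]string]int{}
	var out []Relationship
	for _, e := range edges {
		key := [2]string{e.From, e.To}
		i, ok := index[key]
		if !ok {
			i = len(out)
			index[key] = i
			out = append(out, Relationship{From: e.From, To: e.To, Type: "DEPENDENCY_OF"})
		}
		switch e.Type {
		case "RUNTIME_DEPENDENCY_OF":
			out[i].Data.Runtime = true
		case "DEV_DEPENDENCY_OF":
			out[i].Data.Development = true
		case "BUILD_DEPENDENCY_OF":
			out[i].Data.BuildTime = true
		}
	}
	return out
}

func main() {
	edges := []TypedEdge{
		{From: "lodash", To: "app", Type: "RUNTIME_DEPENDENCY_OF"},
		{From: "lodash", To: "app", Type: "DEV_DEPENDENCY_OF"},
		{From: "make", To: "app", Type: "BUILD_DEPENDENCY_OF"},
	}
	for _, r := range collapse(edges) {
		fmt.Printf("%s %s %s %+v\n", r.From, r.Type, r.To, r.Data)
	}
}
```

A consumer that only cares about connectivity can ignore `Data` entirely and traverse one edge type, while a consumer that needs the nuance can filter on the flags.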

spiffcs commented 8 months ago

Linking the latest and greatest SPDX 3.0 relationship types as a dev note for those picking this up on a per ecosystem basis: https://spdx.github.io/spdx-spec/v3.0/model/Core/Vocabularies/RelationshipType/#

wagoodman commented 8 months ago

Team consensus from our weekly gardening meeting is to not tackle https://github.com/anchore/syft/issues/572#issuecomment-1932781666, meaning we will only have DEPENDENCY_OF. Note: this means that if something is a dev, build, or runtime dependency then it will still be captured as DEPENDENCY_OF. In the future we might still try to tackle adding edge qualifications or more edges of various types, but not on the first pass.