Add support for package dependency relationships

wagoodman commented 3 years ago

What would you like to be added: Support tracking the full dependency graph for packages in the form of relationships, for the ecosystems that support extracting this information.

Why is this needed: An SBOM is useful for at least listing what makes up a software artifact. However, it is more useful to know how a dependency is related to the artifact (is it a direct dependency? or a transitive dependency? is this dependency used by several other packages, or just one?).

Below is a list of each ecosystem that we could implement this for (really it's a list of all of the parsers for all catalogers). It doesn't mean that we should implement this entire list; there are some ecosystems that just don't raise up enough information to make adding relationships useful. This will have to be taken on a case-by-case basis.

[x] Apk #1063
[x] Dpkg #2040
[x] ALPM #2851
[x] Conan (conan.lock)
[ ] Conan (conaninfo.txt)
[x] Conan (conanfile.txt)
[ ] Dart (pubspec.lock)
[x] .NET (deps.json) #2143
[ ] .NET (from binary)
[ ] Elixir (mix.lock)
[ ] Erlang (rebar.lock)
[ ] Github actions workflows (workflows using actions)
[ ] Golang (go.mod) #2353 ... hold off on doing this until the new golang source cataloger lands @spiffcs ⌛
[x] Golang (binary) #2912
[ ] Haskell (stack.yaml)
[ ] Haskell (stack.yaml.lock)
[ ] Haskell (cabal.project.freeze)
[ ] Java (nested jars)
[x] #3189
[ ] Java (gradle.lockfile)
[ ] Javascript (package.json) #3108
[ ] Javascript (package-lock.json) #2348 #2305 #3109
[ ] Javascript (yarn.lock) #2305
[ ] Javascript (pnpm-lock.yaml) #2305
[ ] Kernel modules #1694 <-- this PR adds kernel-to-module relationships, but we don't denote dependencies of components within those modules. Is that what's needed here?
[ ] PHP (installed.json)
[ ] PHP (composer.lock)
[ ] Portage (contents file)
[x] Python (poetry.lock) #2906
[x] Python (egg/wheel metadata) #2903
[ ] R (description file)
[x] RPM (db) #2872
[ ] Ruby (gemfile.lock)
[ ] Ruby (specifications gemspec)
[ ] Rust (cargo.lock) #2353
[ ] Rust (binary)
[ ] SBOM
[ ] Swift (package.resolved)
[ ] Swift (Podfile.lock)

These are catalogers for which raising up relationships has been deemed not possible / practical at this time:

Nix (store): there is no available metadata to reference
Python (setup.py): no relationship information
Python (requirements.txt): no relationship information
Python (Pipfile.lock): no relationship information to the package this is being installed for
RPM (rpm file): there is relationship data, but not with potentially other installed packages

Notes: This assumes that https://github.com/anchore/syft/issues/556 is implemented, allowing for package catalogers to return relationships as first class evidence.

hectorj2f commented 2 years ago

@wagoodman I started to play with the package dependencies for cyclonedx https://github.com/hectorj2f/syft/tree/hectorj2f/add_dependencies_to_cyclonedx. I am only generating the dependencies for the components as cyclonedx format recommends. Let me know if you prefer to open a PR for that.

wagoodman commented 2 years ago

Some detail here regarding which ecosystems this will be feasible for in a static-analysis sense (not reaching out to external data sources, such as maven central).

SPDX 2.2 relationships are used to describe what will be added to the artifact package in terms of new relationship types. The used relationships in the breakdown below are:

RUNTIME_DEPENDENCY_OF
DEV_DEPENDENCY_OF
BUILD_DEPENDENCY_OF
DEPENDENCY_OF

Question: we might not be able to accurately determine build-vs-runtime dependency depending on the lack of context (e.g. python). Should we just use DEPENDENCY_OF instead in these cases? (final answer: yes)

apk

Summary: direct runtime dependencies

The D: section lists pull dependencies, which is a space-delimited list of package names and categorized dependencies. E.g. D:scanelf so:libc.musl-x86_64.so.1.
One problem is figuring out which of the dependencies are package names and which are other requirements.

Relationships:

RUNTIME_DEPENDENCY_OF for any packages listed as pull dependencies
question: should we make package-to-file relationships for so: dependencies that are found by the resolver?
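To make the "package name vs. other requirement" problem concrete, here is a minimal Go sketch of splitting a D: line into categorized entries. The `apkDependency` type and `parseAPKDepends` function are hypothetical names for illustration, not syft code, and the handling of `!` conflict markers and version operators is an assumption about the APK index format rather than something stated in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// apkDependency is a hypothetical type for this sketch, not a syft type.
type apkDependency struct {
	Kind string // "pkg" for a plain package name, otherwise the prefix ("so", "cmd", ...)
	Name string
}

// parseAPKDepends splits the value of a "D:" line into categorized entries.
func parseAPKDepends(line string) []apkDependency {
	var deps []apkDependency
	for _, field := range strings.Fields(strings.TrimPrefix(line, "D:")) {
		// Assumption: a leading "!" marks a conflict, not a dependency.
		if strings.HasPrefix(field, "!") {
			continue
		}
		// Assumption: everything from the first comparison operator on is a version constraint.
		if i := strings.IndexAny(field, "<>=~"); i >= 0 {
			field = field[:i]
		}
		// Entries without a "prefix:" are treated as package names.
		kind, name := "pkg", field
		if prefix, rest, ok := strings.Cut(field, ":"); ok {
			kind, name = prefix, rest
		}
		deps = append(deps, apkDependency{Kind: kind, Name: name})
	}
	return deps
}

func main() {
	for _, d := range parseAPKDepends("D:scanelf so:libc.musl-x86_64.so.1") {
		fmt.Printf("%s\t%s\n", d.Kind, d.Name)
	}
}
```

Entries with kind "pkg" could map directly to package-to-package relationships, while "so" entries would still need a resolver to find the package (or file) that provides them.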

dpkg

Summary: direct runtime dependencies

The Depends and Pre-Depends sections hold information about dependencies: "Both depends and pre-depends mention the dependencies a package needs before installing but pre-depends forces the installation and configuration of the dependency packages before even starting with the package that needs the dependencies" source

Relationships:

RUNTIME_DEPENDENCY_OF for any packages listed as dependencies
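As a rough illustration of what extracting names from Depends / Pre-Depends involves, here is a Go sketch. `parseDpkgDepends` is a hypothetical helper, not syft code. It assumes the usual control-file syntax: comma-separated clauses, `|` between alternatives, version constraints in parentheses, and an optional `:arch` qualifier.

```go
package main

import (
	"fmt"
	"strings"
)

// parseDpkgDepends returns the package names in a Depends / Pre-Depends value.
// Each inner slice is one comma-separated clause; its elements are the
// alternatives separated by "|" within that clause.
func parseDpkgDepends(value string) [][]string {
	var clauses [][]string
	for _, clause := range strings.Split(value, ",") {
		var alts []string
		for _, alt := range strings.Split(clause, "|") {
			name := strings.TrimSpace(alt)
			// Strip a version constraint "(>= 2.17)" or arch restriction "[amd64]".
			if i := strings.IndexAny(name, "(["); i >= 0 {
				name = strings.TrimSpace(name[:i])
			}
			// Strip an architecture qualifier such as "python3:any".
			if i := strings.Index(name, ":"); i >= 0 {
				name = name[:i]
			}
			if name != "" {
				alts = append(alts, name)
			}
		}
		if len(alts) > 0 {
			clauses = append(clauses, alts)
		}
	}
	return clauses
}

func main() {
	value := "libc6 (>= 2.17), debconf (>= 0.5) | debconf-2.0, python3:any"
	for _, clause := range parseDpkgDepends(value) {
		fmt.Println(strings.Join(clause, " | "))
	}
}
```

Keeping alternatives grouped matters here: only the alternative that is actually installed should end up as a relationship in the SBOM.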

golang

go.mod

Summary: flat-subset of transitive build dependencies.

go.mod contains a subset of transitive dependencies (but possibly more than direct dependencies). Cannot reason about dependency-to-dependency relationships

Relationships:

BUILD_DEPENDENCY_OF for any package listed in the go.mod
For go.mod, we currently cannot determine if a dependency is for testing or not.

go binary buildinfo section

Summary: flat-transitive build dependencies.

go binary buildinfo section contains transitive dependencies. Cannot reason about dependency-to-dependency relationships

Relationships:

BUILD_DEPENDENCY_OF for any package listed in the binary buildinfo section.
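For illustration only, this is roughly what "flat-transitive" looks like in Go terms. The standard library exposes the buildinfo section as `debug.BuildInfo` (via `runtime/debug.ReadBuildInfo` for the running program, or `debug/buildinfo.ReadFile` for a binary on disk), and `Deps` is a flat list with no parent information. The `relationship` type, `flatBuildDependencies`, and `sampleBuildInfo` are hypothetical names for this sketch and not how syft models it.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// relationship is a simplified stand-in for an SBOM relationship (not a syft type).
type relationship struct {
	From, To, Type string
}

// flatBuildDependencies relates every module in the buildinfo to the main
// module. There is no dependency-to-dependency data to draw on.
func flatBuildDependencies(info *debug.BuildInfo) []relationship {
	var rels []relationship
	for _, dep := range info.Deps {
		// A replaced module is what was actually compiled in.
		if dep.Replace != nil {
			dep = dep.Replace
		}
		rels = append(rels, relationship{
			From: dep.Path + "@" + dep.Version,
			To:   info.Main.Path,
			Type: "BUILD_DEPENDENCY_OF",
		})
	}
	return rels
}

// sampleBuildInfo is fabricated data so the sketch prints something even when
// the running program has no module dependencies.
func sampleBuildInfo() *debug.BuildInfo {
	return &debug.BuildInfo{
		Main: debug.Module{Path: "example.com/app"},
		Deps: []*debug.Module{
			{Path: "github.com/a/b", Version: "v1.2.0"},
			{Path: "github.com/c/d", Version: "v0.3.1",
				Replace: &debug.Module{Path: "github.com/fork/d", Version: "v0.3.2"}},
		},
	}
}

func main() {
	info, ok := debug.ReadBuildInfo()
	if !ok || len(info.Deps) == 0 {
		info = sampleBuildInfo()
	}
	for _, r := range flatBuildDependencies(info) {
		fmt.Printf("%s %s %s\n", r.From, r.Type, r.To)
	}
}
```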

java

pom.xml

Summary: flat-direct build dependencies.

There is a <dependencies> section which describes direct build dependencies

Relationships:

BUILD_DEPENDENCY_OF for any package listed in the dependency section.

manifest

Does not contain dependency information

javascript

yarn.lock

Summary: flat pseudo-transitive runtime dependency pins

Each package has a "dependencies" section, which lists only direct dependencies. Dependencies of these dependencies are NOT tracked. Also, the dependencies are not pinned versions.

Relationships:

RUNTIME_DEPENDENCY_OF for any packages listed in the yarn.lock
cannot use the "dependencies" section within each package (since it is not a pin)

package-lock.json

Summary: transitive dependency pins with full dependency-to-dependency graph

Under "dependencies", each pinned version specifies loose version requirements, in its "requires" section, for any packages that the pinned dependency requires.
The requires section always has names that map back to the same name in the "dependencies" section.

Relationships:

RUNTIME_DEPENDENCY_OF for any packages listed as dependencies
full dependency graph possible (dependency-to-dependency relationships)
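A minimal Go sketch of how the "requires" names could be joined back to the pinned entries to get dependency-to-dependency edges. The type and function names (`lockfile`, `lockEntry`, `lockfileEdges`) and the sample data are made up for illustration. It assumes the older lockfile layout described above (top-level "dependencies" with "version" and "requires") and ignores nested "dependencies" entries.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// lockEntry and lockfile model only the fields this sketch needs.
type lockEntry struct {
	Version  string            `json:"version"`
	Requires map[string]string `json:"requires"`
}

type lockfile struct {
	Dependencies map[string]lockEntry `json:"dependencies"`
}

// lockfileEdges resolves each name under "requires" to the pinned entry of the
// same name and returns "dep@version RUNTIME_DEPENDENCY_OF pkg@version" lines.
func lockfileEdges(data []byte) ([]string, error) {
	var lf lockfile
	if err := json.Unmarshal(data, &lf); err != nil {
		return nil, err
	}
	var edges []string
	for name, entry := range lf.Dependencies {
		for required := range entry.Requires {
			pinned, ok := lf.Dependencies[required]
			if !ok {
				continue // no pin to point at, so no edge
			}
			edges = append(edges, fmt.Sprintf("%s@%s RUNTIME_DEPENDENCY_OF %s@%s",
				required, pinned.Version, name, entry.Version))
		}
	}
	sort.Strings(edges) // map iteration order is random; sort for stable output
	return edges, nil
}

// sampleLock is fabricated example data.
const sampleLock = `{
  "dependencies": {
    "a": {"version": "1.0.0", "requires": {"b": "^2.0.0", "c": "^3.0.0"}},
    "b": {"version": "2.1.0", "requires": {"c": "^3.0.0"}},
    "c": {"version": "3.0.0"}
  }
}`

func main() {
	edges, err := lockfileEdges([]byte(sampleLock))
	if err != nil {
		panic(err)
	}
	for _, e := range edges {
		fmt.Println(e)
	}
}
```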

package.json

Summary: flat-direct runtime and dev dependency version ranges.

The "dependencies" section has a map of name:version values for direct dependencies
The "devDependencies" section has name:version values for direct dev dependencies
Note: version values are not pins, but range specifiers

Relationships:

RUNTIME_DEPENDENCY_OF for any packages listed as dependencies
DEV_DEPENDENCY_OF for any packages listed in devDependencies
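Sketching the package.json case in Go, under the assumption that only the "name", "dependencies", and "devDependencies" fields matter. The `packageJSON` and `edge` types, `directEdges`, and the sample document are hypothetical. Note that each edge can only carry a version range, not a resolved version.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// packageJSON models only the fields this sketch reads.
type packageJSON struct {
	Name            string            `json:"name"`
	Dependencies    map[string]string `json:"dependencies"`
	DevDependencies map[string]string `json:"devDependencies"`
}

// edge is a simplified stand-in for a relationship; Range is a specifier, not a pin.
type edge struct {
	From, To, Type, Range string
}

// directEdges emits one edge per direct runtime or dev dependency.
func directEdges(data []byte) ([]edge, error) {
	var pkg packageJSON
	if err := json.Unmarshal(data, &pkg); err != nil {
		return nil, err
	}
	var edges []edge
	for name, rng := range pkg.Dependencies {
		edges = append(edges, edge{From: name, To: pkg.Name, Type: "RUNTIME_DEPENDENCY_OF", Range: rng})
	}
	for name, rng := range pkg.DevDependencies {
		edges = append(edges, edge{From: name, To: pkg.Name, Type: "DEV_DEPENDENCY_OF", Range: rng})
	}
	// Sort by dependency name so output does not depend on map iteration order.
	sort.Slice(edges, func(i, j int) bool { return edges[i].From < edges[j].From })
	return edges, nil
}

// samplePackageJSON is fabricated example data.
const samplePackageJSON = `{
  "name": "my-app",
  "dependencies": {"express": "^4.18.0"},
  "devDependencies": {"jest": "^29.0.0"}
}`

func main() {
	edges, err := directEdges([]byte(samplePackageJSON))
	if err != nil {
		panic(err)
	}
	for _, e := range edges {
		fmt.Printf("%s (%s) %s %s\n", e.From, e.Range, e.Type, e.To)
	}
}
```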

php

composer.lock

Summary: direct dependency version pins

Direct dependencies only within the "packages" and "packages-dev" sections
versions are pinned

Relationships:

RUNTIME_DEPENDENCY_OF for any packages listed in "packages"
DEV_DEPENDENCY_OF for any packages listed in "packages-dev"

installed.json

Summary: No relationships possible

list of installed package versions and their required packages, required packages are only loose version specifiers

Relationships:

it's not clear that any relationships can be supported.

python

poetry

Summary: flat-transitive dependency dev and runtime relationships

Lists transitive dependencies in flat fashion (no dependency-to-dependency relationships)
Each dependency can be categorized (main and dev)

Relationships:
Can only describe that the main package relates to packages described within the poetry.lock (not a full transitive dependency graph)
DEV_DEPENDENCY_OF for packages with a "dev" category
~BUILD_DEPENDENCY_OF for packages with a "main" category (question: should this be RUNTIME_DEPENDENCY_OF since these dependencies are not required in the python compile step for generating pycs?)~ We cannot really distinguish these in all cases, so it is safer to use DEPENDENCY_OF

pipfile

Summary: flat-transitive dependency dev and runtime relationships

Lists transitive dependencies in flat fashion (no dependency-to-dependency relationships)
Each dependency goes underneath one of two json sections: default and develop

Relationships:
Can only describe that the default package relates to packages described within the lockfile (not a full transitive dependency graph)
DEV_DEPENDENCY_OF for packages with a "develop" category
~BUILD_DEPENDENCY_OF for packages with a "default" category (question: should this be RUNTIME_DEPENDENCY_OF since these dependencies are not required in the python compile step for generating pycs?)~ We cannot really distinguish these in all cases, so it is safer to use DEPENDENCY_OF

egg / dist metadata

Does not describe any relationships

bureado commented 2 years ago

Can you elaborate more on the why for this feature? From my reading of it, what you are trying to do is to determine why a package foo of version bar made it into the thing you are scanning with syft.

If so, I fear dumping the dependency tree might not answer that question. Parsing the package manager operations log can approximate an answer, but the only deterministic way I’m aware of to do this is to perform actual process introspection during the image build to know exactly what ended up calling e.g. dpkg -i over a file on disk.

Conversely, with a purl that is differentiated enough you can augment the syft output with dependencies and much more metadata that is publicly known and available. If syft says nano version 1.2 is in this Ubuntu container of release foo, anyone can readily obtain the dependencies of that package from public sources.

Don’t get me wrong, I’m a fan of taking as much primary source data from the package manager in the scanned instance as possible. And I think flat SBOMs can be limited in many scenarios (log4j being just the latest widely covered scenario) But I think how the feature surfaces, what it tries to solve and how it changes the syft experience for people that are expecting a flat output might be worth additional consideration.

Thank you for working on syft and for helping syft users and the industry realize better outcomes through a thoughtful approach to the existing package manager metadata.

bureado commented 2 years ago

Adding one thing to my comment above. It’s possible that the “why” for this feature is not “what other binary depended on this binary/made this binary materialize” but more of a transitive “what other software was needed to make this binary that then went into my image” and that’s where Build-Depends and Built-Using (in the case of dpkg) would be more useful but the in-artifact package manager metadata might not contain that information. Ideally, that information would carry in each package's own SBOM but in practice the trend seems to be that metadata will live in publicly queryable services. Meaning that maybe this augmentation of syft output could be a post-analyze stage?

wagoodman commented 2 years ago

@bureado thanks for your thoughts on this --we chatted a lot about this at a recent community meeting and internally as well... I wanted to expose some of these conversations here in the issue as well.

Why do this feature? That's a fair question, and one that we've been exploring before trying to take it on. Squarely put, a list of packages without how they relate won't be able to answer questions about what could have introduced a package into the artifact.

Take for example, knowing that you have log4j installed is very useful, though if your intent is to remove it you need to know how it got introduced. Is it a direct dependency of your application? Did another package bring it in? Maybe both? It happens that for java packages the syft pkg.metadata.virtualPath is a good indicator for some of this, but it's heavily encoded... and the same equivalent field isn't present for all ecosystems. Bringing in relationships to raise up common descriptions of what is in the underlying data makes sense in this case.

The same can be said for vulnerability analysis. Say I see that I'm vulnerable to CVE-X-Y for this package: when combining this with VEX information in the future, which can indicate the applicability of a CVE from the publisher's perspective, knowing which path in the dependency tree the vulnerability match comes through starts to matter... this is only achievable by knowing the relationships between packages.
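To illustrate the "how did this package get in here" question, here is a small Go sketch that walks a dependency graph and lists every path from the application to a given package. The graph shape (a map from a package to its direct dependencies), the `pathsTo` function, and the package names are all hypothetical; this is a consumer-side example, not part of syft.

```go
package main

import (
	"fmt"
	"strings"
)

// pathsTo returns every path from root to target. graph maps a package to the
// packages it depends on directly.
func pathsTo(graph map[string][]string, root, target string) [][]string {
	var out [][]string
	onPath := map[string]bool{} // guards against dependency cycles
	var walk func(node string, path []string)
	walk = func(node string, path []string) {
		if onPath[node] {
			return
		}
		path = append(path, node)
		if node == target {
			// Copy, because the backing array may be reused by later appends.
			out = append(out, append([]string(nil), path...))
			return
		}
		onPath[node] = true
		for _, dep := range graph[node] {
			walk(dep, path)
		}
		delete(onPath, node)
	}
	walk(root, nil)
	return out
}

func main() {
	// Fabricated example: log4j is both a direct and a transitive dependency.
	graph := map[string][]string{
		"my-app":        {"web-framework", "log4j"},
		"web-framework": {"log4j"},
	}
	for _, p := range pathsTo(graph, "my-app", "log4j") {
		fmt.Println(strings.Join(p, " -> "))
	}
}
```

With only a flat package list, both paths collapse into "log4j is present", which is exactly the information loss described above.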

External data has richer relationship information. This is generally (nearly universally) true. Many ecosystems don't express full connectivity information between packages, however, their public repository (e.g. PyPI, Maven central, rubygems.org, etc) have this information and with some external querying you can get a better understanding of package-to-package relationships.

Sometime in the near future we want to add in features that allow syft to leverage external data in an opt-in capacity. However, we do have enough raw information from the underlying artifact to convey package-to-package connectivity in most ecosystems (and we're trying to be forward with the limitations for each ecosystem in https://github.com/anchore/syft/issues/572#issuecomment-1000412163).

Does the existence of better connectivity data externally indicate that we should not express package-to-package relationships? Or that we should hold off until we do have this ability to query external sources? My take is that we can introduce this feature but allow for configurability of it (be able to change behavior or the source of this connectivity information, or turn it off altogether).

But I think how the feature surfaces, what it tries to solve and how it changes the syft experience for people that are expecting a flat output might be worth additional consideration.

I 100% agree with this. We still want to provide a flat list of packages, so no change there. This would add additional elements in the relationships section of the SBOM. If it's the sheer number of additional relationships that would be the problem, then that points to having configuration to turn off or augment this functionality.

Sorry for the radio silence on this @bureado , but happy to continue chatting about this.

wagoodman commented 2 years ago

from refinement:

this issue should not get picked up directly for work, but instead we should be creating new issues to account for each ecosystem... not bite them all off at once.

fproulx-boostsecurity commented 2 years ago

We'd love for this to be supported! How far along is this on the roadmap? At least, I cannot make it work now.

VijayKumarMidde commented 2 years ago

+1. would love to see this feature on Syft. Is this feature on the roadmap?

Hritik14 commented 1 year ago

@wagoodman

External data has richer relationship information

This is something that is easily available now for public use (https://deps.dev). Are there any plans for incorporating the same?

setchy commented 1 year ago

Ditto - I find that this feature would be incredibly helpful, particularly when using tools like DependencyTrack to visualize the dependency graph. Trivy has support for maintaining dependency relationships

markgalpin commented 1 year ago

@wagoodman so I was looking at the parsing of java archives, in the context of an effort to think about Vex document hierarchies and cycloneDX over a particular dataset of containers.

As far as I can tell, currently Syft doesn't provide any "Relationship" information package-to-package with java archive parsing. Currently the archive parser recursively takes a known java archive object and checks what's inside based on the manifest files -- anecdotally the archive parser seems to be what's most commonly invoked when handed a production container running java. But there certainly IS a relationship if you are only reporting on the presence of one library because it was shipped inside the archive for another.

While opinions vary, generally from an SBOM perspective when we talk about a "dependency" we mean "if there's a problem with this, there may be a problem with the thing depending on it", or for use cases about bringing it in, as discussed elsewhere. And in THAT sense, the hierarchical information derived from the archive parsing seems like valid dependencies, even if you don't go into the next level of sorting out the pom files. That doesn't mean that the extra compile-scope issues in the pom couldn't be relevant. But knowing, when processing an SBOM, that the issue reported in jc-core is because that's a library inside the netty-common uberjar... is actually pretty valuable.

Changing syft to output the hierarchy when extracting from java archives isn't that hard. I could maybe PR it (I built a POC of it after I found issue #1972 because I needed an example of maven for my purposes). Then you get into the different TYPES of relationships, should this be dependencyOf or Contains...

One thing I do think about is that from a CycloneDX perspective, I would be inclined to say that any package-to-package relationship counts as a "Dependency" for its purpose. Although anecdotally, in terms of current syft output this seems to mostly just arise in OS packages containing library package types such as python etc. Anything that makes SBOMs less flat is good for a variety of use cases.

As a note, processing NPM seems a bit harder within the current code framework. Right now for NPM the standard behavior of cataloger is to parse a package json to retrieve a single package, so as I understand the code architecture, to get the list of all npm packages for relationships to correctly display one bomref to another you'd need to do it at the end of the run, and then process the dependencies?

wagoodman commented 9 months ago

I want to revisit this statement for a bit:

SPDX 2.2 relationships are used to describe what will be added to the artifact package in terms of new relationship types. The used relationships in the breakdown below are:

RUNTIME_DEPENDENCY_OF

DEV_DEPENDENCY_OF

BUILD_DEPENDENCY_OF

DEPENDENCY_OF

Question: we might not be able to accurately determine build-vs-runtime dependency depending on the lack of context (e.g. python). Should we just use DEPENDENCY_OF instead in these cases? ...final answer: yes

I think there could be a compromise here to get the best of both worlds. The main problem with using all 4 relationship types is that it makes it a little harder for consumers to use (they need to know about all types and union the graph together). The problem with using only DEPENDENCY_OF is that it's lossy, which isn't ideal when you're trying to discern nuance.

The compromise I propose is this: In syft JSON use DEPENDENCY_OF , but annotate the Data field of the relationship with additional dependency qualities (such as is it a dev dependency, runtime, build, etc):

https://github.com/anchore/syft/blob/da31eed6374de15a4b684c34fa7e63c770878190/syft/artifact/relationship.go#L37-L42

Even if the struct was something simple like:

type DependencyKind struct {
  Runtime bool
  Development bool
  BuildTime bool
}

would be a step forward, since it would allow for multiple options to be true without muddling the graph with more edges than necessary.

I feel that this would make a good trade-off in terms of making graph traversal easier to grok without losing information.
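A rough Go sketch of what this compromise could look like from a consumer's point of view: several typed edges between the same two packages collapse into one DEPENDENCY_OF edge whose Data carries the qualities. `Relationship` here is a simplified local stand-in (string identifiers instead of syft's real types), and `collapse` is a hypothetical helper. Only the `DependencyKind` fields come from the proposal above.

```go
package main

import "fmt"

// DependencyKind mirrors the struct proposed above.
type DependencyKind struct {
	Runtime     bool
	Development bool
	BuildTime   bool
}

// Relationship is a simplified stand-in for this sketch, not syft's artifact.Relationship.
type Relationship struct {
	From, To string
	Type     string
	Data     any
}

// collapse merges typed edges between the same pair of packages into a single
// DEPENDENCY_OF edge, recording the original qualities in Data.
func collapse(typed []Relationship) []Relationship {
	type pair struct{ from, to string }
	index := map[pair]int{}
	var out []Relationship
	for _, r := range typed {
		k := pair{r.From, r.To}
		i, seen := index[k]
		if !seen {
			i = len(out)
			index[k] = i
			out = append(out, Relationship{From: r.From, To: r.To, Type: "DEPENDENCY_OF", Data: DependencyKind{}})
		}
		kind := out[i].Data.(DependencyKind)
		switch r.Type {
		case "RUNTIME_DEPENDENCY_OF":
			kind.Runtime = true
		case "DEV_DEPENDENCY_OF":
			kind.Development = true
		case "BUILD_DEPENDENCY_OF":
			kind.BuildTime = true
		}
		out[i].Data = kind
	}
	return out
}

func main() {
	edges := collapse([]Relationship{
		{From: "typescript", To: "my-app", Type: "DEV_DEPENDENCY_OF"},
		{From: "typescript", To: "my-app", Type: "BUILD_DEPENDENCY_OF"},
		{From: "express", To: "my-app", Type: "RUNTIME_DEPENDENCY_OF"},
	})
	for _, e := range edges {
		fmt.Printf("%s %s %s %+v\n", e.From, e.Type, e.To, e.Data)
	}
}
```

A consumer that only cares about reachability can ignore Data entirely and traverse a single edge type, which is the ease-of-use argument made above.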

spiffcs commented 8 months ago

Linking the latest and greatest SPDX 3.0 relationship types as a dev note for those picking this up on a per ecosystem basis: https://spdx.github.io/spdx-spec/v3.0/model/Core/Vocabularies/RelationshipType/#

wagoodman commented 8 months ago

Team consensus from our weekly gardening meeting is to not tackle https://github.com/anchore/syft/issues/572#issuecomment-1932781666 , meaning we will only have DEPENDENCY_OF. Note: this means that if something is a dev, build, or runtime dependency then it will still be captured as DEPENDENCY_OF. In the future we might still try and tackle adding edge qualifications or more edges of various types... but not on the first pass.
