Invoke known tools to gather build-time dependency information

kzantow commented 1 year ago

What would you like to be added: Add the ability to shell-out to known tools such as go and mvn in order to capture more accurate build-time dependency information.

Why is this needed: To improve the build-time dependency support in Syft.

Additional context: From a working document:

Creating higher quality SBOMs in Syft at build-time

Static analysis of dependencies at build time, as implemented today, is limited. It can be improved by simulating what build systems do, but that approach is subject to drift and needs additional maintenance to keep up with the behavior of the build systems.

One approach to resolving this issue is to call out to build systems to get that information instead. This introduces additional (optional) dependencies.

Syft as a build-time SBOM generator tool

Syft can be seen as a “build-time” SBOM generator tool, and can start thinking about utilizing build tooling. The following are some examples of calling out to build tools.

Golang

Instead of reading and trying to parse and resolve the go.mod file, the go mod graph command can be used to get a fully resolved dependency tree.

[ ] #2018
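As a rough illustration of the Go case, the sketch below shells out to `go mod graph` and turns its output into an adjacency map. Each output line holds two space-separated fields, a module and one of its requirements. The helper names `parseModGraph` and `goModGraph` are hypothetical and not part of Syft; this is a minimal sketch, assuming `go` is on the PATH, not a proposed implementation.

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"strings"
)

// parseModGraph turns `go mod graph` output into an adjacency map.
// Each line has the form "<module>[@version] <requirement>@<version>".
// (Hypothetical helper for illustration; not part of Syft.)
func parseModGraph(out string) map[string][]string {
	graph := make(map[string][]string)
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue // skip blank or unexpected lines
		}
		graph[fields[0]] = append(graph[fields[0]], fields[1])
	}
	return graph
}

// goModGraph runs `go mod graph` in dir and parses the result.
// It assumes the go tool is installed and dir contains a go.mod.
func goModGraph(dir string) (map[string][]string, error) {
	cmd := exec.Command("go", "mod", "graph")
	cmd.Dir = dir
	out, err := cmd.Output()
	if err != nil {
		return nil, fmt.Errorf("go mod graph: %w", err)
	}
	return parseModGraph(string(out)), nil
}

func main() {
	graph, err := goModGraph(".")
	if err != nil {
		fmt.Println("could not query go tooling:", err)
		return
	}
	for mod, deps := range graph {
		fmt.Println(mod, "->", deps)
	}
}
```

Keeping the parser separate from the command execution means the parsing logic can be tested without the go tool installed.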

Java

Maven has the mvn dependency:tree command which shows the fully-resolved dependency graph.

[ ] #2019

NPM

npm has the npm ls --all command, which prints the full dependency tree.

[ ] #2020
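For npm, one possible shape is to ask for JSON (`npm ls --all --json`) and walk the nested `dependencies` objects. The sketch below parses an inline sample instead of invoking npm. The sample document, the `npmNode` type, and `parseNpmLs` are illustrative assumptions, and only the `name`, `version`, and `dependencies` fields are read.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// npmNode models the subset of `npm ls --json` output used here.
// (Hypothetical type for illustration; other fields are ignored.)
type npmNode struct {
	Name         string             `json:"name"`
	Version      string             `json:"version"`
	Dependencies map[string]npmNode `json:"dependencies"`
}

// collectEdges walks the tree depth-first and records "parent -> child" edges.
// Child names are sorted so the result is deterministic.
func collectEdges(parent string, node npmNode, edges *[]string) {
	names := make([]string, 0, len(node.Dependencies))
	for name := range node.Dependencies {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		child := node.Dependencies[name]
		id := name + "@" + child.Version
		*edges = append(*edges, parent+" -> "+id)
		collectEdges(id, child, edges)
	}
}

// parseNpmLs converts the JSON document into a flat list of edges.
func parseNpmLs(data []byte) ([]string, error) {
	var root npmNode
	if err := json.Unmarshal(data, &root); err != nil {
		return nil, fmt.Errorf("parse npm ls output: %w", err)
	}
	var edges []string
	collectEdges(root.Name+"@"+root.Version, root, &edges)
	return edges, nil
}

func main() {
	// Illustrative sample standing in for real `npm ls --all --json` output.
	sample := []byte(`{
	  "name": "app", "version": "1.0.0",
	  "dependencies": {
	    "a": {"version": "1.2.3", "dependencies": {"b": {"version": "0.1.0"}}}
	  }
	}`)
	edges, err := parseNpmLs(sample)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, e := range edges {
		fmt.Println(e)
	}
}
```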

Python

Python has pipdeptree, a third-party tool that prints installed packages as a dependency tree.

[ ] https://github.com/anchore/syft/issues/2023

Considerations

When using external tooling, the tool's version and the parameters passed to it should be captured.
Warnings about the quality of the inputs used to generate the SBOM can be made visible, along with suggestions on how to obtain better SBOMs (e.g. dependency pinning).
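One way to capture version and parameter information is to record it alongside every external call. The sketch below is a minimal illustration under assumed names: `ToolInvocation`, `runner`, and `recordInvocation` are hypothetical and not an existing Syft API. The runner is passed in as a function so the recording logic can be exercised without the real tool.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// runner abstracts command execution so it can be replaced in tests.
type runner func(name string, args ...string) (string, error)

// execRunner runs the command on the host and returns its stdout.
func execRunner(name string, args ...string) (string, error) {
	out, err := exec.Command(name, args...).Output()
	return string(out), err
}

// ToolInvocation is a hypothetical record of how an external tool was used,
// suitable for attaching to SBOM metadata.
type ToolInvocation struct {
	Tool    string
	Version string
	Args    []string
}

// recordInvocation first queries the tool's version, then runs the real
// command, returning both the invocation record and the command output.
func recordInvocation(run runner, tool string, versionArgs, args []string) (ToolInvocation, string, error) {
	ver, err := run(tool, versionArgs...)
	if err != nil {
		return ToolInvocation{}, "", fmt.Errorf("query %s version: %w", tool, err)
	}
	out, err := run(tool, args...)
	if err != nil {
		return ToolInvocation{}, "", fmt.Errorf("run %s %s: %w", tool, strings.Join(args, " "), err)
	}
	inv := ToolInvocation{Tool: tool, Version: strings.TrimSpace(ver), Args: args}
	return inv, out, nil
}

func main() {
	inv, _, err := recordInvocation(execRunner, "go", []string{"version"}, []string{"mod", "graph"})
	if err != nil {
		fmt.Println("external tool unavailable:", err)
		return
	}
	fmt.Printf("tool=%s version=%q args=%v\n", inv.Tool, inv.Version, inv.Args)
}
```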

kzantow commented 1 year ago

cc: @lumjjb

wagoodman commented 1 year ago

This is most likely needed in order to achieve https://github.com/anchore/syft/issues/1674 and https://github.com/anchore/syft/issues/572 in a meaningful way.

This functionality should be opt-in, that is, by default syft should remain a static analysis tool. Executing other commands on the system should still be not allowed by default (again, unless the user opts in).

Considerations:

Should these "external querying capabilities" be encapsulated into their own separate catalogers? For example, go-mod-file-cataloger stays as it is today and we allow for a new go-tooling-cataloger. In this way, opting in would mean adding a cataloger (or enabling a flag that automatically swaps out one cataloger for another). Or should we go in the direction of keeping the existing catalogers but having them behave differently based on configuration? (One assumption I'm making by going down this path is that the implementation of go.mod cataloging today is mutually exclusive with using the build tooling.)
We probably don't want to find duplicate packages by doing both a static analysis and a tooling query; there should be an obvious mechanism for enforcing mutual exclusivity between existing (static) analysis and tooling analysis.
Even if a new cataloger is not used to encapsulate this behavior, it should be obvious to the user that a package was found via a tooling query vs looking at just the go.mod contents (probably through more than just the application configuration).
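To make the "separate cataloger" option concrete, here is a minimal sketch of opt-in selection in which exactly one of the two catalogers runs, and every package records which one found it. The cataloger names come from the comment above. `Config`, `UseBuildTools`, `Package`, `selectGoCataloger`, and `tagPackages` are simplified, hypothetical stand-ins, not Syft's actual types or configuration.

```go
package main

import "fmt"

// Config is a hypothetical options struct. The zero value keeps today's
// behavior: static analysis only, no external commands executed.
type Config struct {
	UseBuildTools bool
}

const (
	staticGoCataloger  = "go-mod-file-cataloger"
	toolingGoCataloger = "go-tooling-cataloger"
)

// Package is a simplified stand-in for a cataloged package.
type Package struct {
	Name    string
	Version string
	FoundBy string // which cataloger produced this package
}

// selectGoCataloger returns exactly one cataloger name, so static and
// tooling-based analysis can never both run and produce duplicates.
func selectGoCataloger(cfg Config) string {
	if cfg.UseBuildTools {
		return toolingGoCataloger
	}
	return staticGoCataloger
}

// tagPackages returns a copy of pkgs with FoundBy set, so the origin of
// each package is visible in the output and not only in the configuration.
func tagPackages(pkgs []Package, cataloger string) []Package {
	tagged := make([]Package, len(pkgs))
	for i, p := range pkgs {
		p.FoundBy = cataloger
		tagged[i] = p
	}
	return tagged
}

func main() {
	cfg := Config{UseBuildTools: true} // the user explicitly opted in
	cataloger := selectGoCataloger(cfg)
	pkgs := tagPackages([]Package{{Name: "github.com/a/b", Version: "v1.0.0"}}, cataloger)
	for _, p := range pkgs {
		fmt.Printf("%s@%s (found by %s)\n", p.Name, p.Version, p.FoundBy)
	}
}
```

The alternative raised above, a single cataloger that changes behavior based on configuration, would move the branch inside the cataloger; the provenance field would still be needed to tell the two sources apart.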

setchy commented 1 year ago

would the same be true for npm, too?

noqcks commented 12 months ago

Instead of shelling out to CLI tools, would you consider building parsers directly inside syft? It wouldn't require one to depend on the presence of local tooling, and I could envision that the tools might have different outputs depending on the installed version of the tooling.

Snyk, for example, has built a bunch of parsers for various ecosystems in js https://github.com/snyk/dotnet-deps-parser

I just implemented something similar in cdxgen for .NET [ref] and npm [ref] to determine direct/indirect deps in build files. Wondering if the same could work here but written in golang.

Add the ability to shell-out to known tools such as go and mvn in order to capture more accurate build-time dependency information.

Can you expand on what specifically would be more accurate? In my mind I can only imagine direct/indirect deps. But is there more?

I also noticed that syft doesn't generate a dependencies section for CycloneDX for different language specific files (go.mod, package-lock.json). Would the outcome of this issue be that this section would be filled?

kzantow commented 12 months ago

@noqcks we do already have lots of parsers for different ecosystems. This change, at least initially, would be an opt-in behavior to shell out to the tools. This would allow things like Go - which has a flat list of dependencies in the go.mod - to get the dependency graph and properly output it in different formats.

noqcks commented 12 months ago

I suppose what I meant is only shelling out to tools where necessary (in the case that go mod graph is truly the only way to see the dependency tree for go projects), and writing all other dep graph parsers directly into syft where possible.

I'd like to work on getting a real dependency graph for JavaScript projects inside syft, and wanted to write this dep parser inside syft instead of relying on an external npm CLI.

Wanted to clarify whether this would be an appropriate avenue to pursue before I started the work.

wagoodman commented 5 months ago

Since this is a potentially large item that would affect multiple ecosystems, I think a detailed plan is needed to move forward with this (how would this work within a single cataloger, what abstractions do we want to introduce (if any), would abstractions be generalizable to other ecosystem catalogers (if so, how), etc.).

anchore / syft

Invoke known tools to gather build-time dependency information #1562