Import transitive dependencies from SBOMs if available

ppkarwasz commented 2 months ago

Some Maven libraries publish shaded artifacts that contain many, if not all, of their dependencies.

Since it is impossible to guess which artifacts were shaded from the POM file alone, the CycloneDX plugin should try to use the CycloneDX SBOMs published by a project's dependencies, if available.
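As a rough illustration of what "if available" could mean in practice, the sketch below builds the repository URL where a dependency's SBOM might be found. It assumes the SBOM was deployed next to the jar as an attached artifact with the classifier `cyclonedx` and the extension `json`; both values, and the helper name `sbom_url`, are assumptions for illustration, and many artifacts publish no SBOM at all.

```python
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def sbom_url(group_id, artifact_id, version,
             repo=MAVEN_CENTRAL, classifier="cyclonedx", extension="json"):
    """Build the URL where a dependency's SBOM would live in a Maven repository.

    Assumption: the SBOM is deployed as an attached artifact named
    <artifactId>-<version>-<classifier>.<extension>.
    """
    # Maven repositories map the dots in a groupId to directory separators.
    group_path = group_id.replace(".", "/")
    file_name = f"{artifact_id}-{version}-{classifier}.{extension}"
    return f"{repo}/{group_path}/{artifact_id}/{version}/{file_name}"


# Candidate location only; the file may not exist for this artifact.
print(sbom_url("org.apache.maven.wagon", "wagon-http", "3.5.3"))
```

A generator would still have to handle the common case where the request returns 404 and fall back to the POM-only view.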

This feature request is related to #472.

VinodAnandan commented 2 months ago

@ppkarwasz Thanks for creating this issue. I had a similar idea that I discussed with @hboutemy in CycloneDX Slack, but I failed to provide him with concrete examples. Is it possible for you or anyone else (Cc: @raboof, @prabhu, @lfrancke, @stevespringett) interested in this feature to provide some concrete examples?

lfrancke commented 2 months ago

Examples of where this would be useful?

ppkarwasz commented 2 months ago

In many cases shaded artifacts are the final product and are not consumed by other Java artifacts. They end up in the binary tar.gz distribution of an application, so they are not a problem for the CycloneDX Maven plugin.

There are, however, valid (or at least justified) cases where a library shades another and often repackages it (in the sense that it changes the names of Java packages):

Shading and repackaging ASM was a common practice. Now it is less common, since ASM has a stable API.
pax-logging-log4j2 shades log4j-core and makes minor modifications to improve its OSGi support. This case is very unfortunate since versions of pax-logging-log4j2 prior to 2.0.13 are affected by at least one of the Log4Shell CVEs.
tomcat-dbcp is a repackaged version of commons-dbcp2 with the logging API replaced with tomcat-juli (which itself is a repackaged version of an old commons-logging).
Most of the bouncy-castle artifacts might be considered a "shaded" version of another BC artifact.

I consider SBOMs a build-tool- and language-independent way to expose a project's dependencies. It would be useful to use them to complement Maven's simplified dependency system, e.g. regarding conflicts.

raboof commented 2 months ago

A scenario:

Imagine:

You're using SBOMs to scan for advisories: you want to know what advisories exist for a product p and its parts.
This product has a dependency (d) that shades artifact a: the d jar contains all classes from a, but moved to a different package
d publishes an SBOM that correctly reports the fact that d contains a (e.g. through #472 or otherwise)
There is an advisory published for a
The SBOM for p is created with cyclonedx-maven-plugin

When running the vulnerability scanner, it should identify that p is potentially affected by the advisory for a. There are two ways the vulnerability scanner could learn that a is part of p:

The vulnerability scanner sees the dependency of p on d, fetches the SBOM for d, and finds out about the shaded a from there.
cyclonedx-maven-plugin sees the dependency of p on d, fetches the SBOM for d, and uses this information to include a in the SBOM for p (i.e., the feature described in this issue). The vulnerability scanner then takes this information from the SBOM of p.

So the choice is between implementing this in all vulnerability scanners (first approach), or implementing this in all SBOM generators (including cyclonedx-maven-plugin, second approach). AFAICS there is no obvious 'architectural' reason to choose one or the other. For 'regular' dependencies, you definitely want the second approach (because the pom of p may influence which transitive dependencies of d would get picked, so looking at d's SBOM would not be accurate for these). For 'shaded' dependencies, either approach would work. The fact that you want the second approach for 'regular' dependencies might be a motivation to go for the second approach and implement this in SBOM generators such as cyclonedx-maven-plugin.
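The second approach can be sketched as a merge of two CycloneDX JSON documents, loaded here as plain dictionaries. This is only an illustration, not the plugin's implementation: the function name `import_shaded_components` is made up, and it assumes that d's SBOM lists the shaded a as a top-level entry in `components` with a `bom-ref`.

```python
import copy


def import_shaded_components(parent_bom, dep_ref, dep_bom):
    """Return a copy of parent_bom that also lists the components found in
    dep_bom, attached below dep_ref in the dependency graph.

    Assumption: the dependency's SBOM reports its shaded content as
    top-level components that carry a bom-ref.
    """
    merged = copy.deepcopy(parent_bom)
    components = merged.setdefault("components", [])
    known = {c.get("bom-ref") for c in components}

    added = []
    for comp in dep_bom.get("components", []):
        ref = comp.get("bom-ref")
        if ref is None or ref in known:
            continue  # skip unidentifiable or already listed components
        components.append(copy.deepcopy(comp))
        known.add(ref)
        added.append(ref)

    # Record that the dependency (d) contains the imported components (a).
    deps = merged.setdefault("dependencies", [])
    entry = next((d for d in deps if d.get("ref") == dep_ref), None)
    if entry is None:
        entry = {"ref": dep_ref, "dependsOn": []}
        deps.append(entry)
    depends_on = entry.setdefault("dependsOn", [])
    for ref in added:
        if ref not in depends_on:
            depends_on.append(ref)
    return merged


p_bom = {
    "components": [{"bom-ref": "pkg:maven/x/d@1.0", "name": "d"}],
    "dependencies": [{"ref": "pkg:maven/x/p@1.0", "dependsOn": ["pkg:maven/x/d@1.0"]}],
}
d_bom = {"components": [{"bom-ref": "pkg:maven/x/a@2.0", "name": "a"}]}
merged = import_shaded_components(p_bom, "pkg:maven/x/d@1.0", d_bom)
print([c["name"] for c in merged["components"]])
```

A scanner that reads the merged SBOM of p then sees a without having to fetch d's SBOM itself.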

prabhu commented 2 months ago

For cases like these, we need to go beyond the package names to a vulnerability database that offers affected modules, imports, symbols, etc., which doesn't exist in the open-source world. When running cdxgen with the --deep argument, the Namespaces belonging to each package would also get collected and stored as an internal property, so some work on the SBOM side is possible.
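To make the idea concrete, here is a minimal sketch of matching on namespaces instead of package coordinates. The property name `Namespaces` is taken from the wording above, and the newline-separated value format is an assumption rather than cdxgen's documented output; the advisory side (a list of affected Java packages) is hypothetical, since no such open database is known to exist.

```python
def components_with_namespace(bom, affected_prefix, prop_name="Namespaces"):
    """Return the names of components whose recorded namespaces include
    affected_prefix or one of its sub-packages.

    Assumptions: namespaces are stored in a component property called
    prop_name, one namespace per line.
    """
    hits = []
    for comp in bom.get("components", []):
        for prop in comp.get("properties", []):
            if prop.get("name") != prop_name:
                continue
            namespaces = prop.get("value", "").splitlines()
            if any(ns == affected_prefix or ns.startswith(affected_prefix + ".")
                   for ns in namespaces):
                hits.append(comp.get("name"))
                break  # one match per component is enough
    return hits


bom = {
    "components": [
        {"name": "d", "properties": [
            {"name": "Namespaces",
             "value": "org.example.d\norg.apache.logging.log4j.core.lookup"}]},
        {"name": "other", "properties": [
            {"name": "Namespaces", "value": "org.example.other"}]},
    ]
}
print(components_with_namespace(bom, "org.apache.logging.log4j.core"))
```

Prefix matching only finds classes that were bundled without relocation. Repackaged classes live under a different package name, which is why symbol-level advisory data would be needed for those.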

hboutemy commented 2 months ago

To the examples of shaded content shared previously, I'd add one typical case: in the same GAV, there are both the initial .jar and a shaded one, like https://repo1.maven.org/maven2/org/apache/maven/wagon/wagon-http/3.5.3/

In this case, what should THE SBOM contain to describe the 2 different jars? How would a project consuming one of these jars as a dependency know what to use? Additional question: as the wagon project is a multi-module build, what about the aggregate SBOM vs the GAV-only ones? This question about aggregate SBOMs is valid both from a producer perspective (wagon) and a consumer perspective (a project consuming one artifact of wagon).

Does cyclonedx-maven-plugin really have a chance to magically detect the different cases without deep user configuration? How many additional files will the plugin have to download to do the advanced analysis?

Note: is this specific to the Java world or do other ecosystems have such cases?

There are serious deep-dive discussions to have to get the whole picture.

raboof commented 2 months ago

> in the same GAV, there are both the initial .jar and a shaded one, like https://repo1.maven.org/maven2/org/apache/maven/wagon/wagon-http/3.5.3/
>
> In this case, what should THE SBOM contain to describe the 2 different jars?

Even though those are in the same GAV, shouldn't we treat those jars as different artifacts and thus create different SBOMs for them?

> There are serious deep-dive discussions to have to get the whole picture.

Indeed!

ppkarwasz commented 2 months ago

> In this case, what should THE SBOM contain to describe the 2 different jars? How would a project consuming one of these jars as a dependency know what to use?

I think that the SBOM should describe all the artifacts sharing the same GAV (at least the binary ones). Some will be described as components, while others as assemblies. The classifier and type qualifiers of a pURL should be enough to tell them apart.
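A small sketch of how the two wagon-http jars mentioned earlier could be distinguished by pURL. The helper `maven_purl` is made up for this example, and the classifier value `shaded` is an assumption about how the second jar is published.

```python
from urllib.parse import quote


def maven_purl(group_id, artifact_id, version, classifier=None, type_="jar"):
    """Build a Maven package URL; classifier and type go into the qualifiers."""
    qualifiers = {}
    if classifier:
        qualifiers["classifier"] = classifier
    if type_:
        qualifiers["type"] = type_
    # Qualifiers are emitted sorted by key, with percent-encoded values.
    query = "&".join(f"{key}={quote(value, safe='')}"
                     for key, value in sorted(qualifiers.items()))
    base = f"pkg:maven/{group_id}/{artifact_id}@{version}"
    return f"{base}?{query}" if query else base


plain = maven_purl("org.apache.maven.wagon", "wagon-http", "3.5.3")
shaded = maven_purl("org.apache.maven.wagon", "wagon-http", "3.5.3",
                    classifier="shaded")
print(plain)
print(shaded)
```

Both identifiers share the same GAV and differ only in the `classifier` qualifier, so a single SBOM can list them as separate entries.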

As a more complex example, jakartaee-migration has 3 assemblies:

a shaded.jar,
a bin.zip,
and a bin.tar.gz.

BTW: I think that if VEX documents become compulsory, developers will think twice before publishing this kind of assembly. jakartaee-migration contains commons-compress and is vulnerable to all its CVEs.

CycloneDX / cyclonedx-maven-plugin

Import transitive dependencies from SBOMs if available #497