Add tight scoping to nodes in the dependency graph

stevespringett commented 1 year ago

Based on issues identified in https://github.com/CycloneDX/cyclonedx-maven-plugin/issues/310, which have also been discussed at https://github.com/guacsec/guac/issues/594 and in a Slack discussion on the topic, this enhancement will introduce tight scoping for nodes in the dependency graph. In doing so, CycloneDX will be able to represent components with differing dependency trees across different modules in the same BOM.

Credit to @knrc for discovering this issue and writing about it (blog post being published soon) and to @hboutemy for helping work through the issue with the Maven Plugin.

jkowalleck commented 1 year ago

Looking forward to @knrc's blog post. I need to understand the topic; I did not get it from the mentioned discussions.

knrc commented 1 year ago

@jkowalleck I've finally published the blog post. It contains a few common examples of when this can happen, and the tests within my PR contain two more.

stevespringett commented 1 year ago

The two recommended approaches in CycloneDX both involve component isolation.

1) Use component assemblies to specify all the components that are included in each Maven module, establishing a hierarchy within the BOM. Components with the same PURL (but different BOM refs) could be represented independently under each structure and the corresponding dependency tree could accurately reflect that.
2) Use BOM-Link to externalize and link to each BOM. This accomplishes the same thing as option 1, but uses multiple BOM files.

This ticket is to potentially provide a third way of representing this without the use of component isolation.

stevespringett commented 1 year ago

One way to solve this in the spec is to implement the recommendation provided in https://trustification.io/2023/03/20/cyclonedx-maven-aggregate-bom-why-not-to-trust.html which uses something similar to a Merkle tree. The recommendation states:

In order to achieve this we need to have some way of enriching the ref attribute value so it represents not only the specific component but also the direct dependencies in its particular location within the dependency graph, for example including a hash which would be derived from each set of dependencies. This hash would need to be calculated from the leaves back to the root, with every deviation in the dependency tree propagating its way up to the root.
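As an illustration of the quoted recommendation, here is a minimal Python sketch of deriving a ref from the leaves back to the root. It is not the code from the Maven plugin PR. The example PURLs, the module graphs, the `scoped_ref` helper, and the `::<digest>` suffix format are all assumptions made for this sketch, and it assumes the graph is acyclic.

```python
import hashlib

LIB = "pkg:maven/org.example/lib@1.0"
UTIL = "pkg:maven/org.example/util@2.0"
LOG = "pkg:maven/org.example/log@1.0"

# Hypothetical per-module dependency graphs: module B excludes UTIL from LIB.
MODULE_A = {"module-a": [LIB, LOG], LIB: [UTIL], UTIL: [], LOG: []}
MODULE_B = {"module-b": [LIB, LOG], LIB: [], LOG: []}


def scoped_ref(node, graph, cache=None):
    """Derive a bom-ref from a node and the scoped refs of its direct dependencies.

    Children are hashed first, so any deviation deep in the tree changes every
    ref on the path back to the root. Assumes the graph is acyclic.
    """
    if cache is None:
        cache = {}
    if node not in cache:
        child_refs = sorted(scoped_ref(dep, graph, cache) for dep in graph.get(node, []))
        digest = hashlib.sha256("\n".join([node, *child_refs]).encode("utf-8")).hexdigest()
        cache[node] = f"{node}::{digest[:12]}"  # suffix format is an assumption
    return cache[node]


ref_a = scoped_ref(LIB, MODULE_A)
ref_b = scoped_ref(LIB, MODULE_B)
print(ref_a != ref_b)                                          # LIB gets a different ref per module
print(scoped_ref(LOG, MODULE_A) == scoped_ref(LOG, MODULE_B))  # identical leaves share a ref
```

Because a node's digest covers the refs of its children, two occurrences of the same artifact only share a ref when their whole subtrees match.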

While this would work, it also introduces a number of other challenges. Looking for alternative approaches that:

Have a low impact on existing implementations
Have a low impact on existing consumption tools
Are backward compatible with previous CDX versions

hboutemy commented 1 year ago

Use component assemblies

now I think I understand this statement: you mean that having multiple components with the same purl in an assembly is ok for you, because there is the assembly (components in components), while having the same without the assembly additional components level is not ok?

stevespringett commented 1 year ago

having multiple components with the same purl in an assembly is ok for you, because there is the assembly (components in components), while having the same without the assembly additional components level is not ok?

That is correct.

It's a similar concept to LDAP or any other directory service such as:

cn=John Doe, ou=Sales, dc=example.com
cn=John Doe, ou=Legal, dc=example.com

is perfectly fine, but:

cn=John Doe, ou=Sales, dc=example.com
cn=John Doe, ou=Sales, dc=example.com

would result in deduplication because they are the same.

Similar concepts occur in CycloneDX where:

bom / components / component-1 / components / component-A
bom / components / component-2 / components / component-A

is perfectly fine, but

bom / components / component-A
bom / components / component-A

is not.
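The directory analogy can be sketched in a few lines of Python: a component is treated as a duplicate only when it repeats among its siblings at the same level. This is an illustrative check, not something the specification defines; the `duplicate_siblings` helper and the `pkg:generic/...` PURLs are hypothetical.

```python
A = "pkg:generic/component-A@1.0"

# component-A nested under two different parents: distinct "paths".
nested = [
    {"purl": "pkg:generic/component-1@1.0", "components": [{"purl": A}]},
    {"purl": "pkg:generic/component-2@1.0", "components": [{"purl": A}]},
]

# component-A listed twice at the same level: the same "path" twice.
flat = [{"purl": A}, {"purl": A}]


def duplicate_siblings(components, path="bom"):
    """Return (path, purl) pairs for components that repeat a purl among siblings."""
    seen = set()
    dups = []
    for comp in components:
        purl = comp["purl"]
        if purl in seen:
            dups.append((path, purl))
        seen.add(purl)
        # Recurse into the assembly; children live under their parent's path.
        dups.extend(duplicate_siblings(comp.get("components", []), f"{path}/{purl}"))
    return dups


print(duplicate_siblings(nested))
print(duplicate_siblings(flat))
```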

jkowalleck commented 1 year ago

https://github.com/CycloneDX/cyclonedx-node-npm/blob/main/docs/result.md
https://github.com/CycloneDX/cyclonedx-node-npm/blob/main/docs/component_deduplication.md
and the discussions linked in these two documents

jkowalleck commented 1 year ago

@stevespringett If I understand the topic correctly, then it is similar to the one with Node.js, where every module is encapsulated and has its own unique module-resolution tree.

I already solved it without any need for changes to the CDX spec. My practical solution, bare & flattened: https://github.com/CycloneDX/cyclonedx-node-npm/tree/main/demo/juice-shop/example-results

knrc commented 1 year ago

@jkowalleck It looks as if your flat version is the same as what I've implemented in my PR for the cyclonedx maven plugin. You also have the same component being duplicated at the top level but with different bom-refs.

jkowalleck commented 1 year ago

[...] have the same component being duplicated at the top level but with different bom-refs.

@knrc yes. because in my case the components are actually duplicated in the file system. they actually exist multiple times in the build artifact. I do not track runtime-resolution but actual components, which affect runtime-resolution.

read about the background:

https://github.com/CycloneDX/cyclonedx-node-npm/blob/main/docs/result.md
https://github.com/CycloneDX/cyclonedx-node-npm/blob/main/docs/component_deduplication.md
and the discussions linked in these two documents

knrc commented 1 year ago

@knrc yes. because in my case the components are actually duplicated in the file system. they actually exist multiple times in the build artifact. I do not track runtime-resolution but actual components, which affect runtime-resolution.

While these may not be duplicated in the filesystem, these artifacts are duplicated in the build, and if packaged individually would be duplicated in a similar way. Note this came up while generating an aggregate SBOM for a multi-module project, not an SBOM for a specific project.

stevespringett commented 1 year ago

One way which may work and would not involve Merkle-type trees, would be to leverage a CDX v1.5 capability, currently under development, which adds evidence of why a component was introduced/named and the occurrences of that component on the filesystem. This was suggested by Brian Fox at Sonatype.

See https://github.com/CycloneDX/specification/issues/129

If we simply expanded the meaning of a dependency ref, we could account for a single occurrence of a component, and not the component itself.

Currently it reads

References a component by the components bom-ref attribute

If we changed the meaning to:

References a component, or an occurrence of a component, by the objects bom-ref attribute

then I think we would be able to have independent dependency graphs for multi module builds.
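A minimal Python sketch of what the expanded meaning could look like for a consumer: a `ref` is looked up in an index that contains both component bom-refs and occurrence bom-refs. The `evidence.occurrences` entries with `bom-ref` and `location` follow my understanding of the v1.5 evidence schema; using an occurrence bom-ref as a dependency `ref` is only the idea floated here, and the `index_refs` helper, the ref values, and the locations are hypothetical.

```python
components = [
    {
        "bom-ref": "lib",
        "purl": "pkg:maven/org.example/lib@1.0",
        "evidence": {
            "occurrences": [
                {"bom-ref": "lib@module-a", "location": "module-a/pom.xml"},
                {"bom-ref": "lib@module-b", "location": "module-b/pom.xml"},
            ]
        },
    },
    {"bom-ref": "util", "purl": "pkg:maven/org.example/util@2.0"},
]

# Under the proposed wording, each occurrence can carry its own dependency list.
dependencies = [
    {"ref": "lib@module-a", "dependsOn": ["util"]},
    {"ref": "lib@module-b", "dependsOn": []},
]


def index_refs(components):
    """Map every bom-ref to (component, occurrence); occurrence is None for components."""
    index = {}
    for comp in components:
        index[comp["bom-ref"]] = (comp, None)
        for occ in comp.get("evidence", {}).get("occurrences", []):
            if "bom-ref" in occ:
                index[occ["bom-ref"]] = (comp, occ)
    return index


index = index_refs(components)
for dep in dependencies:
    comp, occ = index[dep["ref"]]
    print(comp["purl"], occ["location"], dep["dependsOn"])
```

The component metadata is recorded once, while each occurrence gets its own node in the dependency graph.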

knrc commented 1 year ago

One way which may work and would not involve Merkle-type trees, would be to leverage a CDX v1.5 capability, currently under development, which adds evidence of why a component was introduced/named and the occurrences of that component on the filesystem. This was suggested by Brian Fox at Sonatype.

Note the Merkle-type trees are an implementation detail of the PR I submitted for the maven plugin. They were not intended to be anything other than an easy way of creating differing, opaque bom-refs based on the transitive dependencies of the artifact. I would not expect this detail to be in any potential spec changes.

See #129

If we simply expanded the meaning of a dependency ref, we could account for a single occurrence of a component, and not the component itself.

Currently it reads

References a component by the components bom-ref attribute

If we changed the meaning to:

References a component, or an occurrence of a component, by the objects bom-ref attribute

then I think we would be able to have independent dependency graphs for multi module builds.

I'll take a look at this over the weekend and see if I can catch up on the discussion/proposal. I'm travelling to the UK on Monday/Tuesday but will try to follow up en-route.

stevespringett commented 1 year ago

Is this an issue the community wants to solve given that there are two supported ways to accomplish isolation of dependency graphs today:

Use component assemblies to specify all the components that are included in each Maven module, establishing a hierarchy within the BOM. Components with the same PURL (but different BOM refs) could be represented independently under each structure and the corresponding dependency tree could accurately reflect that.
Use BOM-Link to externalize and link to each BOM. This accomplishes the same thing as option 1, but uses multiple BOM files.

Updating the documentation from:

References a component by the components bom-ref attribute

to:

References a component, or an occurrence of a component, by the objects bom-ref attribute

would provide a third way of representing this data. The thing that CycloneDX has always done is be more prescriptive in its design - having guardrails in place so adopters are less likely to incorrectly use the spec. We've sacrificed flexibility and in return have dramatically eased the road to adoption.

My thoughts:

Having a third way to represent this goes against these principles
Although a simple change, it would break nearly all consumption tools

We are wrapping up v1.5. Any new changes need to be documentation only or bug fixes. No new changes to the core spec are planned at this point and all such changes would have to be moved to v1.6 or later.

What does the community think about this issue?

aloubyansky commented 1 year ago

Having multiple ways to represent the same "thing" is definitely a bad design decision. Even two alternatives is already confusing enough. For that reason I suspect assemblies and dependency graphs are not serving exactly the same purpose and so are not meant to be alternatives?

Which way is currently the recommended one today for multi-module Maven/Gradle projects? I would like to mention that differences in dependencies of the same Maven artifact appearing as a dependency in different modules are not that uncommon.

I could think of the following recommendations:

1) always use assemblies when generating an SBOM whether it's for a single project module or an aggregated one for the whole project for the reason of consistency and avoid confusing the consumers of these SBOMs by making it clear: it's always nested components;
2) use dependency graphs when generating SBOMs for individual project modules but use assemblies when generating an aggregated SBOM for the whole project. The issue here is that the same dependency information is recorded completely differently in these two cases and the consumers of the SBOMs will have to account for that.
3) use dependency graphs for individual project modules but forbid aggregation by merging content of the individual module SBOMs and force use of external links instead.
4) enhance the dependency graph model to allow proper merging during aggregation.

To compare both approaches (dependency graphs and assemblies) for aggregated SBOMs, I implemented both options in my own SBOM generator impl (I had to use my own for a few reasons) based on the specVersion 1.4. Here are the effects on the sizes of the generated SBOMs:

dependency graph-based approach: 7MB
assemblies approach: 138MB

The reason the dependency graph-based approach appears to be more compact is because it is and, imo, size matters here. The generator detects common dependency subgraphs, records them once and then references them where necessary, while in the assemblies-based approach it has to be reproduced fully at every nesting point.
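The effect can be illustrated with a toy Python sketch that counts component entries under each approach. The graph and the two helper functions are hypothetical, and the numbers say nothing about the 7MB and 138MB figures above; they only show why copying shared subtrees at every nesting point grows faster than recording each node once.

```python
# Hypothetical dependency graph in which "c" (and its subtree) is shared by "a" and "b".
GRAPH = {"app": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d", "e"], "d": [], "e": []}


def graph_entries(graph, root):
    """Count entries when each distinct node is recorded once and referenced by ref."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return len(seen)


def nested_entries(graph, node):
    """Count entries when every subtree is reproduced in full at each nesting point."""
    return 1 + sum(nested_entries(graph, dep) for dep in graph[node])


print(graph_entries(GRAPH, "app"))   # 6 distinct nodes
print(nested_entries(GRAPH, "app"))  # 9 entries once shared subtrees are copied
```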

Based on the above, I would vote for exploring an enhancement to the dependency graph model. I think it could be done in a backwards compatible way too. I described an idea in https://github.com/CycloneDX/cyclonedx-maven-plugin/issues/310#issuecomment-1483403428 and a follow up https://github.com/CycloneDX/cyclonedx-maven-plugin/issues/310#issuecomment-1483737700

aloubyansky commented 1 year ago

I think it could be done in a backwards compatible way too. I described an idea in CycloneDX/cyclonedx-maven-plugin#310 (comment) and a follow up CycloneDX/cyclonedx-maven-plugin#310 (comment)

I meant forward compatible, not backwards compatible, i.e. SBOMs generated before the enhancement should be readable by the tools that support the new model supporting the enhancement.

stevespringett commented 1 year ago

The two alternatives that exist are to use component assemblies or to externalize each module into its own BOM and simply reference them. It's the same strategy that is discussed in the NTIA documentation. Assemblies would leave you with one really large BOM while external would leave you with many smaller ones. In both cases, each module could be independently signed and compositions can be applied. For BOMs with assemblies, each assembly could be independently signed.

Dependencies can be applied to both of these methods.

always use assemblies when generating an SBOM whether it's for a single project module or an aggregated one for the whole project for the reason of consistency and avoid confusing the consumers of these SBOMs by making it clear: it's always nested components;

You wouldn't use assemblies for a single module project. All flat inventory is part of the metadata/component.

use dependency graphs when generating SBOMs for individual project modules but use assemblies when generating an aggregated SBOM for the whole project. The issue here is that the same dependency information is recorded completely differently in these two cases and the consumers of the SBOMs will have to account for that.

This doesn't make sense to me. Dependency graphs and inventory (assemblies or not) complement each other. You would always want to use dependencies.

use dependency graphs for individual project modules but forbid aggregation by merging content of the individual module SBOMs and force use of external links instead.

All BOMs should have dependency graphs regardless. However, I think this advice is just the opposite of point 1. So my takeaway, if I'm reading this correctly, is to make a choice between always using assemblies or always externalizing BOMs. That may be fine for the Maven plugin, but that approach doesn't work with large systems of systems where a combination of both approaches is quite common.

enhance the dependency graph model to allow proper merging during aggregation.

I don't know what this means.

The reason the dependency graph-based approach appears to be more compact is because it is and, imo, size matters here.

Yes, that would be true. However, the resulting SBOM cannot be used for some use cases such as file integrity monitoring where multiple instances of the same component may be present and all the metadata for each instance is tracked separately. Tracking them individually allows for many more use cases.

Based on the above, I would vote for exploring an enhancement to the dependency graph model.

We already have the ability to track individual occurrences of a component in v1.5, which leverages the same component metadata. A minor documentation change is all that's needed to make that a reality.

Regardless of the approach taken (modifying dependency model or using component occurrences), it will break all existing consumption tools which have taken years to mature to the point they are today.

Also, I'm not sure why the SBOM is so large. The largest SBOM I currently have has 9000 components and is 22MB. If you're including the full license text in the BOM, you may want to try to externalize them to reduce the size.

aloubyansky commented 1 year ago

We already have the ability to track individual occurrences of a component in v1.5, which leverages the same component metadata. A minor documentation change is all that's needed to make that a reality.

Sorry if I'm missing the obvious, where could I read about that?

Regardless of the approach taken (modifying dependency model or using component occurrences), it will break all existing consumption tools which have taken years to mature to the point they are today.

I don't yet see why it would be a breaking change, tbh.

Also, I'm not sure why the SBOM is so large. The largest SBOM I currently have has 9000 components and is 22MB. If you're including the full license text in the BOM, you may want to try to externalize them to reduce the size.

No, I don't include license texts at all. Is that project available publicly somewhere? Is that a multi-module Java project that isn't representing a single runtime? Just FYI, I heard an SBOM for OpenShift is 250MB, RHEL ~70MB.

aloubyansky commented 1 year ago

I wanted to elaborate on why the dependency model could benefit from an enhancement, given that @jkowalleck, @knrc and I have implemented "compact" dependency graph recording in three different codebases based on v1.4, which, as @jkowalleck said previously, means it is indeed already possible in v1.4.

It is possible in v1.4 at the "expense" of introducing a new component with a unique bom-ref every time it's found in a context where its dependencies are different compared to its other occurrences, a difference which does not depend on the package identified with a PURL itself - it's the same package. In the case of NodeJS, I guess, @jkowalleck would argue that every such occurrence would have a different location in the filesystem layout and that'd be enough of a difference. That's not necessarily true for Maven artifacts though. The same Maven artifact, in terms of its binary content and POM (license, dependencies, etc.), may (and will) appear in different contexts - it's the consumer that will manage the context in which the artifact is used. There is no way to manifest the same artifact (with the same metadata) just once today in an SBOM; we are forced to copy the same component along with all its metadata, adjusting only the bom-ref. Sure, if there is some metadata that's different (license, some external link, etc.) - that justifies a new component to record the new metadata. Otherwise it looks confusing; in fact, how a bom-ref is assigned in one case and not the other looks random.

As I described in https://github.com/CycloneDX/cyclonedx-maven-plugin/issues/310#issuecomment-1483403428, the v1.4 model already separates the Dependency entity from the Component entity, which makes sense. Today each component may be linked to only one dependency instance by its bom-ref. The proposed enhancement is to allow each component (with the same metadata) to be referenced from multiple Dependency instances. This will also bring it in line with the Maven/Gradle dependency model. Those familiar with the Maven/Gradle API around this know that Artifact (Component) and Dependency are separate entities and there could be multiple Dependency instances referencing the same Artifact, which makes perfect sense in these build systems. BTW, with this kind of model SBOMs could save on the final size even more, besides making the model closer to the model that's being manifested.
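To make the proposed shape concrete, here is a small Python sketch in which one component is recorded once and referenced from several dependency nodes. This is not CycloneDX schema and not the Maven or Gradle API; the `Component` and `DependencyNode` classes, their field names, and the node ids are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    bom_ref: str
    purl: str


@dataclass
class DependencyNode:
    node_id: str        # hypothetical: identity of this node within the graph
    component_ref: str  # bom-ref of the shared component
    depends_on: list = field(default_factory=list)  # node_ids of child nodes


# Each artifact's metadata is recorded exactly once.
components = [
    Component("lib", "pkg:maven/org.example/lib@1.0"),
    Component("util", "pkg:maven/org.example/util@2.0"),
]

# The same component appears in two contexts with different dependencies.
nodes = [
    DependencyNode("module-a/lib", "lib", ["module-a/util"]),
    DependencyNode("module-a/util", "util"),
    DependencyNode("module-b/lib", "lib"),  # util excluded in module B
]


def nodes_for(component_ref, nodes):
    """Return every dependency node that points at the given component."""
    return [n for n in nodes if n.component_ref == component_ref]


for node in nodes_for("lib", nodes):
    print(node.node_id, "->", node.depends_on)
```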

stevespringett commented 1 year ago

Support for individual occurrences of a component is included with CycloneDX v1.5 expanded evidence support. See https://github.com/CycloneDX/specification/issues/129#issuecomment-1483866157 for an example of how it looks. With a simple documentation change, we could simply add a dependency on an occurrence of a component, and one occurrence of a component could have a different dependency graph than another occurrence of a component.

I don't yet see why it would be a breaking change, tbh.

All existing consumption tools expect a dependency to include a reference to a component or service. Introducing a reference to another dependency object or an occurrence of a component will result in failure of all existing consumption tools. Striving for perfection will result in a broken dependency graph for tens of thousands of organizations until such time as consumption tools get updated. Keep in mind that the most popular consumption tool, OWASP Dependency-Track, is used by over 10K orgs and is responsible for analyzing over 300M components every month for known vulnerabilities. The organizations and federal agencies that use this tool, and others like it, will have broken dependency graphs until those tools get updated and those updates are rolled out.

I would highly recommend that the occurrence approach is leveraged since the facilities are already in place to support it. The only required change is documentation. I would also recommend that this behavior be optional (not the default) in the Maven plugin. For end consumers, this level of nuance will not help them identify and reduce risk, but a broken dependency graph will cripple their ability to do so.
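The compatibility concern can be sketched as the kind of check a consumer written against the current wording might perform: every `ref` and `dependsOn` entry is expected to match a component or service bom-ref. This is an illustration only, not code from Dependency-Track or any other tool; the `dangling_refs` helper and the sample BOM are hypothetical.

```python
def dangling_refs(bom):
    """Return dependency refs that match no component or service bom-ref."""
    known = set()
    root = bom.get("metadata", {}).get("component")
    if root and root.get("bom-ref"):
        known.add(root["bom-ref"])
    stack = list(bom.get("components", []))
    while stack:
        comp = stack.pop()
        if comp.get("bom-ref"):
            known.add(comp["bom-ref"])
        stack.extend(comp.get("components", []))  # nested assemblies
    known.update(s["bom-ref"] for s in bom.get("services", []) if s.get("bom-ref"))

    used = []
    for dep in bom.get("dependencies", []):
        used.append(dep["ref"])
        used.extend(dep.get("dependsOn", []))
    return sorted({ref for ref in used if ref not in known})


bom = {
    "metadata": {"component": {"bom-ref": "app"}},
    "components": [{"bom-ref": "lib"}, {"bom-ref": "util"}],
    "dependencies": [
        {"ref": "app", "dependsOn": ["lib@module-a"]},   # occurrence-style ref
        {"ref": "lib@module-a", "dependsOn": ["util"]},
    ],
}

print(dangling_refs(bom))  # the occurrence-style ref is unknown to this consumer
```

A tool with this assumption would see part of the graph as unresolvable until it learns to index occurrence refs as well.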

aloubyansky commented 1 year ago

Support for individual occurrences of a component is included with CycloneDX v1.5 expanded evidence support. See #129 (comment) for an example of how it looks. With a simple documentation change, we could simply add a dependency on an occurrence of a component, and one occurrence of a component could have a different dependency graph than another occurrence of a component.

Thanks for the link. A couple of questions.

1) would evidence appear under component?
2) would each bom-ref mentioned under occurrences reference a component with complete metadata (basically a copy of the component for which the occurrence is recorded) or only a subset that's different from the component for which the occurrence is recorded?
3) is the way of recording multiple occurrences of the same component (by duplicating the whole component but assigning a unique bom-ref value to it) as can be done in v1.4 still acceptable from your perspective?

I don't yet see why it would be a breaking change, tbh.

All existing consumption tools expect a dependency to include a reference to a component or service. Introducing a reference to another dependency object or an occurrence of a component will result in failure of all existing consumption tools. Striving for perfection will result in a broken dependency graph for tens of thousands of organizations until such time as consumption tools get updated. Keep in mind that the most popular consumption tool, OWASP Dependency-Track, is used by over 10K orgs and is responsible for analyzing over 300M components every month for known vulnerabilities. The organizations and federal agencies that use this tool, and others like it, will have broken dependency graphs until those tools get updated and those updates are rolled out.

On one hand, I recognize your concern. On the other hand, I am sure you'll be facing these challenges going forward. There should be a way to introduce these kinds of changes, assuming the community agrees they are worth it, of course. These kinds of changes will break tools that try using parsers developed for a previous version of the spec to parse SBOMs generated using the new spec. That's what the spec version is for, after all. I am sure scanner vendors are closely following the development of the spec and, if we can make the spec better, everyone should be on board, and the timing of introducing these changes could account for the adoption of the new spec by scanner vendors.

I would also recommend that this behavior be optional (not the default) in the Maven plugin.

Well, all it takes to run into this situation is to exclude a dependency in one module and not exclude it in another one. If I exclude a dependency, I do it on purpose - I don't want it on the classpath in this module but I do want it on the classpath in another module. I don't see it as an "optional behavior". The generator simply has to recognize it and manifest it properly. Otherwise, it'd be a bug in the generator.
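A simplified Python sketch of that situation: the same artifact resolves to different dependency sets in two modules because one module excludes a transitive dependency. Real Maven resolution (per-dependency exclusions, scopes, version mediation) is more involved; here exclusions are applied module-wide, and the `DECLARED` table and the `resolve` helper are hypothetical.

```python
# Hypothetical declared dependencies of each artifact, as found in their POMs.
DECLARED = {"lib": ["util", "log"], "util": [], "log": []}


def resolve(direct, exclusions, declared=DECLARED):
    """Return {artifact: [deps]} for one module, dropping excluded artifacts."""
    resolved = {}
    stack = list(direct)
    while stack:
        artifact = stack.pop()
        if artifact in resolved or artifact in exclusions:
            continue
        deps = [d for d in declared[artifact] if d not in exclusions]
        resolved[artifact] = deps
        stack.extend(deps)
    return resolved


module_a = resolve(["lib"], exclusions={"util"})  # util kept off this classpath
module_b = resolve(["lib"], exclusions=set())     # util wanted here

print(module_a["lib"])  # lib's dependencies as seen from module A
print(module_b["lib"])  # lib's dependencies as seen from module B
```

An aggregate BOM that records `lib` under a single bom-ref has to pick one of these two dependency lists, which is the inaccuracy this issue is about.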

stevespringett commented 1 year ago

would evidence appear under component

Yes

would each bom-ref mentioned under occurrences reference a component with complete metadata (basically a copy of the component for which the occurrence is recorded) or only a subset that's different from the component for which the occurrence is recorded?

Yes

is the way of recording multiple occurrences of the same component (by duplicating the whole component but assigning a unique bom-ref value to it) as can be done in v1.4 still acceptable from your perspective?

Yes, that can still be done and is still acceptable in an assembly.

This issue has to be sorted this Friday June 23 for it to be included in v1.5.

aloubyansky commented 1 year ago

would evidence appear under component

Yes

Got it.

would each bom-ref mentioned under occurrences reference a component with complete metadata (basically a copy of the component for which the occurrence is recorded) or only a subset that's different from the component for which the occurrence is recorded?

Yes

Sorry, a complete copy or a subset that's different?

is the way of recording multiple occurrences of the same component (by duplicating the whole component but assigning a unique bom-ref value to it) as can be done in v1.4 still acceptable from your perspective?

Yes, that can still be done and is still acceptable in an assembly.

Would it be considered acceptable w/o assemblies? I.e. recording everything in dependency graphs?

The issue is that "assemblies" don't seem to be a reasonable alternative. It's 7MB vs 138MB in my case.

Thanks.

stevespringett commented 1 year ago

Cannot get enough community involvement and feedback in time for v1.5 release. Moving to v1.6.

stevespringett commented 5 months ago

Moving to v1.7. Still needs to be fleshed out.

CycloneDX / specification #197