Use stricter identity comparison when merging components

mykter commented 1 week ago

Current Behavior

When uploading a BOM, components are merged based on their identity as defined here. Broadly speaking, if there are multiple components in the BOM with the same ID fields (PURL, CPE, name, version, etc), only one of them will be saved in the database.
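To make the current behaviour concrete, here is a minimal sketch of identity-based merging. It is not Dependency-Track's actual code: the field set in `identity_key` and the choice to keep the first duplicate are assumptions made for illustration only.

```python
def identity_key(component):
    # Hypothetical identity: a subset of the ID fields mentioned above.
    return (
        component.get("purl"),
        component.get("cpe"),
        component.get("name"),
        component.get("version"),
    )


def dedupe_by_identity(components):
    """Keep one component per identity; later duplicates are dropped."""
    seen = {}
    for component in components:
        # setdefault keeps the first component seen for each key (an assumption).
        seen.setdefault(identity_key(component), component)
    return list(seen.values())


bom = [
    {"name": "lib-a", "version": "1.0", "purl": "pkg:npm/lib-a@1.0",
     "properties": {"project": "frontend"}},
    {"name": "lib-a", "version": "1.0", "purl": "pkg:npm/lib-a@1.0",
     "properties": {"project": "backend"}},
]
merged = dedupe_by_identity(bom)
```

Both entries share the same identity, so only one survives and the `backend` property is lost.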

Proposed Behavior

Some BOMs contain multiple components with the same identity, but differing properties for these components. The BOM may contain components from many different projects, some of which might include the same dependencies.

In this scenario we don't just want to know that we depend on component A, or that component A has a vulnerability - we want to know which projects depend on A. We can only easily do that if every component is present in Dependency-Track as it is in the BOM.

Somehow we would like to be able to upload a BOM, and know that after the upload, the project in Dependency-Track will exactly mirror the contents of the BOM. Options include:

1. Change the existing identity check to use full equality. This will presumably break some use cases that depend on the existing identity-based merging? (I'm not sure what these use cases are.)
2. Add an option to use equality when uploading a BOM. This feels complex: the behaviour of a project could vary over time unexpectedly.
3. Add an option to a project, so it can be configured to use strict component equality:
   a. a new flag at the project level (like the "active" toggle)
   b. define a special property and use that, e.g. dependency-track / component-merge / strict
4. Make the behaviour configurable instance-wide in the settings.

Option 3a seems like a reasonable solution to me?

Checklist

[X] I have read and understand the contributing guidelines
[X] I have checked the existing issues for whether this enhancement was already requested

nscuro commented 1 week ago

Valid request. And it will be even more valid once we support additional metadata such as occurrences.

The identity based de-duplication has always been there, but I think with the recent refactoring of BOM processing, as well as introduction of component property support, it's now more obvious.

De-duplication is a major concern for users who merge multiple BOMs prior to upload - most merge tools don't pay attention to duplicates, so it's up to DT to resolve that. There are also BOM generators out there that will produce duplicate component records for monorepos, or multi-module projects.

That being said, even in those cases, I'd expect properties outside of the core identity to match as well. So I'm inclined to say we should be able to just switch to full equality and be done with it.

If we need to maintain multiple ways, we could just make it a flag in the BOM upload request, defaulting to identity-based de-duplication.

What could be problematic are BOM generators that yield non-reproducible outputs. For example if they put timestamps or otherwise dynamic data in properties. In that case you'll get lots of churn whenever you re-upload BOMs to existing projects.
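A rough sketch of that churn, assuming full equality is implemented by comparing a canonical serialisation of every field. The `scan-time` property and the helper names are hypothetical.

```python
import json


def full_key(component):
    """Canonical serialisation of all fields: any difference gives a new key."""
    return json.dumps(component, sort_keys=True)


def diff_uploads(previous, current):
    """Return (created, deleted) key sets between two uploads."""
    prev = {full_key(c) for c in previous}
    curr = {full_key(c) for c in current}
    return curr - prev, prev - curr


first_upload = [{"name": "lib-a", "version": "1.0",
                 "properties": {"scan-time": "2024-01-01T00:00:00Z"}}]
second_upload = [{"name": "lib-a", "version": "1.0",
                  "properties": {"scan-time": "2024-01-02T00:00:00Z"}}]

created, deleted = diff_uploads(first_upload, second_upload)
```

Nothing about `lib-a` itself changed, yet the second upload deletes one component record and creates another because the timestamp differs.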

mykter commented 1 week ago

> most merge tools don't pay attention to duplicates, so it's up to DT to resolve that. There are also BOM generators out there that will produce duplicate component records for monorepos, or multi-module projects.

I think there's a reasonable argument that it's up to the BOM producers to resolve that, not DT. Being able to say the BOM is the source of truth is a powerful simplifier, both for users and developers.

> So I'm inclined to say we should be able to just switch to full equality and be done with it.

Sounds good! Would you be open to PRs to implement this? Would we need it behind an experimental flag?

nscuro commented 1 week ago

> Would you be open to PRs to implement this?

Most certainly.

> Would we need it behind an experimental flag?

I think that would be good.

We can still decide to remove the flag later if we deem it unnecessary, but initially we should assume that there will be noticeable differences that users will need to "opt in" to.

mykter commented 1 week ago

I've been thinking about this some more and came up with a potential problem. Let's say you upgrade your BOM generator, and it adds a new metadata field as a property. I don't think anyone would expect this to cause a problem, but if we were using strict component equality then every vulnerability and policy violation would disappear and be recreated afresh the first time this new BOM was uploaded, with no triage status or notes etc.
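The scenario can be sketched as follows. This is an illustration under assumed data shapes, not how Dependency-Track stores findings: the `triage` field, the `generator:build-id` property, and both key functions are hypothetical.

```python
import json


def identity_key(component):
    return (component.get("purl"), component.get("name"), component.get("version"))


def strict_key(component):
    return json.dumps(component, sort_keys=True)


def carry_over_triage(existing, incoming, key):
    """For each incoming component, return the triage state of the matching
    existing record, or None if the component is treated as new."""
    triage_by_key = {key(e["component"]): e["triage"] for e in existing}
    return [triage_by_key.get(key(c)) for c in incoming]


old = {"name": "lib-a", "version": "1.0", "purl": "pkg:npm/lib-a@1.0",
       "properties": {}}
# The upgraded generator adds one new property; nothing else changes.
new = {**old, "properties": {"generator:build-id": "42"}}
existing = [{"component": old, "triage": "NOT_AFFECTED"}]

with_identity = carry_over_triage(existing, [new], identity_key)
with_strict = carry_over_triage(existing, [new], strict_key)
```

Under the identity key the triage decision is found again; under the strict key the component looks new, so there is nothing to carry over.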

So on either extreme we have:

1. Use strict equality (as we discussed above): consumers need to deal with vulnerabilities and policy violations that get recreated on any change to the generated BOM
2. Use the existing identity-based equality: consumers have to deal with not being able to represent multiple different instances of the same dependency in a BOM

Middle grounds I can think of:

3. Choose the behaviour on upload. In theory this allows the best of both worlds, as you could use strict equality most of the time, and identity-based when your BOM changes in some way. In practice I don't think this is realistic: you can't be fiddling with automated BOM uploads for one-off activities, and you'd probably only notice you needed this behaviour when it was too late and all your vulnerabilities had been recreated.
4. Make the equality check configurable to some degree: start with the existing fields as a base (or perhaps a bigger default set?), select other fields that you want to include, and if a component doesn't match on all these fields then it's treated as distinct.
   - It could be configured in the BOM upload request, at the project level, or globally. These options get progressively simpler (good!) and less flexible (bad). Arguably the per-request option can be made to behave like the per-project one by the client: use a consistent equality definition whenever you're uploading to a project.

In addition to purl/cpe/swid/name/version, I can see deviations in fields like these warranting separate components:

- pedigree
- call stack
- named properties
- occurrences

Option 4 feels safer to me, whilst still meeting the need to be able to represent multiple instances of the same component. It is more complex and subtle though.
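A minimal sketch of what a configurable equality check could look like, assuming components are plain dictionaries. `BASE_FIELDS`, `make_key`, and the `extra_fields` parameter are hypothetical names, not an existing Dependency-Track API.

```python
BASE_FIELDS = ("purl", "cpe", "swid", "name", "version")


def _freeze(value):
    """Convert nested dicts and lists into hashable, order-stable tuples."""
    if isinstance(value, dict):
        return tuple(sorted((k, _freeze(v)) for k, v in value.items()))
    if isinstance(value, list):
        return tuple(_freeze(v) for v in value)
    return value


def make_key(extra_fields=()):
    """Build a key function from the base fields plus any selected extras."""
    fields = BASE_FIELDS + tuple(extra_fields)

    def key(component):
        return tuple(_freeze(component.get(f)) for f in fields)

    return key


a = {"name": "lib-a", "version": "1.0", "purl": "pkg:npm/lib-a@1.0",
     "pedigree": {"ancestors": ["lib-a@0.9"]}, "properties": {"ts": "1"}}
b = {**a, "properties": {"ts": "2"}}                          # only a volatile property differs
c = {**a, "pedigree": {"ancestors": ["fork-of-lib-a@0.9"]}}   # pedigree differs

key = make_key(["pedigree"])
same_despite_property = key(a) == key(b)
distinct_by_pedigree = key(a) != key(c)
```

With `pedigree` selected, `a` and `c` are kept as separate components, while the volatile `ts` property does not split `a` from `b`.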

Are there other options I'm not thinking of?

nscuro commented 1 week ago

I think option 4 is going in the right direction: we need to find a minimal subset of component properties that can reliably and uniquely identify a component.

I'm not sure if giving too much choice to clients is a good idea though. Ideally we would identify one "approved" way of doing things and run with it. The more opportunities for variation we offer, the further people's experiences will drift apart. It will be challenging to support users if the de-duplication is too customizable, if that makes sense.

> In addition to purl/cpe/swid/name/version, I can see deviations in fields like these warranting separate components:
>
> - pedigree
> - call stack
> - named properties
> - occurrences

We definitely need to consider hashes as well. Probably also licenses.

RE occurrences: Consider that across project versions, the same component can appear in different places. Or additional occurrences can get added from one project version to the next. We wouldn't want the component to be recreated, just because it is imported from more locations. Call stack may have similar semantics.
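One way to read this, sketched under assumptions: occurrences stay out of the identity key and are simply refreshed on the existing record. The record shape (`id`, `triage`) and the function names are hypothetical.

```python
def identity(component):
    return (component.get("purl"), component.get("name"), component.get("version"))


def apply_upload(stored, incoming):
    """Match incoming components to stored records by identity.
    Matching records keep their id and triage; only occurrences are refreshed."""
    by_identity = {identity(r): r for r in stored}
    result = []
    for component in incoming:
        record = by_identity.get(identity(component))
        if record is None:
            # No match: this is a genuinely new component.
            record = {**component, "id": None, "triage": None}
        else:
            # Match: update occurrences in place instead of recreating.
            record = {**record, "occurrences": component.get("occurrences", [])}
        result.append(record)
    return result


stored = [{"id": 7, "triage": "NOT_AFFECTED", "name": "lib-a", "version": "1.0",
           "purl": "pkg:npm/lib-a@1.0", "occurrences": ["src/a.js"]}]
incoming = [{"name": "lib-a", "version": "1.0", "purl": "pkg:npm/lib-a@1.0",
             "occurrences": ["src/a.js", "src/b.js"]}]

updated = apply_upload(stored, incoming)
```

The component gains a second occurrence but keeps its record `id` and triage state, which is the behaviour described above.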
