Open mykter opened 1 week ago
Valid request. And it will be even more valid once we support additional metadata such as occurrences.
The identity-based de-duplication has always been there, but I think with the recent refactoring of BOM processing, as well as the introduction of component property support, it's now more obvious.
De-duplication is a major concern for users who merge multiple BOMs prior to upload - most merge tools don't pay attention to duplicates, so it's up to DT to resolve that. There are also BOM generators out there that will produce duplicate component records for monorepos, or multi-module projects.
That being said, even in those cases, I'd expect properties outside of the core identity to match as well. So I'm inclined to say we should be able to just switch to full equality and be done with it.
If we need to maintain multiple ways, we could just make it a flag in the BOM upload request, defaulting to identity-based de-duplication.
What could be problematic are BOM generators that yield non-reproducible outputs. For example if they put timestamps or otherwise dynamic data in properties. In that case you'll get lots of churn whenever you re-upload BOMs to existing projects.
> most merge tools don't pay attention to duplicates, so it's up to DT to resolve that. There are also BOM generators out there that will produce duplicate component records for monorepos, or multi-module projects.
I think there's a reasonable argument that it's up to the BOM producers to resolve that, not DT. Being able to say the BOM is the source of truth is a powerful simplifier, both for users and developers.
> So I'm inclined to say we should be able to just switch to full equality and be done with it.
Sounds good! Would you be open to PRs to implement this? Would we need it behind an experimental flag?
> Would you be open to PRs to implement this?
Most certainly.
> Would we need it behind an experimental flag?
I think that would be good.
We can still decide to remove the flag later if we deem it unnecessary, but initially we should assume that there will be noticeable differences that users will need to "opt in" to.
I've been thinking about this some more and came up with a potential problem. Let's say you upgrade your BOM generator, and it adds a new metadata field as a property. I don't think anyone would expect this to cause a problem, but if we were using strict component equality then every vulnerability and policy violation would disappear and be recreated afresh the first time this new BOM was uploaded, with no triage status or notes etc.
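To make the churn concern concrete, here is a minimal sketch in Python. It is not Dependency-Track's actual code: the `Component` type, the key functions, and the `generator:scope` property are all hypothetical, chosen only to show how a single new property changes a strict-equality key while leaving the identity key untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    # Hypothetical, simplified component model for illustration only.
    purl: str
    name: str
    version: str
    properties: frozenset = frozenset()  # set of (name, value) pairs


def identity_key(c):
    # Identity-based matching: properties are ignored.
    return (c.purl, c.name, c.version)


def strict_key(c):
    # Strict equality: every field, including properties, must match.
    return (c.purl, c.name, c.version, c.properties)


def diff(existing, uploaded, key):
    # Count components that would be deleted, created, or kept on re-upload.
    old = {key(c) for c in existing}
    new = {key(c) for c in uploaded}
    return {"deleted": len(old - new), "created": len(new - old), "kept": len(old & new)}


purl = "pkg:maven/org.example/lib-a@1.0"
before = [Component(purl, "lib-a", "1.0")]
# Same component after a generator upgrade that adds one property.
after = [Component(purl, "lib-a", "1.0", frozenset({("generator:scope", "runtime")}))]

identity_result = diff(before, after, identity_key)
strict_result = diff(before, after, strict_key)
```

Under these assumptions `identity_result` reports the component as kept, while `strict_result` reports one deletion and one creation, which is the point at which triage state attached to the old record would be lost.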
So on either extreme we have:
Middle grounds I can think of:
In addition to purl/cpe/swid/name/version, I can see deviations in fields like these warranting separate components:
Option 4 feels safer to me, whilst still meeting the need to be able to represent multiple instances of the same component. It is more complex and subtle though.
Are there other options I'm not thinking of?
I think option 4 is going in the right direction: we need to find a minimal subset of component properties that can reliably and uniquely identify a component.
I'm not sure if giving too much choice to clients is a good idea though. Ideally we would identify one "approved" way of doing things and run with it. The more opportunities for variation we offer, the further people's experiences will drift apart. It will be challenging to support users if the de-duplication is too customizable, if that makes sense.
> In addition to purl/cpe/swid/name/version, I can see deviations in fields like these warranting separate components:
> - pedigree
> - call stack
> - named properties
> - occurrences
We definitely need to consider hashes as well. Probably also licenses.
RE occurrences: Consider that across project versions, the same component can appear in different places. Or additional occurrences can get added from one project version to the next. We wouldn't want the component to be recreated, just because it is imported from more locations. Call stack may have similar semantics.
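As a rough illustration of that distinction, the sketch below builds a merge key from the identity fields plus hashes while deliberately leaving occurrences out. The `Component` type, the `merge_key` function, and the exact choice of fields are assumptions for the example, not a settled design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    # Hypothetical, simplified component model for illustration only.
    purl: str
    name: str
    version: str
    hashes: frozenset = frozenset()       # e.g. {("SHA-256", "<hex digest>")}
    occurrences: frozenset = frozenset()  # file locations; may grow between versions


def merge_key(c):
    # Assumed subset: identity fields plus hashes. Occurrences are excluded,
    # so a component imported from more locations keeps the same key.
    return (c.purl, c.name, c.version, c.hashes)


purl = "pkg:maven/org.example/lib-a@1.0"
sha = frozenset({("SHA-256", "aa11")})

v1 = Component(purl, "lib-a", "1.0", sha, frozenset({"/app/lib-a.jar"}))
v2 = Component(purl, "lib-a", "1.0", sha, frozenset({"/app/lib-a.jar", "/plugins/lib-a.jar"}))
repackaged = Component(purl, "lib-a", "1.0", frozenset({("SHA-256", "bb22")}))

same_component = merge_key(v1) == merge_key(v2)
different_artifact = merge_key(v1) != merge_key(repackaged)
```

With this key, an extra occurrence does not cause the component to be recreated, whereas a differing hash yields a separate component. Whether licenses, pedigree, or named properties belong in the key is left open here, as in the discussion.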
Current Behavior
When uploading a BOM, components are merged based on their identity as defined here. Broadly speaking, if there are multiple components in the BOM with the same ID fields (PURL, CPE, name, version, etc), only one of them will be saved in the database.
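The effect can be sketched as follows. This is a simplified Python illustration, not the actual Dependency-Track implementation: the `Component` type, the `identity` function, and the `module` property are hypothetical, and the real identity definition covers more fields than shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Component:
    # Hypothetical, simplified component model for illustration only.
    name: str
    version: str
    purl: Optional[str] = None
    cpe: Optional[str] = None
    properties: tuple = ()  # (name, value) pairs; not part of the identity


def identity(c):
    # Assumed identity fields, loosely following the description above.
    return (c.purl, c.cpe, c.name, c.version)


def dedupe_by_identity(components):
    # Keep the first component seen for each identity; later ones are dropped.
    seen = {}
    for c in components:
        seen.setdefault(identity(c), c)
    return list(seen.values())


purl = "pkg:maven/org.example/lib-a@1.0"
bom = [
    Component("lib-a", "1.0", purl=purl, properties=(("module", "service-x"),)),
    Component("lib-a", "1.0", purl=purl, properties=(("module", "service-y"),)),
]
merged = dedupe_by_identity(bom)
```

Here `merged` contains a single `lib-a` record carrying only the `service-x` property, so the information that `service-y` also depends on it is lost.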
Proposed Behavior
Some BOMs contain multiple components with the same identity, but differing properties for these components. The BOM may contain components from many different projects, some of which might include the same dependencies.
In this scenario we don't just want to know that we depend on component `A`, or that component `A` has a vulnerability - we want to know which projects depend on `A`. We can only easily do that if every component is present in Dependency-Track as it is in the BOM.

Somehow we would like to be able to upload a BOM, and know that after the upload, the project in Dependency-Track will exactly mirror the contents of the BOM. Options include:
dependency-track / component-merge / strict
Option 3a seems like a reasonable solution to me?
Checklist