Open wagoodman opened 1 year ago
Another example is org.springframework.security:spring-security-web:6.0.3. The fields bom-ref, purl and cpe are wrong, and the field group is missing (it should be set to org.springframework.security):
{
"bom-ref": "pkg:maven/spring-security-web/spring-security-web@6.0.3?package-id=26ff02e1092a0036",
"type": "library",
"name": "spring-security-web",
"version": "6.0.3",
"purl": "pkg:maven/spring-security-web/spring-security-web@6.0.3",
"cpe": "cpe:2.3:a:spring-security-web:spring-security-web:6.0.3:*:*:*:*:*:*:*",
"externalReferences": [
{
"url": "",
"hashes": [
{
"alg": "SHA-1",
"content": "7a2c26fd8e1c0f6709fbb034f76d3ef50ea5b929"
}
],
"type": "build-meta"
}
],
"properties": [
{
"name": "syft:package:foundBy",
"value": "java-cataloger"
},
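For comparison, the package URL convention for Maven puts the group ID in the namespace position (pkg:maven/<groupId>/<artifactId>@<version>), so the expected group and purl for this component can be sketched as below. The helper name maven_purl is made up for illustration, it skips the percent-encoding a real purl builder needs, and the CPE is left out because the correct vendor/product pair is not stated in this thread.

```python
def maven_purl(group_id: str, artifact_id: str, version: str) -> str:
    """Build a Maven package URL: pkg:maven/<groupId>/<artifactId>@<version>.

    Illustrative only: no percent-encoding or qualifier handling.
    """
    return f"pkg:maven/{group_id}/{artifact_id}@{version}"


# Coordinates taken from the example in this comment.
expected = {
    "group": "org.springframework.security",
    "name": "spring-security-web",
    "version": "6.0.3",
}
expected["purl"] = maven_purl(expected["group"], expected["name"], expected["version"])

print(expected["purl"])
```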
As I needed correct values so that the SBOM vulnerability scanner could correctly recognize the components, I ended up having to merge syft's SBOM file with Trivy's SBOM file, as Trivy is good at identifying Maven packages (but bad at OS-level packages for instance). Then it works great.
@wagoodman when syft is being used online, it might also be possible for us to compute a SHA of the artifact and use it to search Maven Central for the right group ID and artifact ID. (Happy to make a separate issue for this suggestion, but IMO it belongs here since the dataset is sort of an offline version of the maven central search, so they know about each other.)
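A rough sketch of that online lookup, assuming the Maven Central REST search endpoint accepts a SHA-1 checksum query of the form 1:"<sha1>" and returns documents with g, a and v fields (worth verifying against the current API docs before relying on it). The function names are hypothetical and this is not how syft does anything today.

```python
import hashlib
import json
import urllib.parse
import urllib.request

# Assumed endpoint for the Maven Central search API.
SEARCH_URL = "https://search.maven.org/solrsearch/select"


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_query_url(sha1_hex):
    """Build the checksum search URL (query syntax is an assumption)."""
    params = {"q": f'1:"{sha1_hex}"', "rows": "20", "wt": "json"}
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)


def parse_coordinates(payload):
    """Extract (groupId, artifactId, version) tuples from a decoded response."""
    docs = payload.get("response", {}).get("docs", [])
    return [(d["g"], d["a"], d["v"]) for d in docs]


def lookup_coordinates(jar_path, timeout=10.0):
    """Hash the jar, query the search endpoint, and return candidate coordinates."""
    url = build_query_url(sha1_of_file(jar_path))
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_coordinates(json.load(resp))
```

An empty list from lookup_coordinates would mean the digest is unknown to the index, in which case the cataloger would fall back to whatever it derives from the jar itself.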
Just a small update on options here: I've been able to take the maven index tooling and data and, with some modifications, output it to a sqlite DB containing the group id, artifact id, version, and sha1 for each jar in maven central. This is ~3.5 GB of raw data, but normalized in the DB and compressed for distribution (tar.zst) it is ~500 MB. Though not great, this also isn't terrible. It could be trimmed down further to include only "problematic artifacts", that is, those that do not properly package the artifact ID and group ID, which are exactly the cases where syft would benefit from the sha1 lookup for identification.
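To make the lookup concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names are assumptions rather than the actual schema of the DB described above, and the layout is flat instead of normalized to keep it short. The sample row reuses the digest and coordinates from the spring-security-web example earlier in the thread, on the assumption that the digest shown there is that jar's SHA-1.

```python
import sqlite3

# Hypothetical flat schema: one row per jar digest.
SCHEMA = """
CREATE TABLE IF NOT EXISTS artifacts (
    sha1        TEXT PRIMARY KEY,
    group_id    TEXT NOT NULL,
    artifact_id TEXT NOT NULL,
    version     TEXT NOT NULL
);
"""


def open_index(path=":memory:"):
    """Open (or create) the index database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def lookup_by_sha1(conn, sha1_hex):
    """Return (group_id, artifact_id, version) for a digest, or None if unknown."""
    return conn.execute(
        "SELECT group_id, artifact_id, version FROM artifacts WHERE sha1 = ?",
        (sha1_hex.lower(),),
    ).fetchone()


conn = open_index()
conn.execute(
    "INSERT INTO artifacts VALUES (?, ?, ?, ?)",
    (
        "7a2c26fd8e1c0f6709fbb034f76d3ef50ea5b929",
        "org.springframework.security",
        "spring-security-web",
        "6.0.3",
    ),
)

match = lookup_by_sha1(conn, "7A2C26FD8E1C0F6709FBB034F76D3EF50EA5B929")
print(match)
```

Using the digest as the primary key assumes one set of coordinates per jar; if identical bytes are ever published under more than one coordinate, the key would need to widen.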
One thing that I think is due before this can be picked up for work (beyond operationalizing the processing and hosting of the data for syft to download) is some thought about how common patterns like this could be in other ecosystems. That is, we might be able to augment package identification for more catalogers if we had small "side-car" datasets, either local or remote, maybe even ad hoc (one request per package). If so, could these patterns sit behind a single facade, so that catalogers do not have to worry about the request/retry/download/caching/cache-busting/storage logic and only deal with the business concern of "give me this metadata for a package"?
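One possible shape for such a facade, as a sketch only: the cataloger calls a single method, while retry and caching live behind it. Every name here (MavenCoordinates, MetadataFacade, coordinates_for) is hypothetical, the cache is in-memory with no cache-busting, and the fetcher is a stand-in for whatever local or remote side-car dataset ends up backing it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class MavenCoordinates:
    group_id: str
    artifact_id: str
    version: str


# A fetcher hides where the data lives (local DB, remote service, ...).
Fetcher = Callable[[str], Optional[MavenCoordinates]]


class MetadataFacade:
    """Single entry point for catalogers: retry and caching stay in here."""

    def __init__(self, fetch: Fetcher, max_attempts: int = 3):
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self._fetch = fetch
        self._max_attempts = max_attempts
        self._cache: Dict[str, Optional[MavenCoordinates]] = {}

    def coordinates_for(self, sha1_hex: str) -> Optional[MavenCoordinates]:
        key = sha1_hex.lower()
        if key in self._cache:
            return self._cache[key]
        last_error: Optional[OSError] = None
        for _ in range(self._max_attempts):
            try:
                result = self._fetch(key)
            except OSError as err:  # treat I/O errors as transient and retry
                last_error = err
                continue
            self._cache[key] = result
            return result
        assert last_error is not None
        raise last_error


# Demo with a fetcher that fails once, then succeeds.
calls = []


def flaky_fetch(sha1_hex):
    calls.append(sha1_hex)
    if len(calls) == 1:
        raise ConnectionError("simulated transient failure")
    return MavenCoordinates("org.springframework.security", "spring-security-web", "6.0.3")


facade = MetadataFacade(flaky_fetch)
first = facade.coordinates_for("7a2c26fd8e1c0f6709fbb034f76d3ef50ea5b929")
second = facade.coordinates_for("7A2C26FD8E1C0F6709FBB034F76D3EF50EA5B929")  # served from cache
```

The point of the design is that swapping the fetcher (sqlite file, HTTP lookup, per-ecosystem dataset) does not change what a cataloger sees.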
It would be ideal to have a sha1 to groupID, artifactID mapping for jars that do not have a pom.xml and are hosted on maven. This would help with the following issues:
Keeping a dataset for a mapping of all of maven might be prohibitive, but we should investigate doing this for a subset of maven artifacts that do not have a pom.xml in the jar.