anchore / syft

CLI tool and library for generating a Software Bill of Materials from container images and filesystems
Apache License 2.0

Clarify group ID and artifact ID from maven central when pom is missing #3127

Open tetzla opened 2 months ago

tetzla commented 2 months ago

What happened: With version 2.0.0, the dom4j component was relocated from group ID dom4j to group ID org.dom4j:

Syft generates SBOMs with swapped group ids.

SBOM

Subsequent tools processing the SBOM have problems identifying the components correctly.

What you expected to happen: The group IDs should be correct.

Steps to reproduce the issue: Create an SBOM for a bundle including dom4j 1.6.1 and dom4j 2.1.3.

Anything else we need to know?: -

Environment:

douglasclarke commented 1 month ago

I was looking at an issue with dom4j as well this week.

I believe the issue here is that there is no identifying metadata in MANIFEST.MF and there is no packaged pom metadata either. I believe the cataloger in this case just uses the base jar name for both the group and the artifact.
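A minimal sketch of the fallback described above, assuming the only available input is the file name. This is not syft's actual cataloger code; `guess_coordinates`, `to_purl`, and the file-name pattern are hypothetical and only illustrate why the jar name ends up as both group and artifact.

```python
import re
from pathlib import PurePosixPath

# Hypothetical pattern: "<name>-<version>", where the version starts with a digit.
_JAR_STEM = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.\-]*)$")


def guess_coordinates(jar_path):
    """Guess Maven coordinates from the jar file name alone (illustrative only)."""
    stem = PurePosixPath(jar_path).stem  # "dom4j-2.1.3.jar" -> "dom4j-2.1.3"
    match = _JAR_STEM.match(stem)
    if match is None:
        # No recognizable version suffix: all we have is the name.
        return {"group": stem, "artifact": stem, "version": None}
    name = match.group("name")
    # Without manifest or pom metadata, the name is reused as the group ID.
    return {"group": name, "artifact": name, "version": match.group("version")}


def to_purl(coords):
    """Render the guessed coordinates as a Maven package URL."""
    purl = f"pkg:maven/{coords['group']}/{coords['artifact']}"
    if coords["version"] is not None:
        purl += f"@{coords['version']}"
    return purl
```

With this guess, a jar that carries no metadata always gets the same value for group and artifact, no matter which group it was actually published under.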

Without very specific dom4j rules in the cataloger, the only reliable way to identify the specific jar might be with checksums, and it is unclear to me how that would fit into the syft cataloger approach.
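To make the checksum idea concrete, here is a rough sketch of identifying a jar by its SHA-1 digest. The `KNOWN_JARS` table and both function names are hypothetical. The two digests are the ones syft reports for these jars further down in this thread, and the group IDs follow the relocation described in the original report.

```python
import hashlib

# Hypothetical curated table: SHA-1 digest -> (group ID, artifact ID, version).
KNOWN_JARS = {
    "5d3ccc056b6f056dbf0dddfdf43894b9065a8f94": ("dom4j", "dom4j", "1.6.1"),
    "a75914155a9f5808963170ec20653668a2ffd2fd": ("org.dom4j", "dom4j", "2.1.3"),
}


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def identify(sha1):
    """Look up a digest in the curated table; None means the jar is unknown."""
    return KNOWN_JARS.get(sha1.strip().lower())
```

A lookup like this only helps for jars someone has already added to the table, which is where the question of how to distribute such data comes in.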

kzantow commented 1 month ago

I wonder if there is a way to identify these by classpaths or some sort of classpath signature.

We have also discussed having some sort of data feed for Syft in the past. I think it's a great idea -- we have a number of capabilities in Syft now that use the network to resolve information, like Maven poms. It seems like a logical leap to be able to provide Syft with some curated data, such as file hashes for identification of certain artifacts when there is little else to go by.

Having recently gone through a bit of instability with the Grype database downloads, there are some aspects we would need to think through: if we end up implementing this, how do we provide the information reliably to end users? Could we have a data set small enough to download as a database every day? Probably not: some experiments seem to indicate that Maven Central alone would end up well over 1 GB compressed, and we would need a lot more for things like binary executables. (I'd be interested in running Syft on every JAR in Maven Central and only including entries for the ones Syft misidentifies; I'm not sure anyone has run this experiment.) We have managed to keep the Grype database reasonably small, under 200 MB, but I don't think we could keep a Syft data set similarly small, which means we would have to provide some kind of "API". I think we could probably just serve static files under a well-known URL scheme, which are easily cached by a CDN and very small, so data transfer stays at a minimum. The point being that, given the popularity of our tools, we can't just drop some files somewhere and expect them to solve everything without some planning.
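One way the static-file idea could look, purely as a sketch: each digest maps to one small JSON file at a predictable path, sharded by the first hex characters so that no directory grows too large. The base URL, the path layout, and the `lookup_url` name are all assumptions; nothing like this exists today.

```python
# Placeholder host, not a real service.
BASE_URL = "https://example.invalid/syft-data/v1"

_HEX_DIGITS = set("0123456789abcdef")


def lookup_url(sha1, base_url=BASE_URL):
    """Build the URL of the static file that would describe one artifact."""
    digest = sha1.strip().lower()
    if len(digest) != 40 or not set(digest) <= _HEX_DIGITS:
        raise ValueError(f"not a SHA-1 hex digest: {sha1!r}")
    # Two levels of sharding: <base>/sha1/5d/3c/<full digest>.json
    return f"{base_url}/sha1/{digest[:2]}/{digest[2:4]}/{digest}.json"
```

Because the URL depends only on the digest, a CDN can cache each file independently, and a client fetches just the handful of entries it cannot resolve locally instead of a full database.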

kzantow commented 1 week ago

Does anyone know if there are some specific public images we could use to reproduce this behavior?

wagoodman commented 1 week ago

Given both jars:

.
├── dom4j-1.6.1.jar
└── dom4j-2.1.3.jar

running syft against this directory yields:

{
  "id": "57073a041ff5db91",
  "name": "dom4j",
  "version": "1.6.1",
  "type": "java-archive",
  "foundBy": "java-archive-cataloger",
  "locations": [
    {
      "path": "/dom4j-1.6.1.jar",
      "accessPath": "/dom4j-1.6.1.jar",
      "annotations": {
        "evidence": "primary"
      }
    }
  ],
  "licenses": [],
  "language": "java",
  "cpes": [
    {
      "cpe": "cpe:2.3:a:metastuff-ltd-:dom4j:1.6.1:*:*:*:*:*:*:*",
      "source": "syft-generated"
    },
    {
      "cpe": "cpe:2.3:a:metastuff_ltd_:dom4j:1.6.1:*:*:*:*:*:*:*",
      "source": "syft-generated"
    },
    {
      "cpe": "cpe:2.3:a:org.dom4j:dom4j:1.6.1:*:*:*:*:*:*:*",
      "source": "syft-generated"
    },
    {
      "cpe": "cpe:2.3:a:dom4j:dom4j:1.6.1:*:*:*:*:*:*:*",
      "source": "syft-generated"
    }
  ],
  "purl": "pkg:maven/org.dom4j/dom4j@1.6.1",
  "metadataType": "java-archive",
  "metadata": {
    "virtualPath": "/dom4j-1.6.1.jar",
    "manifest": {
      "main": [
        {
          "key": "Manifest-Version",
          "value": "1.0"
        },
        {
          "key": "Ant-Version",
          "value": "Apache Ant 1.5.3"
        },
        {
          "key": "Created-By",
          "value": "Apache Maven"
        },
        {
          "key": "Built-By",
          "value": "Maarten"
        },
        {
          "key": "Package",
          "value": "org.dom4j"
        },
        {
          "key": "Build-Jdk",
          "value": "1.4.2_02"
        },
        {
          "key": "Extension-Name",
          "value": "dom4j"
        },
        {
          "key": "Specification-Title",
          "value": "dom4j : XML framework for Java"
        },
        {
          "key": "Specification-Vendor",
          "value": "MetaStuff Ltd."
        },
        {
          "key": "Implementation-Title",
          "value": "org.dom4j"
        },
        {
          "key": "Implementation-Vendor",
          "value": "MetaStuff Ltd."
        },
        {
          "key": "Implementation-Version",
          "value": "1.6.1"
        }
      ]
    },
    "digest": [
      {
        "algorithm": "sha1",
        "value": "5d3ccc056b6f056dbf0dddfdf43894b9065a8f94"
      }
    ]
  }
}
{
  "id": "c8663631d4b90329",
  "name": "dom4j",
  "version": "2.1.3",
  "type": "java-archive",
  "foundBy": "java-archive-cataloger",
  "locations": [
    {
      "path": "/dom4j-2.1.3.jar",
      "accessPath": "/dom4j-2.1.3.jar",
      "annotations": {
        "evidence": "primary"
      }
    }
  ],
  "licenses": [],
  "language": "java",
  "cpes": [
    {
      "cpe": "cpe:2.3:a:dom4j:dom4j:2.1.3:*:*:*:*:*:*:*",
      "source": "syft-generated"
    }
  ],
  "purl": "pkg:maven/dom4j/dom4j@2.1.3",
  "metadataType": "java-archive",
  "metadata": {
    "virtualPath": "/dom4j-2.1.3.jar",
    "manifest": {
      "main": [
        {
          "key": "Manifest-Version",
          "value": "1.0"
        }
      ]
    },
    "digest": [
      {
        "algorithm": "sha1",
        "value": "a75914155a9f5808963170ec20653668a2ffd2fd"
      }
    ]
  }
}

What it looks like is happening is:

From syft's perspective, which is to gather this information without an online lookup, this information is as accurate as it could be.

That being said, we've recently added online enrichment capabilities, and a search of the sha1 hash against Maven Central could be one that we add (and is in fact one of the examples in #1115).
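For reference, a rough sketch of what such a lookup could look like outside of syft. It assumes the public Maven Central search endpoint at search.maven.org, which accepts a SHA-1 query of the form `1:"<digest>"` and returns documents with `g`, `a`, and `v` fields. The function names are hypothetical, and this is not how syft implements enrichment.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the public Maven Central search service.
SEARCH_URL = "https://search.maven.org/solrsearch/select"


def build_query_url(sha1):
    """Build a search URL that asks for artifacts with the given SHA-1."""
    params = {"q": f'1:"{sha1}"', "rows": "20", "wt": "json"}
    return f"{SEARCH_URL}?{urllib.parse.urlencode(params)}"


def purls_from_response(payload):
    """Turn a decoded search response into Maven package URLs."""
    docs = payload.get("response", {}).get("docs", [])
    return [
        f"pkg:maven/{doc['g']}/{doc['a']}@{doc['v']}"
        for doc in docs
        if all(key in doc for key in ("g", "a", "v"))
    ]


def lookup_by_sha1(sha1, timeout=10.0):
    """Query the search service (requires network access)."""
    with urllib.request.urlopen(build_query_url(sha1), timeout=timeout) as response:
        return purls_from_response(json.load(response))
```

Keeping the URL building and response parsing separate from the network call means both can be checked offline, and the lookup itself stays optional for users who run without network access.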

wagoodman commented 1 week ago

I'm going to repurpose this issue to really be about enhancing syft to be able to reach out to maven central with a sha1 digest to clarify any missing group ID and artifact ID.