Keep maven identification dataset for use by java cataloger

wagoodman commented 1 year ago

It would be ideal to have a mapping from sha1 to groupID and artifactID for jars that do not have a pom.xml and are hosted on maven. This would help with the following issues:

[x] https://github.com/anchore/syft/issues/2152
[ ] (I think there are more...)

Keeping a dataset for a mapping of all of maven might be prohibitive, but we should investigate doing this for a subset of maven artifacts that do not have a pom.xml in the jar.

elifarley commented 12 months ago

Another example is org.springframework.security:spring-security-web:6.0.3. The fields bom-ref, purl and cpe are wrong, and the field group is missing (it should be set to org.springframework.security):

    {
      "bom-ref": "pkg:maven/spring-security-web/spring-security-web@6.0.3?package-id=26ff02e1092a0036",
      "type": "library",
      "name": "spring-security-web",
      "version": "6.0.3",
      "purl": "pkg:maven/spring-security-web/spring-security-web@6.0.3",
      "cpe": "cpe:2.3:a:spring-security-web:spring-security-web:6.0.3:*:*:*:*:*:*:*",
      "externalReferences": [
        {
          "url": "",
          "hashes": [
            {
              "alg": "SHA-1",
              "content": "7a2c26fd8e1c0f6709fbb034f76d3ef50ea5b929"
            }
          ],
          "type": "build-meta"
        }
      ],
      "properties": [
        {
          "name": "syft:package:foundBy",
          "value": "java-cataloger"
        },

As I needed correct values so that the SBOM vulnerability scanner could correctly recognize the components, I ended up having to merge syft's SBOM file with Trivy's SBOM file, as Trivy is good at identifying Maven packages (but bad at OS-level packages for instance). Then it works great.
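For reference, here is a minimal sketch of what the purl would look like once the group ID is known, assuming the usual `pkg:maven/<group>/<artifact>@<version>` layout. The helper name `maven_purl` is hypothetical and not part of syft; the corrected cpe is deliberately left out because its vendor and product values cannot be derived from the coordinates alone.

```python
from urllib.parse import quote


def maven_purl(group_id: str, artifact_id: str, version: str) -> str:
    """Build a Maven package URL from group ID, artifact ID, and version.

    Hypothetical helper for illustration only. Each segment is
    percent-encoded so unusual characters cannot break the URL.
    """
    return "pkg:maven/{}/{}@{}".format(
        quote(group_id, safe=""),
        quote(artifact_id, safe=""),
        quote(version, safe=""),
    )


if __name__ == "__main__":
    print(maven_purl("org.springframework.security", "spring-security-web", "6.0.3"))
    # prints: pkg:maven/org.springframework.security/spring-security-web@6.0.3
```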

willmurphyscode commented 11 months ago

@wagoodman when syft is being used online, it might also be possible for us to compute a SHA of the artifact and use it to search Maven Central for the right group ID and artifact ID. (Happy to make a separate issue for this suggestion, but IMO it belongs here since the dataset is sort of an offline version of the maven central search, so they know about each other.)
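A rough sketch of that online lookup, under some assumptions: it uses the Maven Central search endpoint at `search.maven.org/solrsearch/select` with a `1:"<sha1>"` query, and it assumes the JSON response carries `g`, `a`, and `v` fields per document. The function names are made up for illustration, and the endpoint and response shape should be checked against the current Maven Central documentation before relying on them.

```python
import hashlib
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for Maven Central's search API.
SEARCH_URL = "https://search.maven.org/solrsearch/select"


def sha1_of_file(path, chunk_size=1 << 20):
    """Stream the file so large jars are not loaded into memory at once."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_query_url(sha1_hex):
    """Build the search URL for a checksum query (assumed query syntax)."""
    params = {"q": '1:"{}"'.format(sha1_hex), "rows": 20, "wt": "json"}
    return SEARCH_URL + "?" + urlencode(params)


def coordinates_from_response(payload):
    """Extract (group, artifact, version) tuples from a decoded response."""
    docs = payload.get("response", {}).get("docs", [])
    return [(d["g"], d["a"], d["v"]) for d in docs if {"g", "a", "v"} <= d.keys()]


def lookup_jar(path, timeout=10):
    """Hash a local jar and ask the remote index who it is (needs network)."""
    with urlopen(build_query_url(sha1_of_file(path)), timeout=timeout) as resp:
        return coordinates_from_response(json.load(resp))
```

Only `lookup_jar` touches the network, so the hashing and parsing pieces could be shared with an offline dataset lookup.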

wagoodman commented 8 months ago

Just a small update on options here: I've been able to take the maven index tooling and data and, with some modifications, output it to a sqlite DB comprising group id, artifact id, version, and sha1 for each jar in maven central. This is ~3.5 GB of raw data, but normalized in the DB and compressed for distribution (tar.zst) it is ~500MB. Though not great, this also isn't terrible. This could be trimmed down further to only include "problematic artifacts", that is, those which do not properly package the artifact ID and group ID in the jar. Those are the cases where syft would actually benefit from the sha1 lookup for identification.
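To make the shape of such a DB concrete, here is a minimal sketch of a normalized layout and a sha1 lookup. The table and column names are assumptions for illustration, not the actual schema of the generated DB. The sample row reuses the digest and coordinates from the SBOM excerpt quoted earlier in this thread, purely as example data.

```python
import sqlite3

# Hypothetical normalized schema: group and artifact names are stored once,
# and each jar row carries a 20-byte sha1 instead of a 40-character hex string.
SCHEMA = """
CREATE TABLE maven_groups (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE maven_artifacts (
    id       INTEGER PRIMARY KEY,
    group_id INTEGER NOT NULL REFERENCES maven_groups(id),
    name     TEXT NOT NULL,
    UNIQUE (group_id, name)
);
CREATE TABLE jars (
    sha1        BLOB NOT NULL,
    artifact_id INTEGER NOT NULL REFERENCES maven_artifacts(id),
    version     TEXT NOT NULL
);
CREATE INDEX jars_sha1 ON jars (sha1);
"""


def add_jar(conn, group, artifact, version, sha1_hex):
    """Insert one jar, creating the group and artifact rows if needed."""
    conn.execute("INSERT OR IGNORE INTO maven_groups (name) VALUES (?)", (group,))
    (gid,) = conn.execute(
        "SELECT id FROM maven_groups WHERE name = ?", (group,)
    ).fetchone()
    conn.execute(
        "INSERT OR IGNORE INTO maven_artifacts (group_id, name) VALUES (?, ?)",
        (gid, artifact),
    )
    (aid,) = conn.execute(
        "SELECT id FROM maven_artifacts WHERE group_id = ? AND name = ?",
        (gid, artifact),
    ).fetchone()
    conn.execute(
        "INSERT INTO jars (sha1, artifact_id, version) VALUES (?, ?, ?)",
        (bytes.fromhex(sha1_hex), aid, version),
    )


def lookup(conn, sha1_hex):
    """Return every (group, artifact, version) recorded for a sha1."""
    rows = conn.execute(
        """
        SELECT g.name, a.name, j.version
        FROM jars j
        JOIN maven_artifacts a ON a.id = j.artifact_id
        JOIN maven_groups g ON g.id = a.group_id
        WHERE j.sha1 = ?
        """,
        (bytes.fromhex(sha1_hex),),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    add_jar(
        conn,
        "org.springframework.security",
        "spring-security-web",
        "6.0.3",
        "7a2c26fd8e1c0f6709fbb034f76d3ef50ea5b929",
    )
    print(lookup(conn, "7a2c26fd8e1c0f6709fbb034f76d3ef50ea5b929"))
```

The lookup returns a list rather than a single row because identical jars can in principle be published under more than one set of coordinates.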

wagoodman commented 8 months ago

One thing that I think needs to be settled before this can be picked up for work (beyond the operationalization of processing and hosting the data for syft to download) is how common patterns like this could be in other ecosystems. That is, we might be able to augment package identification for more catalogers if we had small "side-car" datasets, either local or remote, maybe even ad-hoc (one request per package). If so, could these patterns sit behind a single facade, so that catalogers do not have to worry about the request/retry/download/caching/cache-busting/storage logic and only deal with the business concern of "give me this metadata for a package"?
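As a very rough sketch of what such a facade could look like (all names here are hypothetical, and the retry and cache-busting concerns are reduced to an in-memory cache and a simple fallback chain):

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass(frozen=True)
class MavenCoordinates:
    group_id: str
    artifact_id: str
    version: str


# A source is anything that maps a sha1 to coordinates: a local DB,
# a remote API, a per-package request, and so on.
Source = Callable[[str], Optional[MavenCoordinates]]


class SidecarLookup:
    """Single entry point for catalogers; hides source order and caching."""

    def __init__(self, sources: Sequence[Source]):
        self._sources = list(sources)
        self._cache: Dict[str, Optional[MavenCoordinates]] = {}

    def coordinates_for(self, sha1_hex: str) -> Optional[MavenCoordinates]:
        key = sha1_hex.lower()
        if key in self._cache:
            return self._cache[key]
        any_failed = False
        for source in self._sources:
            try:
                result = source(key)
            except OSError:
                # A broken source (e.g. network down) counts as a miss here.
                any_failed = True
                continue
            if result is not None:
                self._cache[key] = result
                return result
        # Only remember a miss if every source answered cleanly,
        # so a transient failure does not get cached as "unknown".
        if not any_failed:
            self._cache[key] = None
        return None
```

A cataloger would then only call `coordinates_for(sha1)`, while the order of sources (local dataset first, remote second) stays a configuration concern.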

anchore / syft

Keep maven identification dataset for use by java cataloger #2185