anchore / syft

CLI tool and library for generating a Software Bill of Materials from container images and filesystems
Apache License 2.0

Syft cycloneDX: create sBOM data from source packages instead of binary packages (e.g. debian packages) ? #1700

Open ericbl opened 1 year ago

ericbl commented 1 year ago

tl;dr: could Syft offer an option to generate a CycloneDX SBOM for OS packages that considers only the source package (currently exposed in the upstream part of the purl and, for some package managers, in metadata:source with the version in metadata:sourceVersion) and not the binaries?

Hello, let's start with the business background: every piece of software delivered by our company needs a proper clearance of open source software (OSS clearing). Each team must generate an SBOM and get all software components analyzed on the shared SW360 platform: components must be properly identified and the source code provided. A dedicated team then goes through the source code to check the licenses.

Each team can use the tool of its choice to create the components on SW360. Some even do it manually. In our team, we create software that is eventually deployed as a container image (Docker for now): we use Debian bullseye slim as the base image, and our software can pull in further packages, either built from source or installed from a package manager (Debian, pip, npm, NuGet), depending on the language (Go, Python, Node.js, Ruby, C#, etc.). Therefore, in my team, we want to use Syft to generate a CycloneDX BOM and eventually transform it to get the components uploaded to our SW360.

Syft is already providing the list of licenses but this is unfortunately not considered (yet) in our process.

Considering Debian packages, the internal team dealing with Debian OS (let's call it DebT) insists on using only the source package and not the binary. DebT starts with the list of Debian components produced by this command:

  dpkg-query -f '${source:Package}|${source:Version}|${binary:Package}|${Version}\n' -W

DebT eventually keeps only ${source:Package}|${source:Version}
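As a rough illustration of that reduction, the sketch below parses output in the dpkg-query format shown above and keeps only the unique source name and version pairs. The function name `source_packages` and the second sample line (the `mount` binary package) are illustrative assumptions, not part of DebT's actual tooling.

```python
def source_packages(dpkg_output):
    """Return sorted, de-duplicated (source name, source version) pairs."""
    sources = set()
    for line in dpkg_output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Fields follow the format string: source name, source version,
        # binary name, binary version.
        src_name, src_version, _bin_name, _bin_version = line.split("|")
        sources.add((src_name, src_version))
    return sorted(sources)


# First line is taken from the thread; the second one is an illustrative
# assumption showing a second binary package built from the same source.
sample = (
    "util-linux|2.36.1-8+deb11u1|bsdutils|1:2.36.1-8+deb11u1\n"
    "util-linux|2.36.1-8+deb11u1|mount|2.36.1-8+deb11u1\n"
)
print(source_packages(sample))
```

Several binary packages collapse into one source entry, which is the view DebT works with.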

Currently, however, the syft command generates a CycloneDX BOM based on the binaries. The source is sometimes set as a metadata property and then attached to the upstream part of the purl. This is particularly true for libraries, which produces duplicate components next to the non-lib variant (e.g. curl and libcurl both pointing to the same source). I've seen this upstream= addition only for debian packages, not yet on other package providers.

This, however, creates a purl with an upstream qualifier that is not defined in the standard: https://github.com/package-url/purl-spec/blob/master/PURL-TYPES.rst#deb

Let's take a real example. This is one line from the dpkg-query output above.

util-linux|2.36.1-8+deb11u1|bsdutils|1:2.36.1-8+deb11u1

DebT is only interested in scanning the source files, so it considers this package as name: util-linux, version: 2.36.1-8+deb11u1

Syft generates the following in the CycloneDX SBOM:

  "bom-ref": "pkg:deb/debian/bsdutils@1:2.36.1-8+deb11u1?arch=amd64&upstream=util-linux%402.36.1-8+deb11u1&distro=debian-11&package-id=677e6ace24dce684",
  "type": "library",
  "publisher": "util-linux packagers <util-linux@packages.debian.org>",
  "name": "bsdutils",
  "version": "1:2.36.1-8+deb11u1",
  "cpe": "cpe:2.3:a:bsdutils:bsdutils:1\\:2.36.1-8\\+deb11u1:*:*:*:*:*:*:*",
  "purl": "pkg:deb/debian/bsdutils@1:2.36.1-8+deb11u1?arch=amd64&upstream=util-linux%402.36.1-8+deb11u1&distro=debian-11",
  "properties": [
    {
      "name": "syft:metadata:installedSize",
      "value": "394"
    },
    {
      "name": "syft:metadata:source",
      "value": "util-linux"
    },
    {
      "name": "syft:metadata:sourceVersion",
      "value": "2.36.1-8+deb11u1"
    }
  ]

Source and sourceVersion are set as properties, as well as in the upstream part. For us, the correct package data would be

  "name": "util-linux",
  "version": "2.36.1-8+deb11u1",
  "purl": "pkg:deb/debian/util-linux@2.36.1-8+deb11u1?arch=source"

(according to purl spec, arch should be set as source when we speak about the source package)

We are working on our own transformation of the Syft output, but I wonder if this would be better as a special output from Syft directly. What do you think?

gernot-h commented 1 year ago

I would be very interested in the background of the upstream qualifier. I couldn't find much about it besides the initial PR #769. Is this somehow aligned with other sBOM scanning tools, CycloneDX team etc? I guess this was a workaround for the restriction of CycloneDX to one package-url, right?

Note that I once requested CycloneDX support for specifying source information via externalReferences, but additional URLs don't allow source references to be specified in a unique way (think about mirrors, .zip vs. .tar.gz links, etc.).

So we now prefer the arch=source qualifier in Debian purls these days, here's the background discussion: https://github.com/package-url/purl-spec/pull/57.

cc: @wagoodman

ericbl commented 1 year ago

The following jq query does the job AFTER the syft scan of all layers, transforming the "binary based" CycloneDX to a "source based" CycloneDX:

   jq '"syft:metadata:source" as $srcName | "syft:metadata:sourceVersion" as $srcVersion 
        |.components[] 
        |= . + ( ( 
          .version as $componentVersion 
          | .properties//[] 
          | from_entries 
          | select(has($srcName)) 
          | (.[$srcVersion]//$componentVersion) as $version 
          | .[$srcName] as $name 
          | { $name, $version, purl: "pkg:deb/debian/\($name)@\($version)?arch=source" } 
        ) // {}
      )' syft_cyclonedx_bom.json > dx_bom_src.json

Thanks to StackOverflow for helping me implement this query!

wagoodman commented 1 year ago

@ericbl -- that's one heck of a jq command! (adding mental note to work on my jq chops... ). Let me see if I can answer a few questions.

I would be very interested in the background of the upstream qualifier.

We hesitated on adding this for a long time, specifically because the upstream param is out-of-spec, as pointed out. Using the pURL in this way has been very useful from a vulnerability matching point-of-view in grype, probably for the same reason that your internal DebT wants the SBOM results oriented with a source purl instead of being aligned with the binary: for vulnerability matching the source package matters most, since vulnerabilities tend to be written against the source package and not downstream packages.

Syft supports multiple SBOM formats, and the goal is to allow for grype to interop with these SBOMs in a way where vulnerability matching will not differ just because you've decided to use a different SBOM format. We explored multiple options for both SPDX and CycloneDX to express a source package clearly for the purposes of vulnerability matching but also wanted to ensure that it was clear to the SBOM consumer that these source packages were not found to be installed. At the time the methods we explored couldn't check all the boxes (the boxes were roughly: 1) be clear to the user what's being expressed, 2) be able to show what's installed vs upstream relationships, and 3) be interoperable with multiple formats).

Grype also supports being able to perform vulnerability matching when only specifying a pURL or set of pURLs. This, combined with the other efforts, made me lean towards adding an out-of-spec qualifier onto the pURL. upstream aligned nicely with multiple OS ecosystems that have these vulnerability matching requirements.

I've seen this upstream= addition only for debian packages, not yet on other package providers.

All OS catalogers tend to have this feature: https://github.com/search?q=repo%3Aanchore%2Fsyft%20PURLQualifierUpstream&type=code (alpm, apk, deb, rpm).

we now prefer the arch=source qualifier in Debian purls these days

Correct, no dispute here about the source qualifier 👍 I agree that using the source qualifier is the right thing to do when writing a pURL for a source package.

However, this did not fulfill the needs of what we're trying to convey, which is "here is the [binary] package we found, and this is the package which it came from (the source package)". A pURL representing the source package alone only answers half of what was needed, and providing multiple pURLs is confusing for something that should be used as an identity (so should be singular).
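The "found package plus the package it came from" reading of such a purl can be sketched with the standard library alone. This is a simplified illustration, not Syft's or grype's implementation: the helper name `split_identities` is an assumption, and it ignores subpaths and other parts of the purl grammar.

```python
from urllib.parse import unquote


def split_identities(purl):
    """Split a Syft-style purl into (found, origin) identities.

    found  -> (binary package name, binary package version)
    origin -> (source package name, source version or None), or None when
              the purl carries no upstream qualifier
    """
    path, _, query = purl.partition("?")
    # "pkg:deb/debian/bsdutils@1:2.36..." -> "bsdutils", "1:2.36..."
    name, _, version = path.rsplit("/", 1)[-1].partition("@")
    found = (unquote(name), unquote(version))

    origin = None
    for pair in query.split("&"):
        key, _, value = pair.partition("=")
        if key == "upstream":
            # unquote() rather than unquote_plus(): "+" is a literal
            # character in Debian versions and must not become a space.
            src_name, _, src_version = unquote(value).partition("@")
            origin = (src_name, src_version or None)
    return found, origin


deb = ("pkg:deb/debian/bsdutils@1:2.36.1-8+deb11u1"
       "?arch=amd64&upstream=util-linux%402.36.1-8+deb11u1&distro=debian-11")
print(split_identities(deb))
```

A single purl thus answers both questions, which a source-only purl with arch=source cannot do.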

ericbl commented 1 year ago

Thanks for your answer. OK, so you prefer the current purl because of the integration with Grype. I suggested a special option to generate a source-based SBOM, not a change to your default output :)

Another way would indeed be a 2nd purl, but as you pointed out, it should not be named "purl" since that one should be unique. But we could name it differently!

I found the CycloneDX spec a bit imprecise on this point: I did not find any rule requiring either a "source purl" or a "binary purl". Adding a 2nd purl in the CycloneDX spec could be an option...

ericbl commented 1 year ago

All OS catalogers tend to have this feature: https://github.com/search?q=repo%3Aanchore%2Fsyft%20PURLQualifierUpstream&type=code (alpm, apk, deb, rpm).

Thanks, I have only tried Debian, npm, Python, etc., but not yet other Linux distributions. I'll try it with Alpine / apk as soon as possible.

It means my jq command above is not correct and would need to be even more complex, with a regex to rebuild the purl!

gernot-h commented 1 year ago

I found the CycloneDX spec a bit imprecise on this point: I did not find any rule requiring either a "source purl" or a "binary purl". Adding a 2nd purl in the CycloneDX spec could be an option...

@ericbl, a while back, I raised a similar topic with the CycloneDX team. It was not about a source purl, but about adding a specific type for external source references. The CycloneDX team, however, argued that it's not easy or even possible to distinguish between "source" and "binary" references across all ecosystems: https://github.com/CycloneDX/specification/issues/98. I guess the same arguments would apply to source purls, so I wouldn't expect this to happen soon...

Also taking the point of @wagoodman into consideration, that an SBOM should express what is "installed" in an image, a "source purl" would somehow be inconsistent in the default SBOM.

But still, the feature requested by @ericbl here, adding a syft --upstream mode to produce an "upstream relationship SBOM", would be very helpful for us, and I think it wouldn't contradict the CycloneDX spec or the other goals of Syft.

ericbl commented 1 year ago

The extraction of the source purl differs between package managers. For instance, with Alpine, I just got this component (I removed some data irrelevant to the current discussion):

{
     "bom-ref": "pkg:apk/alpine/busybox-binsh@1.35.0-r29?arch=x86_64&upstream=busybox&distro=alpine-3.17.2&package-id=256fc96b4a8c4da8",
      "type": "library",
      "publisher": "Sören Tempel <soeren+alpine@soeren-tempel.net>",
      "name": "busybox-binsh",
      "version": "1.35.0-r29",
      "purl": "pkg:apk/alpine/busybox-binsh@1.35.0-r29?arch=x86_64&upstream=busybox&distro=alpine-3.17.2",
      "externalReferences": [
        {
          "url": "https://busybox.net/",
          "type": "distribution"
        }
      ],
      "properties": [
        {
          "name": "syft:package:foundBy",
          "value": "apkdb-cataloger"
        },
        {
          "name": "syft:package:metadataType",
          "value": "ApkMetadata"
        },
        {
          "name": "syft:package:type",
          "value": "apk"
        },
        {
          "name": "syft:metadata:originPackage",
          "value": "busybox"
        },
        {
          "name": "syft:metadata:size",
          "value": "1547"
        }
      ]
}

So the upstream part is built from "syft:metadata:originPackage" here, instead of from "syft:metadata:source" as with Debian.

This means my proposed jq command above is wrong: I should parse the upstream part of the purl, rather than rely on metadata that differs between package managers.
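To make the difference concrete, here is a sketch of what a property-based lookup would need: a table of per-cataloger property names. Only the two names seen in the samples in this thread are listed; whether other catalogers use further names is not covered here, so the table is an assumption, as is the helper name `source_identity`.

```python
# Property names observed in the Debian and Alpine samples in this thread.
# Other catalogers may use different names (assumption: this list is
# incomplete), which is what makes the approach fragile.
SOURCE_NAME_KEYS = ("syft:metadata:source", "syft:metadata:originPackage")
SOURCE_VERSION_KEY = "syft:metadata:sourceVersion"


def source_identity(component):
    """Return (source name, source version) or None if no source is recorded."""
    props = {p["name"]: p["value"] for p in component.get("properties", [])}
    for key in SOURCE_NAME_KEYS:
        if key in props:
            # Fall back to the component version when no source version is
            # recorded, as the jq query earlier in the thread does.
            return props[key], props.get(SOURCE_VERSION_KEY, component["version"])
    return None


debian = {
    "name": "bsdutils",
    "version": "1:2.36.1-8+deb11u1",
    "properties": [
        {"name": "syft:metadata:source", "value": "util-linux"},
        {"name": "syft:metadata:sourceVersion", "value": "2.36.1-8+deb11u1"},
    ],
}
alpine = {
    "name": "busybox-binsh",
    "version": "1.35.0-r29",
    "properties": [
        {"name": "syft:metadata:originPackage", "value": "busybox"},
    ],
}
print(source_identity(debian))
print(source_identity(alpine))
```

Every new package manager would need another entry in the table, whereas the upstream qualifier in the purl has the same shape everywhere.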

Having an 'upstream' mode as proposed by Gernot would help us a lot and save us from going crazy with jq :)

wagoodman commented 1 year ago

There are two paths forward:

These aren't mutually exclusive, so both in theory could be done, but I'm interested in hearing folks' thoughts on which might be more useful (or if there are any other ideas here).

ericbl commented 1 year ago

Your 2nd proposed path seems a bit more complex. And how could I then filter out the packages listing binary information that I am not interested in?

ericbl commented 1 year ago

As I wrote above, my jq query is specific to Debian and difficult to maintain. Therefore, I replaced it in our pipeline with the following Python script.

import argparse
import json

from packageurl import PackageURL


def transform_json(import_json, export_json):
    """Rewrite components that carry an 'upstream' purl qualifier as source packages."""
    with open(import_json, encoding="utf-8") as file:
        image_sbom = json.load(file)

    for comp in image_sbom.get('components', []):
        if 'purl' not in comp:
            continue
        # extract the purl items.
        syft_purl = PackageURL.from_string(comp['purl'])
        qualifiers = syft_purl.qualifiers or {}
        # look the qualifier up by key: a substring test on the raw purl
        # could also match a package whose name contains "upstream".
        upstream = qualifiers.get('upstream')
        if not upstream:
            continue
        # extract the name and version of the source package from the upstream qualifier.
        # example: "purl": "pkg:deb/debian/bsdutils@1:2.36.1-8+deb11u1?arch=amd64&upstream=util-linux%402.36.1-8+deb11u1&distro=debian-11"
        if "@" in upstream:
            name, version = upstream.split("@", 1)
        else:
            # no upstream version given (e.g. apk): fall back to the binary package version.
            name, version = upstream, syft_purl.version
        # build a source purl from the purl items; the namespace is optional in a purl.
        namespace = f"{syft_purl.namespace}/" if syft_purl.namespace else ""
        src_purl = f"pkg:{syft_purl.type}/{namespace}{name}@{version}?arch=source"
        # keep the distro qualifier when present.
        if 'distro' in qualifiers:
            src_purl += f"&distro={qualifiers['distro']}"
        # replace the purl
        # example: "purl": "pkg:deb/debian/util-linux@2.36.1-8+deb11u1?arch=source&distro=debian-11"
        comp['purl'] = src_purl
        # update component's name and version
        comp['name'] = name
        comp['version'] = version

    # write the output json
    with open(export_json, "w", encoding="utf-8") as file:
        json.dump(image_sbom, file, ensure_ascii=False)  # unicode output


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--import_json", type=str,
                        help="cycloneDX json-file out of syft", default="syft-cyclone-dx_sbom.json")
    # required: the previous default of "" could not be opened for writing.
    parser.add_argument("--export_json", type=str, required=True,
                        help="cycloneDX json-file out of current transformation")
    args = parser.parse_args()

    transform_json(args.import_json, args.export_json)
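
One consequence of the script above is that several binary components built from the same source end up with identical name, version and purl. A possible follow-up step, sketched here as an assumption rather than as part of the original pipeline, is to keep only the first component per purl. The helper name `dedupe_components` and the sample data are hypothetical.

```python
def dedupe_components(components):
    """Keep the first component seen for each purl.

    Components without a purl are kept unchanged, since there is no
    identity to compare them by.
    """
    seen = set()
    unique = []
    for comp in components:
        purl = comp.get("purl")
        if purl is not None:
            if purl in seen:
                continue
            seen.add(purl)
        unique.append(comp)
    return unique


# Hypothetical post-transformation components: two binaries that were
# rewritten to the same source package, plus one entry without a purl.
src = "pkg:deb/debian/util-linux@2.36.1-8+deb11u1?arch=source&distro=debian-11"
components = [
    {"name": "util-linux", "version": "2.36.1-8+deb11u1", "purl": src},
    {"name": "util-linux", "version": "2.36.1-8+deb11u1", "purl": src},
    {"name": "my-app", "version": "1.0"},
]
print(len(dedupe_components(components)))
```

Dropping components this way leaves any references to their bom-ref elsewhere in the BOM (for example in a dependencies section) dangling, so it only suits consumers that read the flat component list.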

My pipeline script is then:

    - /usr/local/bin/syft $SCAN_CONTAINER_IMAGE --scope all-layers -o cyclonedx-json=syft-cyclone-dx_sbom.json
    - python syft.transform_sbom-bin-purl-to-source-purl.py --import_json syft-cyclone-dx_sbom.json --export_json image_sbom.json

kzantow commented 1 year ago

There seem to be a couple of paths forward here. Although this isn't a priority at the moment, we've promoted it to our backlog; we welcome pull requests and would be happy to help.