CycloneDX / specification

OWASP CycloneDX is a full-stack Bill of Materials (BOM) standard that provides advanced supply chain capabilities for cyber risk reduction. SBOM, SaaSBOM, HBOM, AI/ML-BOM, CBOM, OBOM, MBOM, VDR, and VEX
https://cyclonedx.org/
Apache License 2.0

externalReferences type for "source" packages #98

Closed gernot-h closed 8 months ago

gernot-h commented 2 years ago

Sorry if I overlooked something obvious, but I can't find a way to specify a source archive URL for a component, as the logical counterpart to the distribution type.

Many ecosystems have the concept of a source and a somehow derived package. In Python's PyPI you have a "wheel" and a "source" package (check https://pypi.org/project/chardet/#files), for Linux packages there are binary and corresponding source packages (check https://packages.debian.org/buster/libgcc1) etc.

Deriving the correct "source" package for a component isn't always straightforward, but it is important for many use cases (for example, for license clearing or for mapping source-level security advisories to binary components). So it would be very helpful to store them in a CycloneDX BOM in a canonical way. Therefore I suggest adding a source type for externalReferences.

Note that this is in most cases not equal to the "vcs" type (which is often some kind of upstream project) because many repositories provide their own source archive exactly reflecting what was used when building their "binary" packages.

Example:

      "name": "chardet",
      "version": "4.0.0",
      "externalReferences": [
        {
          "type": "distribution",
          "url": "https://files.pythonhosted.org/packages/19/c7/fa589626997dd07bd87d9269342ccb74b1720384a4d739a1872bd84fbe68/chardet-4.0.0-py2.py3-none-any.whl",
          "comment": "PyPI wheel file"
        },
        {
          "type": "source",
          "url": "https://files.pythonhosted.org/packages/ee/2d/9cdc2b527e127b4c9db64b86647d567985940ac3698eeabc7ffaccb4ea61/chardet-4.0.0.tar.gz",
          "comment": "PyPI source archive"
        },
        {
          "type": "vcs",
          "url": "https://github.com/chardet/chardet",
          "comment": "upstream repository"
        }
      ]

stevespringett commented 2 years ago

Distribution is intentionally not specific to binary, source, hybrid, or other. Multiple distributions can be specified for a component.

Take Maven for example. A single component may have multiple artifacts that are part of the distribution. https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/3.3.1/

In this case, there's artifacts for the:

It's not the intent to describe every possible artifact type for every ecosystem. I think if we start separating out the types of distributions, we'll create confusion as not all ecosystems are black and white (source and binary).

For ecosystems where the component is the source (e.g. Perl), there would be confusion about which type to use, as both distribution and source could be equally relevant. JavaScript (npm) could actually be a hybrid containing both source and binary, depending on the package.

In the Python example provided, it's easy enough to identify which distribution is the wheel and which one is not. In the Maven example, Maven has naming conventions so simple pattern matching against the distributions will tell you what they are. Other ecosystems may not be as predictable.

@coderpatros, @DarthHater what are your thoughts?

gernot-h commented 2 years ago

Ah, I see, so for my example above I should just use this today:

      "name": "chardet",
      "version": "4.0.0",
      "externalReferences": [
        {
          "type": "distribution",
          "url": "https://files.pythonhosted.org/packages/19/c7/fa589626997dd07bd87d9269342ccb74b1720384a4d739a1872bd84fbe68/chardet-4.0.0-py2.py3-none-any.whl",
          "comment": "PyPI wheel file"
        },
        {
          "type": "distribution",
          "url": "https://files.pythonhosted.org/packages/ee/2d/9cdc2b527e127b4c9db64b86647d567985940ac3698eeabc7ffaccb4ea61/chardet-4.0.0.tar.gz",
          "comment": "PyPI source archive"
        },
        {
          "type": "vcs",
          "url": "https://github.com/chardet/chardet",
          "comment": "upstream repository"
        }
      ]

And it would be the task of the application to either do pattern matching in the URL to differentiate between package types or use other means like application specific comment conventions.
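
The URL pattern matching mentioned above could look roughly like this. It is only a sketch assuming PyPI-style file names; the function name `classify_distribution` and the suffix lists are illustrative and not part of CycloneDX:

```python
from urllib.parse import urlparse

# Assumed, non-exhaustive file name conventions for PyPI downloads.
BINARY_SUFFIXES = (".whl", ".egg")
SOURCE_SUFFIXES = (".tar.gz", ".tar.bz2", ".zip")


def classify_distribution(url: str) -> str:
    """Guess whether a 'distribution' URL points to a binary or a source package."""
    # Only the last path segment (the file name) is inspected.
    filename = urlparse(url).path.rsplit("/", 1)[-1].lower()
    if filename.endswith(BINARY_SUFFIXES):
        return "binary"
    if filename.endswith(SOURCE_SUFFIXES):
        return "source"
    return "unknown"
```

Such a heuristic has to be maintained per ecosystem, and suffixes like `.zip` are ambiguous elsewhere, which is why a canonical type would be easier to consume.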

jkowalleck commented 1 year ago

@gernot-h is this still an open issue?

gernot-h commented 1 year ago

Thanks for asking! Yes, definitely. Within Siemens AG, we created a kind of downstream specification extending and narrowing down CycloneDX (parts of it are public in https://github.com/siemens/cyclonedx-property-taxonomy). As a workaround, we specify defined comment fields:

[Image omitted]

We would highly appreciate an interoperable upstream solution for this, so that BOM scanners can be extended to provide this information over time.

We btw also had a discussion whether a 2nd purl entry for stating source references might be needed as source urls are never unambiguous, but for now, we don't think it's a good idea.

jkowalleck commented 1 year ago

That VCS reference could point to the general VersionControlSystem of the project, while source could point to the actual source used for generating the component, which is not necessarily hosted in a VCS and is not intended to be distributed. But then there already is the idea of source distribution, which is a specific type of distribution, one that is intended to be used downstream.

Why would it be necessary to document the source of a component if it was not distributed from source in the first place? I still do not understand. Here is how I see this: if you had an SBOM for product A which has a component B, and B was built/assembled/compiled/generated/packed from some source, then B should provide a BOM for itself, describing the build process. Providing these capabilities is the goal of #31. There is no need for readers of A's BOM to know how (from which source) B was built or claimed to be built. All they need to know is where B came from (distribution), plus some hashes for integrity checks on B.

tsjensen commented 1 year ago

A VCS reference would not be sufficient even in cases where the source code is hosted in a public VCS, because we would want a reference to the sources for the particular version of the component, which is always a deep link. Example:

Determining this deep link to the correct sources can require specific knowledge of the source ecosystem. For example, it may be necessary to understand how Maven Central handles source archives, or what a Golang Proxy is.
Therefore, it would be great if the tool which has this knowledge (such as a CycloneDX scanner) could also record it in its output SBOM.

Currently, it can do so in an externalReferences section with type distribution:

"externalReferences": [
  {
    "type": "distribution",
    "url": "https://github.com/apache/commons-lang/archive/refs/tags/rel/commons-lang-3.12.0.zip",
    "comment": "source archive (download location)"
  }
]

While such an entry is correct, it is very difficult to consume. There can easily be multiple distribution entries - which one contains the source reference?
We currently work around this problem by using a defined comment string, but that is obviously a fragile construct which doesn't scale to partners and customers.

A type of source (or any other type which is clearly distinguished) would greatly improve our situation here.
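
To illustrate why the comment workaround is fragile, here is a minimal sketch of a consumer that relies on it. The prefix `"source archive"` is taken from the example above and is only a local convention, not something the specification defines; `find_source_reference` is a hypothetical helper:

```python
from typing import Optional

# Hypothetical comment convention; agreed locally, not defined by CycloneDX.
SOURCE_COMMENT_PREFIX = "source archive"


def find_source_reference(component: dict) -> Optional[str]:
    """Return the URL of the first 'distribution' reference whose comment marks it as source."""
    for ref in component.get("externalReferences", []):
        comment = ref.get("comment", "")
        if ref.get("type") == "distribution" and comment.startswith(SOURCE_COMMENT_PREFIX):
            return ref.get("url")
    # No reference follows the convention, so the source location stays unknown.
    return None
```

A producer that writes "PyPI source archive" instead of "source archive (download location)" already falls through this check.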

gernot-h commented 1 year ago

Looks like this topic was already picked up as a proposed enhancement, but let me still try to answer @jkowalleck's question about why the source of a component needs to be documented.

For our team, this is a compliance as well as maintenance topic. Think about providing a Linux firmware image with several hundred packages based on a certain Linux distribution. Or think about providing a vendored NPM/Ruby... bundle as part of an application download or product.

Now you need to not only provide a "binary" SBOM for your customer, but you also need to check the licenses of all the contained components internally. And you might also want to mirror a snapshot of the used source packages internally in case you need to patch your product/app 5 years from now. For all these topics, we need our BOMs to describe the sources which were used by a 3rd party to provide the binary packages we used. (For well-designed ecosystems like Python or Debian, the 3rd party provides this information, but each in a different way, and you want to import it all in a common format into a central place.) And we don't want to generate several hundred derived BOMs to describe how each of the integrated components was built.

I'm no security guy, but according to https://github.com/anchore/syft/issues/1700#issuecomment-1491967306, having the source information for a given "binary image BOM" is also valuable in vulnerability matching. That's why they invented their own proprietary extension to include this information by adding custom purl qualifiers, just as we specified Siemens-wide CycloneDX comment strings for source links.

We think this is relevant for many distribution use cases and we should have a common solution to express this information.

jkowalleck commented 1 year ago

Thank you very much for your insights. I have thought about the topic a lot lately. Here is what I came up with:

Distributions not only have a URL, but have other attributes, too:

There might be a lot of attributes related to a distribution, that might come in handy being documented. In case you are documenting distributions in a BOM, for me, it is most important to mark the one distribution that you actually used to build your product. I might not care about all the possible dists and sources, but I must know which one was actually used during build processes, so that I could reproduce and attest the build. Therefore, I would need a marker. (Would like to see an XML-constraint that allows only one of the distributions having this marker.)

Just some examples:

tsjensen commented 1 year ago

Don't overthink it though. I would only need one extra item in the list of possible types. That list was already extended from 16 values in 1.4 to 39 values in 1.5. Let's make it 40 values in 1.6 by adding:

I don't need to know any additional details. (Of course, then I won't be able to actually build the component given only the SBOM, but frankly, that will be a problem no matter how much metadata you encode into the SBOM.)

agschrei commented 1 year ago

I'm with @tsjensen on this. The latest spec revision already gives people plenty of options to choose from for specialized types of references. But the one that we are still missing for our needs is the reference to source code.

For us it is critical to not only have the information which specific distribution of a component is in use in an application, but also to reference the source it was generated from. This provenance information allows us to conduct additional analysis. For the scope of this analysis we do not need to have all the information to reproducibly build an artifact from source, a reference to the source itself is sufficient.

To provide a simple example: For a component describing a Maven package, I would expect a "distribution" reference describing the Maven repository layout the artifact came from and a "source" link that points to the GitHub release, VCS commit snapshot, or any other deep link to the code the artifact was built from. With the current options for the reference type, we have no way to clearly express both without resorting to comments.

jkowalleck commented 9 months ago

We discussed this topic in our last core working group meeting. It is still being considered for 1.6. We might use an alternative wording, something along the lines of "source-distribution". CC @stevespringett @coderpatros @DarthHater @CycloneDX/core-team // https://github.com/CycloneDX/specification/pull/269#issuecomment-1845834248

jkowalleck commented 8 months ago

fixed via #269
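
For consumers, a dedicated type reduces the lookup to a plain filter on `type`. The sketch below assumes the value introduced by #269 uses the "source-distribution" wording mentioned above; this thread does not confirm the final name, so check the released schema. `references_of_type` is a hypothetical helper, and the component reuses the chardet example from the first comment:

```python
def references_of_type(component: dict, ref_type: str) -> list:
    """Collect the URLs of all external references with the given type."""
    return [
        ref["url"]
        for ref in component.get("externalReferences", [])
        if ref.get("type") == ref_type and "url" in ref
    ]


# The chardet example, with the source archive moved to the assumed new type.
component = {
    "name": "chardet",
    "version": "4.0.0",
    "externalReferences": [
        {
            "type": "distribution",
            "url": "https://files.pythonhosted.org/packages/19/c7/fa589626997dd07bd87d9269342ccb74b1720384a4d739a1872bd84fbe68/chardet-4.0.0-py2.py3-none-any.whl",
        },
        {
            "type": "source-distribution",  # assumed type name, see note above
            "url": "https://files.pythonhosted.org/packages/ee/2d/9cdc2b527e127b4c9db64b86647d567985940ac3698eeabc7ffaccb4ea61/chardet-4.0.0.tar.gz",
        },
        {"type": "vcs", "url": "https://github.com/chardet/chardet"},
    ],
}

source_urls = references_of_type(component, "source-distribution")
```

No URL pattern matching or comment convention is needed in this case.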