bazelbuild / bazel-gazelle

Gazelle is a Bazel build file generator for Bazel projects. It natively supports Go and protobuf, and it may be extended to support new languages and custom rule sets.
Apache License 2.0

go_repository should emit version information for auditing purposes #1238

Open sluongng opened 2 years ago

sluongng commented 2 years ago

Background

Software supply chain security has been a growing pain in recent years.

Typically, an organization would want to have the ability to scan through all of their external dependencies and check for malicious versions where possible.

This information can then be fed through a security scanner to validate that there is no malicious dependency. Furthermore, it can be used to build up a Software Bill of Materials (SBOM) for downstream consumption.

Problem

Right now, go_repository does not emit any clue about when or where a repo came from once it has been downloaded. This makes it hard to consume dependency versions programmatically and to build automation and tests on top of them.

Today, there are some alternative approaches that get you close:

Bazel query

bazel query 'kind("go_repository rule", //external:*)' --output=xml

Bazel experimental_repository_resolved_file

bazel sync --experimental_repository_resolved_file=temp.bzl

Both of these approaches can be too verbose and are hard to consume programmatically as part of the Bazel build graph.

Proposed solution

Provide an option create_version_file in go_repository, defaulting to false. When this option is true, go_repository should create a GO_REPOSITORY_VERSION.bzl file at the root level of the repository.

This version file should follow this template:

NAME=<repository-name>

URL=<download-url>

COMMIT=<commit-sha>
TAG=<tag-name>

VERSION=<version>

SUM=<check-sum>

The file must provide exactly one of url, commit/tag, or version, according to how the go_repository target was downloaded.
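To make the consumption side concrete, here is a minimal Go sketch of a reader for a file in the KEY=VALUE shape of the template above. It is only an illustration under assumptions: the function names parseVersionFile and validateSource are hypothetical, and the "exactly one download mode" check is one reading of the rule stated above, not something go_repository implements today.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

// parseVersionFile reads KEY=VALUE lines into a map, skipping blank lines.
// The split happens on the first "=", so values may themselves contain "=".
func parseVersionFile(content string) (map[string]string, error) {
	fields := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue
		}
		key, value, found := strings.Cut(line, "=")
		if !found {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		fields[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return fields, scanner.Err()
}

// validateSource checks that exactly one download mode is recorded:
// URL, COMMIT/TAG, or VERSION (assumed interpretation of the proposal).
func validateSource(fields map[string]string) error {
	modes := 0
	if fields["URL"] != "" {
		modes++
	}
	if fields["COMMIT"] != "" || fields["TAG"] != "" {
		modes++
	}
	if fields["VERSION"] != "" {
		modes++
	}
	if modes != 1 {
		return errors.New("expected exactly one of URL, COMMIT/TAG, or VERSION")
	}
	return nil
}

func main() {
	content := "NAME=org_golang_x_crypto\n" +
		"VERSION=v0.0.0-20220411220226-7b82a4e95df4\n" +
		"SUM=h1:kUhD7nTDoI3fVd9G4ORWrbV5NY0liEs/Jg2pv5f+bBA=\n"
	fields, err := parseVersionFile(content)
	if err != nil {
		panic(err)
	}
	if err := validateSource(fields); err != nil {
		panic(err)
	}
	fmt.Println(fields["NAME"], fields["VERSION"])
}
```

A scanner or SBOM generator could call these two functions on every version file it collects and fail the build when a file is malformed.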

This file can then be consumed via various macros/aspect rules to test and generate SBOM for each of the go_binary releases.

achew22 commented 2 years ago

Presumably this could be made to work with the debug/buildinfo package that is new in Go 1.18. Is this a step in that direction? I think we will need to create some kind of compile-time value to inject that meets this specification https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/runtime/debug/mod_test.go;l=26, but I haven't really looked into it at all.

sluongng commented 2 years ago

@achew22 no, buildinfo is compile-time build stamping and can be used to identify which version of the current workspace I am building from. I am trying to identify the versions of all the dependencies that the go_binary is built with.

I think for this issue, I am looking for something that can be run or extracted without needing to stamp the binary.

Let's say I have an awesome, popular CLI named foo-cli. For each release, I would want to publish the pre-compiled binary of this CLI for each OS/architecture, along with the sha256sum(s) of the binaries.

With the new standards, some downstream consumer of this CLI would ask: hey, what dependencies are you using to build this CLI? Are they secure? Now I would want to attach an SBOM to the release to assure the vendor that all deps are secure.

So the SBOM could either be embedded into the binary or generated in parallel with the binary. My current plan is to generate a @go_package_repository//:version target for each go_repository repo target. Then create a custom go_binary wrapper macro as follows:

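def my_go_binary(name, deps = [], **kwargs):
    # extract_version_files and sbom are hypothetical helpers: the first maps
    # each dep to the :version target of its go_repository, the second
    # collects those files into a single SBOM output.
    version_files = extract_version_files(deps)
    sbom(
        name = name + "_sbom",
        deps = version_files,
    )
    go_binary(
        name = name,
        deps = deps,
        # Embed the generated SBOM so the binary can print it at runtime.
        embedsrcs = [":" + name + "_sbom"],
        **kwargs
    )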

Now, with the release of the binary, I can attach the SBOM file to the release artifacts. Additionally, the user may have the option of running foo-cli sbom to get the SBOM content on the fly.

achew22 commented 2 years ago

If you look at the definition of debug.BuildInfo it has a field

Deps      []*Module      // Module dependencies

which holds all the module dependencies. The definition of Module is:

type Module struct {
    Path    string  // module path
    Version string  // module version
    Sum     string  // checksum
    Replace *Module // replaced by this module
}

which I believe contains all of the information you're describing, and is the Go language's standard way of describing compiled dependencies in a programmatically accessible way. There are already a number of tools that take this data and generate SBOMs from it. I'm pretty reluctant to deviate too far from Go's standard tooling in order to achieve this.

Possibly I'm misunderstanding something. Can you help me understand it a little better?

sluongng commented 2 years ago

Oh wow, I totally misread that. Yes, you are totally right that we can collect the info and stamp that struct.

sluongng commented 2 years ago

Thinking a bit more about this topic: it seems like we would need to implement this in 4 stages:

  1. go_repository generates a file target which contains the buildinfo of a certain module
  2. modify go_library to optionally depend on the buildinfo file target of go_repository
  3. modify go_library to propagate a BuildInfo provider to downstream dependents
  4. create a rule which can be used with go_binary in a macro to consume the BuildInfo provider and build an SBOM

adam-azarchs commented 2 years ago

Using buildinfo is all well and good for a project that's exclusively using Go, but frankly if your project is exclusively using Go then why are you bothering with bazel? For multilingual projects, it's helpful to have a more general solution.

What we use internally for most things is bazel's license tracking system, for which we use the (documented but unimplemented in the public version of those rules) additional_info metadata to add version information. That works for all of our vendored dependencies, as well as everything with manually-added BUILD files (e.g. http_archive) or repository rules we control; go_repository is our main blind spot there. This admittedly veers a bit into the territory of "separate feature request" (and I'll open one for that), but as license compliance is one of the reasons why one needs an SBOM, they're not entirely unrelated. There are a lot of moving parts there, but I think it's worth considering this as a potential solution to the issue here.

sluongng commented 2 years ago

@adam-azarchs Thanks for pointing out that the new rules_license may be used / intended to generate SBOM in the future.

@aiuto I think you mentioned this in the latest Bazel community update? Is the intention to track the release version / commit / git tag of an upstream dependency under rules_license? Or is there a plan for additional rules in the near future?

aiuto commented 2 years ago

I don't understand your question. If you mean anything about rules_license holding a map of packages to licenses then no. All it holds is the canonical names of licenses and some tools to extract reports about what is in your binaries. Packages should make their own license assertions by pointing to license kinds defined in rules_license. Some users will trust those assertions. Other users will not blindly trust what the author says, and layer their own processes into the package import tool (go_repository, rules_jvm_external, bzlmod, ...) to assign the license kind.

sluongng commented 2 years ago

@aiuto I thought that in https://youtu.be/gYrZDl7K9JM?t=1377 the goal was for rules_license to eventually be able to produce an SBOM using bazel for compliance audits.

Inside the SBOM, one piece of metadata needed is the version of the dependency.

As @adam-azarchs mentioned above, using the additional_info attribute in rules_license, we could record the version of each dependency when we generate a license target for it.

I was asking whether that usage is something you would support as an official use case for rules_license, given that the goal is to create an SBOM. Or would you recommend using a separate set of rules to emit/propagate versioning metadata?

aiuto commented 2 years ago

The intent is that rules_license will eventually have

The data that goes into the SBOM, however, must come from each package itself, or be added to the BUILD files on importing a package. That is, rules_package will never become a central repository that maps, for example, an external repository to its license information.

As far as additional_info, I would rather see a small set of well-defined names instead of a sack of strings. The "version" of a package is a fine thing to add. It is just like the package_name and copyright_text. Just like those attributes, rules_package wouldn't be using it in any way except to hold it in the provider and pass it along to an SBOM report generator.

adam-azarchs commented 2 years ago

This is perhaps getting a little bit off-target from the original issue, but I would say

  1. License compliance is a strict subset of the goals for SBOM. It is perhaps a little unfair to ask rules_license to fill all of the needs for an SBOM. Amongst other things, the ways that licenses propagate through the build graph aren't the same as SBOM information; for example, build tools like compilers are part of SBOM but usually don't need to be considered for license compliance (or at least not in the same way).
  2. That said, there is a lot of overlap. Version and provenance information can be as necessary for verifying license compliance as for any other goal of SBOM.
  3. Bazel has applicable_licenses as a universal attribute, and default_applicable_licenses as a package-level attribute. This makes license rules an attractive target for attaching functionality like this, even if it doesn't quite match up with the original mission for those rules. I do think it's probably better to let a little scope creep happen there than to make further changes to bazel itself to support these use cases.

I agree that the data declared in those license rules needs to be controllable by the user of the package rather than the provider. That's one of a few reasons why I said in https://github.com/bazelbuild/bazel-gazelle/issues/1256 that selection of license kind and file should be up to the caller of the go_repository rule.

I also agree that a free-for-all "bag of strings" isn't ideal, particularly for use in public rules like go_repository. I still think additional_info is useful for allowing organizations to customize rules_license to inject whatever metadata they need without making invasive changes to the rules, but it also does seem to me like version at least would be sufficiently universally applicable to warrant hoisting up into an explicit provider field.

Provenance is also important for both license tracking and SBOM, but is a little bit trickier than version, since there are a lot of ways one might want to describe that. For npm or cargo packages, for example, there's metadata which claims to tell you where the code came from but no system to enforce that. But this isn't about those ecosystems. In the case of go_repository, it's almost always a git (or equivalent) repository, and that's something that's enforced by the build system.

sluongng commented 2 years ago

I think we are getting a bit closer to a concrete solution here. I will attempt to write an RFC below


RFC: Emitting licensing and versioning metadata for go_repository targets

Background

go_repository internal

Before going into the actual solution, I think it's worth reminding folks how the go_repository repository rule works today.

Given the following target:

    go_repository(
        name = "org_golang_x_crypto",
        importpath = "golang.org/x/crypto",
        sum = "h1:kUhD7nTDoI3fVd9G4ORWrbV5NY0liEs/Jg2pv5f+bBA=",
        version = "v0.0.0-20220411220226-7b82a4e95df4",
    )

Upon loading-analysis, go_repository will do several things:

  1. Build go_repository_tools using go install and existing gazelle source code
  2. Extract fetch_repo and gazelle binaries
  3. Fetch the archive. Depending on which attribute was set (url, commit, tag, or version), the archive is downloaded differently:
     a. If url was set, we use bazel's native http_archive download
     b. If commit or tag was set, we use fetch_repo in vcs mode
     c. If version was set, we use fetch_repo in go module mode
  4. Run gazelle over the downloaded and extracted archive to generate BUILD files for the repository with rules_go targets
  5. Apply patch files (if there are any)
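The branching in step 3 can be sketched in Go as follows. This is only an illustration of the decision described above, not the actual Starlark implementation: the repoAttrs type, the fetchMode function, and the mode strings "http", "vcs", and "module" are all hypothetical names chosen for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// repoAttrs mirrors the subset of go_repository attributes named in step 3.
type repoAttrs struct {
	URL     string
	Commit  string
	Tag     string
	Version string
}

// fetchMode picks a download strategy from whichever attribute is set,
// checking them in the order the steps above list them.
func fetchMode(a repoAttrs) (string, error) {
	switch {
	case a.URL != "":
		return "http", nil // archive downloaded directly
	case a.Commit != "" || a.Tag != "":
		return "vcs", nil // fetch_repo in vcs mode
	case a.Version != "":
		return "module", nil // fetch_repo in go module mode
	}
	return "", errors.New("one of url, commit, tag, or version must be set")
}

func main() {
	mode, err := fetchMode(repoAttrs{Version: "v0.0.0-20220411220226-7b82a4e95df4"})
	if err != nil {
		panic(err)
	}
	fmt.Println(mode)
}
```

The selected mode is also the piece of provenance that any emitted metadata would need to record alongside the version or commit.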

Proposed Solution

So I think we are leaning toward a solution where in (4), on top of generating go_library, go_binary, and go_test targets, we would add a license() target to the BUILD file of each go_repository.

This can be accomplished by writing an additional gazelle language extension specifically for a new license "language". This extension should help detecting LICENSE files in different directories and generate license() targets in the BUILD file of said directory/package accordingly.

The source of this extension should be included in the bazel-gazelle repository and built during step (1). When (4) is run, we can optionally pass flags down to the gazelle binary to (a) enable generating the license() target (disabled by default) and (b) toggle adding version and/or checksum information to the license() target.

Currently, license() only supports carrying licensing-related metadata, with the option to use additional_info, a string-to-string map, as a free-for-all metadata holder. We can temporarily put version and/or checksum information into additional_info for now and later migrate to a version attribute that rules_license could add in the future.

Out of scope


☝️ WDYT?

@adam-azarchs is this overlapping with what you proposed in #1256 ?

adam-azarchs commented 2 years ago

Yes, this is mostly overlapping what I proposed in #1256. Some important notes:

  1. I do think it's critical to allow users to specify license_kinds manually, because (as it says all over the design doc for rules_license) the expectation is that most organizations will be forking and customizing rules_license in various ways, and because automatic license parsing is full of potential problems.
  2. While the design doc for rules_license mentions additional_info, the public repo for it doesn't actually have that implemented, yet. See
  3. A clarification on

    This extension should help detecting LICENSE files in different directories and generate license() targets in the BUILD file of said directory/package accordingly.

    As an implementation note, in most repositories there is a single LICENSE file in the repository root which applies to all packages within the repository. So subpackages should in general refer to the license target in the nearest parent directory (usually the repo root).
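The "nearest parent directory" lookup from that implementation note could look roughly like the Go sketch below. It assumes the set of directories containing a LICENSE file has already been collected; the names nearestLicenseDir and licenseLabel, and the target name "license", are hypothetical and not part of gazelle or rules_license.

```go
package main

import (
	"fmt"
	"path"
)

// nearestLicenseDir walks from pkg up toward the repository root and returns
// the first directory listed in licenseDirs. The root is the empty string.
func nearestLicenseDir(pkg string, licenseDirs map[string]bool) (string, bool) {
	dir := pkg
	for {
		if licenseDirs[dir] {
			return dir, true
		}
		if dir == "" {
			return "", false // reached the root without finding a license
		}
		dir = path.Dir(dir)
		if dir == "." {
			dir = "" // path.Dir reports the root as "."
		}
	}
}

// licenseLabel builds the label a subpackage would reference
// (the target name "license" is an assumption).
func licenseLabel(dir string) string {
	return "//" + dir + ":license"
}

func main() {
	licenseDirs := map[string]bool{"": true} // single LICENSE at the repo root
	if dir, ok := nearestLicenseDir("ssh/agent", licenseDirs); ok {
		fmt.Println(licenseLabel(dir))
	}
}
```

With a single root LICENSE, every subpackage resolves to the root target; a nested LICENSE file would take precedence for the packages beneath it.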

aiuto commented 2 years ago

w.r.t. the point about a single LICENSE which multiple build files point to. Yes. That is a common idiom. I have been leaning towards having a tool fix up all the BUILD files somewhere after downloading an archive and before finalizing it as an external repository or checked in code. I started a discussion section for that in OSS Licenses and Bazel Dependency Management

I would rather discuss the requirements in that doc instead of individual repositories.

sluongng commented 2 years ago

As one piece of prior art, rules_jvm_external has repository rules that create java_import targets in external (maven) packages. Each java_import then has version information recorded in its tags attribute under a certain convention.

https://github.com/bazelbuild/rules_jvm_external/blob/269121e701479d7200c0f10536bdcbbef989c0c1/private/dependency_tree_parser.bzl#L231-L249

So another alternative here is to generate all go_library and go_binary targets inside a go_repository with a specific tag in the tags attribute that carries that package's version information.

Then we can use an approach similar to the VMware folks': use an aspect to traverse from the go_binary target and generate an SBOM for each binary. 🤔

https://github.com/vmware/rules_oss_audit/blob/1b2690cefd5a960c181e0d89bf3c076294a0e6f4/oss_audit/java/oss_audit.bzl#L275

The benefit of this compared to the above proposal is that we can record version information in all Bazel packages (parent and subdirectories) inside a go_repository.
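As a rough illustration of what consuming such tags might involve, the Go sketch below extracts a module path and version from a target's tags. The tag convention "go_module=<importpath>@<version>" is purely hypothetical, invented for this example; nothing in gazelle or rules_go defines it today.

```go
package main

import (
	"fmt"
	"strings"
)

// moduleTagPrefix is a hypothetical tag convention for this sketch.
const moduleTagPrefix = "go_module="

// moduleFromTags returns the import path and version from the first tag that
// follows the convention, or ok == false when no well-formed tag is present.
func moduleFromTags(tags []string) (importPath, version string, ok bool) {
	for _, tag := range tags {
		if !strings.HasPrefix(tag, moduleTagPrefix) {
			continue
		}
		value := strings.TrimPrefix(tag, moduleTagPrefix)
		at := strings.LastIndex(value, "@")
		if at <= 0 || at == len(value)-1 {
			continue // malformed: missing path or version
		}
		return value[:at], value[at+1:], true
	}
	return "", "", false
}

func main() {
	tags := []string{
		"manual",
		moduleTagPrefix + "golang.org/x/crypto@v0.0.0-20220411220226-7b82a4e95df4",
	}
	if p, v, ok := moduleFromTags(tags); ok {
		fmt.Println(p, v)
	}
}
```

An aspect walking the go_binary dependency graph would apply the equivalent logic to each target's tags and deduplicate the results before writing the SBOM.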

aiuto commented 2 years ago

cc: @wcn3 FYI