Add evidence used to determine inclusion of a component.

JorisVanEijden commented 2 years ago

Most SBOM generators base the inclusion of a component in an SBOM on a package manager file or the existence of some other file. I would like to be able to trace back what source was used as evidence for the inclusion of a component. The component.evidence property seemed like a good fit, but that only supports "licenses" and "copyright".

Could we add a "files" list to that? Then an SBOM generator can list the file(s) it used to decide to include the component there.

sambhav commented 2 years ago

+1 This would be really useful for cyclonedx support in syft as well. syft currently stores the file "evidence" in its internal model. If we could add this in cyclonedx, it would improve the traceability of SBOM outputs a lot in syft :)

cc: @wagoodman

stevespringett commented 2 years ago

In typical CDX fashion, we will try to target v1.5 for a Q1 2023 release. Therefore, there's some time to flesh this out and support as many use cases as possible, keeping in mind that we try to focus on simplicity, a high degree of automation, and being technology-agnostic, all without making the spec overly large.

With that in mind, are there use cases we can start documenting?

cc: @brianf

JorisVanEijden commented 2 years ago

The main question it needs to answer is: "Why is this component included in this SBOM?"

My actual (way too frequent) scenario:

DependencyTrack says project X has a critical vulnerability in component Y.
Project X says "we don't use component Y"
I say "then why is it in your SBOM?"
They say "I have no idea"

This could be solved with a simple text field saying "/themes/custom/wow/package-lock.json" or "JvE: I statically linked this with the executable", or "Syft v3.2.1 detected this as a Composer dependency in /app/dir/composer.lock".

Or something more complex with different types ("manual", "scanner", "generator", "other") and IDs ("name: Steve Jackson" or "name: Cdxgen, version: 1.3.4") and more types ("filepath", "explanation", "url"), etc.

For me both would work just as well but I see no use case for the more complex structure.
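
As a rough illustration of the two options, the sketch below contrasts a single free-text string with a structured record. The class name `EvidenceNote` and its field names are hypothetical; neither form is part of the CycloneDX schema discussed here.

```python
from dataclasses import dataclass

# Option 1: one free-text field, written as a human or a tool would phrase it.
simple_evidence = "/themes/custom/wow/package-lock.json"


# Option 2: a hypothetical structured record using the types floated above.
@dataclass
class EvidenceNote:
    source_type: str  # "manual", "scanner", "generator" or "other"
    producer: str     # e.g. "Steve Jackson" or "Cdxgen 1.3.4"
    detail_type: str  # "filepath", "explanation" or "url"
    detail: str

    def as_text(self) -> str:
        """Flatten the record back into a single readable line."""
        return f"{self.producer} ({self.source_type}) {self.detail_type}: {self.detail}"


note = EvidenceNote("generator", "Cdxgen 1.3.4", "filepath", "/app/dir/composer.lock")
print(note.as_text())
```

Flattening the structured record gives roughly the same information as the free-text field, which fits the point that both forms answer the "why is this here?" question.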

brianf commented 2 years ago

We provide information in our tooling that we call "occurrences", which would include a list of the file paths where the component was detected and the binary fingerprint that this thing matched. This helps people understand embedded cases, or even when we detect a similar match or when something was renamed (e.g., foo.jar is actually log4j).

stevespringett commented 2 years ago

Thanks @brianf. Few questions...

Is the binary fingerprint reproducible from external tools? If so, can you provide a pointer on where we can find more information?
Would it be useful to capture the fingerprint and the tool that generated the fingerprint for every occurrence?
What else would be useful to capture?

planetlevel commented 2 years ago

Some evidence we would like to provide, related to measuring at runtime:

we found the right libraries -- libraries that are not in the code repo (appserver, runtime platform)
we did not find wrong libraries -- libraries only used in build or test environments
hashes of libraries as loaded in production
libraries that are used/unused
number of classes/methods used in components
the actual classes/methods invoked and the callsites/traces capturing how they are invoked

We could provide callsites/traces showing exactly how libraries are used by applications, but it is a LOT of data. It would be useful for folks trying to verify whether a vulnerable method (like the JNDI lookup in log4j) is actually used. Also useful when attempting to remove a component from an application.
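
A minimal sketch of the used/unused distinction, assuming a tool already has the set of components declared in the SBOM and the set observed loaded at runtime. The function name, the result keys, and the sample component names are all hypothetical.

```python
from typing import Dict, List, Set


def classify_libraries(declared: Set[str], loaded: Set[str]) -> Dict[str, List[str]]:
    """Compare libraries listed in an SBOM with those seen loaded at runtime."""
    return {
        "used": sorted(declared & loaded),          # declared and observed
        "unused": sorted(declared - loaded),        # e.g. build/test-only libraries
        "runtime_only": sorted(loaded - declared),  # e.g. appserver or platform libraries
    }


# Illustrative component names only.
declared = {"log4j-core", "junit", "commons-io"}
loaded = {"log4j-core", "commons-io", "appserver-runtime"}
report = classify_libraries(declared, loaded)
print(report)
```

The "runtime_only" bucket corresponds to libraries that are not in the code repo, and "unused" to libraries that only appear in build or test environments.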

mrutkows commented 2 years ago

Want to highlight that evidence is always associated with a "tool" (loosely) used against 1 or more components or services (or hardware) to evaluate some security or compliance use case; therefore, the schema needs a robust means to associate:

evidence/provenance/other (inputs/outputs/state) -- tool -- hardware/component/service (resource under inspection)
at various stages of a "build" (read CI/CD for "build") against what we are discussing as a "formulation".

First, please be aware that the Sigstore project and its Rekor (https://github.com/sigstore/rekor) "transparency log" has structured records for all types of "attestations" (evidence) around CI/CD, including OIDC attestations (IDs that are used for "build"), changes in MF auth. of an identity, records to attest that package manager builds have been signed/verified (near-term plans to be used by NPM, Ruby, Java, etc.), records of Cert. generations for ephemeral keys (per-CI build, from SPIFFE/SPIRE), and more... In all these cases a ref. to the log entry (id) and format would be quite valuable for downstream tooling.

See TUF formats (https://github.com/theupdateframework) where many of these records are being standardized and "future proofed".
See the OpenSSF FRSCA project, which is looking to produce standardized evidence that can map their "controls" to SLSA (and have recently been discussing OSCAL as the canonical mapping).

In general, the types of "evidence" (I will use this term loosely) we actually have today (note ALL relate to "tools" that produce it) which is being produced by Tekton (SLSA compliant) CI systems and are being stored in crude "evidence lockers" include:

CI system "pipeline run" instance (evidence the pipeline was run with proper config/creds., no runtime/container mutations)
CI "task" (evidence the task was invoked with proper config/creds., no runtime/container mutations)
Scorecard "checks" tests on scsm/project health/source provenance, etc. (and Gauge evidence of provenance and developer origin)
Component / Service graphing tools (e.g., GitBom) and decisions trees (assure nothing skipped for polyglot or for media/file types)
- Note: GitBom produces a std. ADG graph format
SAST - (evidence of runtime env. perhaps even test matrix as supported in many CI systems) and static tests run (names, results)
DAST - (evidence of the staging environment creation/config along with all dynamic tests (names, results))
Fuzzing (evidence that API/endpoint tests were run, e.g., RPC/HTTP GET/POST calls invoked)
license/copyright/legal scanners (evidence are regex/regex templates used to scan source for presence of known or suspect legal language)
- include spdx templates as evidence
fingerprinting: evidence of "genome" produced (and model which may vary per binary type)

In all cases, the configurations, parameters and env. vars. (snapshot) present at tool invocation are a necessity for potentially achieving repro. builds, with the SBOM as a potentially viable means of capture.

and more to come!

stevespringett commented 1 year ago

@brianf for occurrences, what types of data do you have? Is it only the paths or is there other data?

planetlevel commented 1 year ago

IAST (evidence from complete running app/API stack, including exactly which libraries, classes, and methods are loaded and run. Also evidence of vulnerability testing, all exposed routes, all backend connections (services))

brianf commented 1 year ago

> @brianf for occurrences, what types of data do you have? Is it only the paths or is there other data?

Our occurrences are file paths. This way, if someone questions a finding, say it’s embedded inside another component, we can provide the exact path (sometimes a bang path) to where they can see the component in the scan path (workspace, CI build, application zip, etc.).
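
To make the bang-path idea concrete, here is a small sketch that splits an occurrence path into the outer file and the entries nested inside it. It assumes "!" is the separator between an archive and an entry within it, as in Java jar-style paths; the function name and the sample path are hypothetical.

```python
from typing import List


def split_bang_path(location: str) -> List[str]:
    """Split an occurrence path into the outer file and any nested entries.

    Assumes "!" separates an archive from an entry inside it, so a file
    name that itself contains "!" would be split incorrectly.
    """
    return [part for part in location.split("!") if part]


parts = split_bang_path("/workspace/app.zip!/lib/foo.jar!/META-INF/MANIFEST.MF")
print(parts)
```

The first element is the path on disk in the scan location; each later element is an entry inside the archive named by the element before it.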

brianf commented 1 year ago

> Thanks @brianf. Few questions...

> Is the binary fingerprint reproducible from external tools? If so, can you provide a pointer on where we can find more information?

Some of the binary fingerprints are sha-x so yes. Similar match fingerprints for detecting a recompiled or slightly altered file are ultimately also sha fingerprints but of combinations of data that are ultimately proprietary. These wouldn’t typically appear in an SBOM output however, they’d be used internal to our tooling communication to figure out what a thing is.

> Would it be useful to capture the fingerprint and the tool that generated the fingerprint for every occurrence?

If the fingerprints don’t exist elsewhere in the BOM, then yes I think that would be useful. This way tools that can go a level deeper and do binary matching analysis have more to validate, or even augment.
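
For the plain SHA fingerprints mentioned above, an external tool could reproduce the value with a standard digest. The sketch below is one way to do that with Python's `hashlib`; the function name `stream_digest` is hypothetical, and it does not cover the proprietary similar-match fingerprints.

```python
import hashlib
import io
from typing import BinaryIO


def stream_digest(stream: BinaryIO, algorithm: str = "sha256") -> str:
    """Hash a binary stream in chunks so large archives need not fit in memory."""
    digest = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(64 * 1024), b""):
        digest.update(chunk)
    return digest.hexdigest()


# In-memory stand-in for a file opened with open(path, "rb").
sample = io.BytesIO(b"example component bytes")
print(stream_digest(sample))
```

A second tool that computes the same digest over the same bytes can then confirm or augment the occurrence.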

> What else would be useful to capture?

stevespringett commented 1 year ago

Ok, trying to flesh out a few ideas here... Bear with me... If I have log4j-core and I want to describe the evidence collected to determine that the library is indeed log4j-core, I might end up with the following information:

(note: this uses both an SCA and IAST example in one. Not sure if that would really be possible, but trying to illustrate both since we have reps from both on this ticket)

"components": [
  {
    "type": "library",
    "group": "org.apache.logging.log4j",
    "name": "log4j-core",
    "version": "2.14.0",
    "evidence": {
      "identity": [
        {
          "field": "group | name | version | purl | swid",
          "confidence": "0..1",
          "methods": [
            "source-code-analysis", 
            "binary-analysis", 
            "manifest-analysis", 
            "ast-fingerprint", 
            "instrumentation", 
            "dynamic-analysis", 
            "other" 
          ],
          "source": "where was the evidence found...",
          "name": "",
          "value": ""
        }
      ],
      "formulation": [
        {
          "ref": ""
        }  
      ],
      "occurrences": [
        "/path/to/log4j-core-2.14.0.jar"
      ],
      "callstack": {
        "frames": [
          {
            "package": "org.apache.logging.log4j.core",
            "module": "Logger.class",
            "function": "logMessage",
            "parameters": [
              "com.acme.HelloWorld", "Level.INFO", null, "Hello World"
            ],
            "line": 150,
            "column": 17,
            "fullFilename": "/path/to/log4j-core-2.14.0.jar!/org/apache/logging/log4j/core/Logger.class"
          },
          {
            "module": "HelloWorld.class",
            "function": "main",
            "line": 20,
            "column": 12,
            "fullFilename": "/path/to/HelloWorld.class"
          }
        ]
      }
    }
  }
]
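
The placeholders in this draft ("group | name | version | purl | swid", "0..1", and the list of methods) read as constraints on each identity entry. The sketch below checks one entry against them. The allowed values are copied from the draft above; the function name and error messages are hypothetical, and this is not an official validator.

```python
from typing import List

# Allowed values as listed in the draft above.
ALLOWED_FIELDS = {"group", "name", "version", "purl", "swid"}
ALLOWED_METHODS = {
    "source-code-analysis", "binary-analysis", "manifest-analysis",
    "ast-fingerprint", "instrumentation", "dynamic-analysis", "other",
}


def identity_problems(entry: dict) -> List[str]:
    """Return human-readable problems found in one identity evidence entry."""
    problems = []
    if entry.get("field") not in ALLOWED_FIELDS:
        problems.append("unknown field")
    confidence = entry.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 1:
        problems.append("confidence must be a number between 0 and 1")
    unknown = sorted(set(entry.get("methods", [])) - ALLOWED_METHODS)
    if unknown:
        problems.append("unknown methods: " + ", ".join(unknown))
    return problems


good = {"field": "purl", "confidence": 0.9, "methods": ["binary-analysis"]}
bad = {"field": "purl", "confidence": 1.5, "methods": ["guess"]}
print(identity_problems(good))
print(identity_problems(bad))
```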

planetlevel commented 1 year ago

I'm trying to understand this through the lens of the typical claim-evidence structure. Here we make some claims about the library identity (name, version, etc.), and I can imagine some evidence of that... like we found a file with the name "log4j-core-2.14.0.jar" at this location on this host (low confidence). Or we calculated a hash from the bytes loaded at runtime, and matched that hash with a hash in xyz database (high confidence). Or we did some fingerprint thing that found a 98% match with log4j-core-2.14.0.jar from some binary repo (98% confidence this is a modified version of log4j).

The other claim here is that this library is actually used in production. You could provide static evidence of this - sometimes called reachability (low confidence) or instrumentation-based evidence (high confidence). I think providing the full stack trace of that interaction would be excellent evidence (but it's a LOT of data). Perhaps the parameters help... but there could be infinite variations of the parameters, so I guess you just report the first one? Would you do this for all the classes and methods in every library? Seems like a LOT of data for little payoff. For me, it would be strong enough evidence to simply report that a class from a particular library was observed to be loaded at a particular time by a tool that has the ability to observe that operation.

Contrast can capture all the classes that are actually used by the application. This data is very useful when trying to determine whether the vulnerable part of a library is actually in use. Personally, though, if a library has a vulnerability and any part of it is also used, I think the smart policy is to upgrade. This eliminates the 62% of libraries that are never used at all, and lets you focus on the libraries that are both vulnerable and actually used.

jkowalleck commented 1 year ago

stevespringett commented 1 year ago

@madpah Is there a difference between confidence and whether something was an exact match or not?

stevespringett commented 1 year ago

@planetlevel We're going to target reachability in CDX 1.7 with #103. In the meantime, we're planning on adding support for evidence of identity and the occurrences in which the component was found.

PR to come later this weekend.

"evidence": {
  "identity": {
    "field": "purl",
    "confidence": 1,
    "methods": [
      {
        "technique": "filename",
        "confidence": 0.1,
        "value": "log4j-core-2.20.0.jar"
      },
      {
        "technique": "ast-fingerprint",
        "confidence": 0.9,
        "value": "61e4bc08251761c3a73b606b9110a65899cb7d44f3b14c81ebc1e67c98e1d9ab"
      },
      {
        "technique": "hash-comparison",
        "confidence": 0.7,
        "value": "7c547a9d67cc7bc315c93b6e2ff8e4b6b41ae5be454ac249655ecb5ca2a85abf"
      }
    ],
    "tools": [
      "bom-ref-of-tool-that-performed-analysis"
    ]
  },
  "occurrences": [
    {
      "bom-ref": "d6bf237e-4e11-4713-9f62-56d18d5e2079",
      "location": "/path/to/component"
    },
    {
      "bom-ref": "b574d5d1-e3cf-4dcd-9ba5-f3507eb1b175",
      "location": "/another/path/to/component"
    }
  ]
}

@brianf thoughts on the above?
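
To show how a consumer might read the proposed structure, here is a short sketch that picks the most confident identity method and lists the occurrence locations. The sample data is abbreviated from the proposal (the "value" and "bom-ref" entries are left out), and the helper names are hypothetical.

```python
import json
from typing import Any, Dict, List

# Abbreviated copy of the proposed "evidence" object.
EVIDENCE_JSON = """
{
  "identity": {
    "field": "purl",
    "confidence": 1,
    "methods": [
      {"technique": "filename", "confidence": 0.1},
      {"technique": "ast-fingerprint", "confidence": 0.9},
      {"technique": "hash-comparison", "confidence": 0.7}
    ]
  },
  "occurrences": [
    {"location": "/path/to/component"},
    {"location": "/another/path/to/component"}
  ]
}
"""


def strongest_method(evidence: Dict[str, Any]) -> Dict[str, Any]:
    """Return the identity method with the highest confidence."""
    return max(evidence["identity"]["methods"], key=lambda m: m["confidence"])


def occurrence_locations(evidence: Dict[str, Any]) -> List[str]:
    """Return every location where the component was found."""
    return [item["location"] for item in evidence.get("occurrences", [])]


evidence = json.loads(EVIDENCE_JSON)
print(strongest_method(evidence)["technique"])
print(occurrence_locations(evidence))
```

With occurrences in the BOM, the "why is this component in your SBOM?" question can be answered by pointing at the listed locations.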

planetlevel commented 1 year ago

Wouldn't hash match be 1.0? Just want to make sure I'm not misunderstanding this.

planetlevel commented 1 year ago

@stevespringett - we're still including the option to include callstack evidence, right?

stevespringett commented 1 year ago

@planetlevel Would you like it included? If so, is the proposal adequate or does it need revision? If it's ok as is, I'll update the PR to include it.

stevespringett commented 1 year ago

The hash could match, but say it's an MD5 or SHA1 with known collision possibilities, the confidence may be less than one. If it's a SHA256 or higher, then likely the confidence would be 1. But it's just an example above.
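
One way a tool could encode that reasoning is a lookup from hash algorithm to confidence. In the sketch below, the specific numbers for MD5 and SHA1 are purely illustrative assumptions, as are the table and function names; only the idea that weaker algorithms get less than 1 comes from the comment above.

```python
import hashlib
from typing import Dict, Optional

# Hypothetical confidence per algorithm; the numbers are illustrative only.
HASH_CONFIDENCE: Dict[str, float] = {
    "md5": 0.7,
    "sha1": 0.8,
    "sha256": 1.0,
    "sha512": 1.0,
}


def hash_match_evidence(data: bytes, known_digest: str, algorithm: str) -> Optional[dict]:
    """Return a "hash-comparison" method entry if the digest matches, else None.

    Raises KeyError for algorithms missing from HASH_CONFIDENCE.
    """
    confidence = HASH_CONFIDENCE[algorithm]
    digest = hashlib.new(algorithm, data).hexdigest()
    if digest != known_digest.lower():
        return None
    return {"technique": "hash-comparison", "confidence": confidence, "value": digest}


payload = b"example component bytes"
known = hashlib.sha256(payload).hexdigest()
print(hash_match_evidence(payload, known, "sha256"))
```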

jkowalleck commented 1 year ago

Can somebody please answer in short: why have an overall confidence and multiple specific confidences, but not publish the weights of the specific confidence values? :mag: see https://github.com/CycloneDX/specification/pull/199#issuecomment-1488333109
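
To illustrate why the question matters: without published weights or a stated rule, a consumer cannot reconstruct the overall confidence from the per-method values. The sketch below applies three candidate rules to the method confidences from the example (0.1, 0.9, 0.7). None of these rules is proposed in this thread; they are assumptions chosen only to show that the result depends on the rule.

```python
from typing import List


def combine_max(confidences: List[float]) -> float:
    """Trust only the single strongest method."""
    return max(confidences)


def combine_mean(confidences: List[float]) -> float:
    """Weight every method equally."""
    return sum(confidences) / len(confidences)


def combine_noisy_or(confidences: List[float]) -> float:
    """Treat methods as independent: 1 minus the chance that all of them are wrong."""
    all_wrong = 1.0
    for value in confidences:
        all_wrong *= 1.0 - value
    return 1.0 - all_wrong


method_confidences = [0.1, 0.9, 0.7]
for name, rule in (("max", combine_max), ("mean", combine_mean), ("noisy-or", combine_noisy_or)):
    print(f"{name}: {rule(method_confidences):.3f}")
```

These rules give roughly 0.900, 0.567 and 0.973, and none of them reproduces the overall value of 1 shown in the example.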

planetlevel commented 1 year ago

> @planetlevel Would you like it included? If so, is the proposal adequate or does it need revision? If it's ok as is, I'll update the PR to include it.

Yes, we should include it.

CycloneDX / specification

Add evidence used to determine inclusion of a component. #129