anchore / grype

A vulnerability scanner for container images and filesystems
Apache License 2.0

Add DB v6 schema #2128

Closed wagoodman closed 5 days ago

wagoodman commented 2 months ago

DB v6 is meant to cover several use cases. The high-level goals are:

The high level design is as follows:

Here is a list of the “Handle” tables to search against:

These are related to two other auxiliary tables:

Here's how the AffectedPackageHandle table would relate to the auxiliary tables:

CREATE TABLE AffectedPackageHandle (
    id INTEGER PRIMARY KEY,
    vulnerability_id INTEGER NOT NULL,
    operating_system_id INTEGER,
    package_id INTEGER NOT NULL,
    blob_id INTEGER NOT NULL,
    FOREIGN KEY(vulnerability_id) REFERENCES VulnerabilityHandle(id),
    FOREIGN KEY(operating_system_id) REFERENCES OperatingSystem(id),
    FOREIGN KEY(package_id) REFERENCES Package(id),
    FOREIGN KEY(blob_id) REFERENCES Blobs(id)
);

-- supporting tables...

CREATE TABLE VulnerabilityHandle (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    blob_id INTEGER UNIQUE,
    FOREIGN KEY(blob_id) REFERENCES Blobs(id)
);

CREATE TABLE Package (
    id INTEGER PRIMARY KEY,
    ecosystem TEXT NOT NULL,
    name TEXT NOT NULL,
    UNIQUE(name, ecosystem)
);

CREATE TABLE OperatingSystem (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    major_version TEXT NOT NULL,
    minor_version TEXT NOT NULL,
    codename TEXT,
    UNIQUE(name, major_version, minor_version)
);

And the rather simple Blobs table:

CREATE TABLE Blobs (
    id INTEGER PRIMARY KEY,
    value TEXT NOT NULL UNIQUE
);

Here's how you might craft a search for an affected package for a specific OS:

SELECT 
    affected.blob_id AS affected_blob_id,
    vuln.blob_id AS vulnerability_blob_id
FROM 
    AffectedPackageHandle affected
JOIN 
    OperatingSystem os ON affected.operating_system_id = os.id
JOIN 
    Package pkg ON affected.package_id = pkg.id
JOIN 
    VulnerabilityHandle vuln ON affected.vulnerability_id = vuln.id
WHERE 
    os.name = 'ubuntu' AND 
    os.major_version = '22' AND 
    os.minor_version = '04' AND 
    pkg.name = 'openssl';

At this point you can take these blob IDs, query the Blobs table for the JSON payloads, and deserialize. This has an advantage over the existing schemas: you can conditionally inflate DB objects based on what you need, rather than having to inflate entire records that you end up throwing away.

In v1-v5 you'd need to craft the correct namespace, which was a bespoke string -- v6 shifts this to per-record relations.

Eventually we’d like to add additional handle tables (out of scope for v6 though):

The proposed blobs are as follows:

type VulnerabilityBlob struct {
    ID string `json:"id"` 

    // Name of the Vunnel provider (or sub processor responsible for data records from a single specific source, e.g. "ubuntu")
    ProviderName string `json:"provider"`

    // List of names, emails, or organizations who submitted the vulnerability
    Assigner []string `json:"assigner,omitempty"`

    // Description of the vulnerability as provided by the source
    Description string `json:"description"`

    // Date the vulnerability record was last modified
    ModifiedDate *time.Time `json:"modified,omitempty"`

    // Date the vulnerability record was first published
    PublishedDate *time.Time `json:"published,omitempty"`

    // Date the vulnerability record was withdrawn
    WithdrawnDate *time.Time `json:"withdrawn,omitempty"`

    // Conveys the current status of the vulnerability
    Status VulnerabilityStatus `json:"status"`

    // URLs to external resources that provide more information about the vulnerability
    References []struct {
        // External resource
        URL string `json:"url"`

        // Free-form organizational field to convey additional information about the reference
        Tags []string `json:"tags,omitempty"`
    } `json:"refs,omitempty"`

    // List of IDs of the same vulnerability in other databases, in the form of the ID field.
    Aliases []string `json:"aliases,omitempty"`

    // List of severity indications (quantitative or qualitative) for the vulnerability
    Severities []struct {
        // Describes the quantitative method used to determine the score, or indicates the qualitative value
        Scheme SeverityScheme `json:"scheme"`

        // Severity score (e.g., "7.5", "CVSS:4.0/AV:N/AC:L/AT:N/PR:H/UI:N/VC:L/VI:L/VA:N/SC:N/SI:N/SA:N", or "high")
        Value string `json:"value"`

        // Name of the source of the severity score (e.g., "nvd@nist.gov" or "security-advisories@github.com")
        Source string `json:"source"`

        // Free-form organizational field to convey priority over other severities
        Rank int `json:"rank"`
    } `json:"severities,omitempty"`
}
type AffectedBlob struct {
    CVEs          []string `json:"cves"`
    RpmModularity string   `json:"rpm_modularity,omitempty"`
    PlatformCPEs  []string `json:"platform_cpes,omitempty"`
    Ranges        []struct {
        Version struct {
            // Type of version range, such as "semver", "rpm", "pypi", etc.
            Type string `json:"type"`

            // Allows for a version range expression, such as ">=1.0.0", "1.0.0", ">= 1.0, <2.0", etc.
            Constraint string `json:"constraint"`
        } `json:"version"`

        Fix *struct {
            Version string `json:"version"`
            State   string `json:"state"`
            Detail  *struct {
                GitCommit  string `json:"git_commit"`
                Timestamp  time.Time `json:"timestamp"`
                References []struct {
                    // External resource
                    URL string `json:"url"`

                    // Free-form organizational field to convey additional information about the reference
                    Tags []string `json:"tags,omitempty"`
                } `json:"references,omitempty"`
            } `json:"detail,omitempty"`
        } `json:"fix,omitempty"`
    } `json:"ranges,omitempty"`
}

Implied changes from this:

  1. All Blob entities…
     a. may have fields added to them over time, but not removed or modified (within the same schema version)
     b. only one “version” of a blob may exist within a DB at a time
  2. When you have additional data that fits new use cases, add a new Handle table that points to new Blob types.
  3. Remove the concept of a “Namespace” from the DB. All search data should be queried with the proper joins to auxiliary tables (or not), depending on what you’re searching for. This way, instead of searching within a namespace string pattern, you enumerate the associated qualities you want to filter on.

Specific changes (see prototype models for reference):

TimBrown1611 commented 2 months ago

hi @wagoodman ,

wagoodman commented 1 month ago

are we expecting the scan results to change from v5 to v6?

No, there is no functional change to the match results as currently planned; they should be the same. This could change if we fix any matcher bugs while rewriting that portion of the search package, but we would not merge changes that make results worse, only better.

will the db be stateful?

It will be more stateful than it is today. Most information will live in the Blobs table and be content addressable. We won't be keeping the value digests in the DB (to save space), but it would be possible to digest all values in that table and compare the digests computed from one DB against another to get a rough sense of the changes. We cannot use the blob IDs themselves, since they cannot be considered stable between DB builds.
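The digest-comparison idea above could be sketched like this -- a hypothetical, order-independent set digest over blob values, not anything grype implements today:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// digestDB hashes every blob value and folds the sorted per-value digests
// into a single digest for the whole set. Sorting makes the result depend
// only on content, not on blob ID assignment (IDs are not stable across
// DB builds) or row order.
func digestDB(values []string) string {
	digests := make([]string, 0, len(values))
	for _, v := range values {
		sum := sha256.Sum256([]byte(v))
		digests = append(digests, hex.EncodeToString(sum[:]))
	}
	sort.Strings(digests)
	total := sha256.Sum256([]byte(strings.Join(digests, "\n")))
	return hex.EncodeToString(total[:])
}

func main() {
	a := []string{`{"id":"CVE-1"}`, `{"id":"CVE-2"}`}
	b := []string{`{"id":"CVE-2"}`, `{"id":"CVE-1"}`} // same content, different row order
	fmt.Println(digestDB(a) == digestDB(b))           // true
}
```

Two DB builds with identical blob content would produce the same digest even though their autoincrement IDs differ.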

how easy it will be to add new tables?

Since there would be new Handle tables for searching and new Blob schemas for the entries they point to, at any point during the v6 lifetime we can add these new elements without a breaking change or a paradigm shift. Take your EOL example: I imagine there would be a new EolBlob that represents a JSON payload with cpe, purl, version, date, etc., as well as a new EolHandle table that is indexed by cpe or purl. We might also add new entries into the Package table to join on (optionally, depending on what kind of data we pull in from that dataset).

will it effect grype's schema (v6 = grype 1.0.0)?

grype is still in v0, so we're still making breaking changes on minor releases. However, the v6 schema is a big step towards grype 1.0. We still need to determine whether we want to make the DB schema part of the grype version contract (e.g., should we allow breaking schema changes without bumping the major version of grype?) -- that hasn't been decided yet.