anchore / grype

A vulnerability scanner for container images and filesystems
Apache License 2.0

Add DB v6 schema #2128

Closed wagoodman closed 5 days ago

wagoodman commented 2 months ago

DB v6 is meant to cover several use cases. The high-level goals are:

The high level design is as follows:

Here is a list of the “Handle” tables to search against:

These are related to two other auxiliary tables:

Here's how the AffectedPackageHandle table would relate to the auxiliary tables:

CREATE TABLE AffectedPackageHandle (
    id INTEGER PRIMARY KEY,
    vulnerability_id INTEGER NOT NULL,
    operating_system_id INTEGER,
    package_id INTEGER NOT NULL,
    blob_id INTEGER NOT NULL,
    FOREIGN KEY(vulnerability_id) REFERENCES VulnerabilityHandle(id),
    FOREIGN KEY(operating_system_id) REFERENCES OperatingSystem(id),
    FOREIGN KEY(package_id) REFERENCES Package(id),
    FOREIGN KEY(blob_id) REFERENCES Blobs(id)
);

-- supporting tables...

CREATE TABLE VulnerabilityHandle (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    blob_id INTEGER UNIQUE,
    FOREIGN KEY(blob_id) REFERENCES Blobs(id)
);

CREATE TABLE Package (
    id INTEGER PRIMARY KEY,
    ecosystem TEXT NOT NULL,
    name TEXT NOT NULL,
    UNIQUE(name, ecosystem)
);

CREATE TABLE OperatingSystem (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    major_version TEXT NOT NULL,
    minor_version TEXT NOT NULL,
    codename TEXT,
    UNIQUE(name, major_version, minor_version)
);

And the rather simple Blobs table:

CREATE TABLE Blobs (
    id INTEGER PRIMARY KEY,
    value TEXT NOT NULL UNIQUE
);

Here's how you might craft a search for an affected package for a specific OS:

SELECT 
    affected.blob_id AS affected_blob_id,
    vuln.blob_id AS vulnerability_blob_id
FROM 
    AffectedPackageHandle affected
JOIN 
    OperatingSystem os ON affected.operating_system_id = os.id
JOIN 
    Package pkg ON affected.package_id = pkg.id
JOIN 
    VulnerabilityHandle vuln ON affected.vulnerability_id = vuln.id
WHERE 
    os.name = 'ubuntu' AND 
    os.major_version = '22' AND 
    os.minor_version = '04' AND 
    pkg.name = 'openssl';

At this point you can take these blob IDs, query the Blobs table for the JSON payloads, and deserialize. This has an advantage over the existing schemas: you can conditionally inflate DB objects based on what you need, rather than having to inflate entire records that you end up throwing away.

In v1-v5 you'd need to craft the correct namespace, which was a bespoke string -- v6 shifts this to per-record relations.

Eventually we’d like to add additional handle tables (out of scope for v6 though):

The proposed blobs are as follows:

type VulnerabilityBlob struct {
    ID string `json:"id"` 

    // Name of the Vunnel provider (or sub processor responsible for data records from a single specific source, e.g. "ubuntu")
    ProviderName string `json:"provider"`

    // List of names, emails, or organizations who submitted the vulnerability
    Assigner []string `json:"assigner,omitempty"`

    // Description of the vulnerability as provided by the source
    Description string `json:"description"`

    // Date the vulnerability record was last modified
    ModifiedDate *time.Time `json:"modified,omitempty"`

    // Date the vulnerability record was first published
    PublishedDate *time.Time `json:"published,omitempty"`

    // Date the vulnerability record was withdrawn
    WithdrawnDate *time.Time `json:"withdrawn,omitempty"`

    // Conveys the current status of the vulnerability
    Status VulnerabilityStatus `json:"status"`

    // URLs to external resources that provide more information about the vulnerability
    References []struct {
        // External resource
        URL string `json:"url"`

        // Free-form organizational field to convey additional information about the reference
        Tags []string `json:"tags,omitempty"`
    } `json:"refs,omitempty"`

    // List of IDs of the same vulnerability in other databases, in the form of the ID field.
    Aliases []string `json:"aliases,omitempty"`

    // List of severity indications (quantitative or qualitative) for the vulnerability
    Severities []struct {
        // Describes the quantitative method used to determine the score, or indicates the qualitative value
        Scheme SeverityScheme `json:"scheme"`

        // Severity score (e.g., "7.5", "CVSS:4.0/AV:N/AC:L/AT:N/PR:H/UI:N/VC:L/VI:L/VA:N/SC:N/SI:N/SA:N", or "high")
        Value string `json:"value"`

        // Name of the source of the severity score (e.g., "nvd@nist.gov" or "security-advisories@github.com")
        Source string `json:"source"`

        // Free-form organizational field to convey priority over other severities
        Rank int `json:"rank"`
    } `json:"severities,omitempty"`
}
type AffectedBlob struct {
    CVEs          []string `json:"cves"`
    RpmModularity string   `json:"rpm_modularity,omitempty"`
    PlatformCPEs  []string `json:"platform_cpes,omitempty"`
    Ranges        []struct {
        Version struct {
            // Type of version range, such as "semver", "rpm", "pypi", etc.
            Type string `json:"type"`

            // Allows for a version range expression, such as ">=1.0.0", "1.0.0", ">= 1.0, <2.0", etc.
            Constraint string `json:"constraint"`
        } `json:"version"`

        Fix *struct {
            Version string `json:"version"`
            State   string `json:"state"`
            Detail  *struct {
                GitCommit  string `json:"git_commit"`
                Timestamp  time.Time `json:"timestamp"`
                References []struct {
                    // External resource
                    URL string `json:"url"`

                    // Free-form organizational field to convey additional information about the reference
                    Tags []string `json:"tags,omitempty"`
                } `json:"references,omitempty"`
            } `json:"detail,omitempty"`
        } `json:"fix,omitempty"`
    } `json:"ranges,omitempty"`
}

Implied changes from this:

  1. All Blob entities…
     a. may have fields added to them over time, but not removed or modified (within the same schema version)
     b. only one “version” of a blob may exist within a DB at a time
  2. When you have additional data that fits new use cases, add a new Handle table that points to new Blob types.
  3. Remove the concept of a “Namespace” from the DB. All search data should be queried with the proper joins to auxiliary tables (or not), depending on what you’re searching for. This way, instead of searching within a namespace string pattern, you enumerate the associated qualities you want to filter on.

Specific changes (see prototype models for reference):

TimBrown1611 commented 2 months ago

hi @wagoodman ,

wagoodman commented 1 month ago

are we expecting the scan results to change from v5 to v6?

No, there is no functional change to the match results as currently planned; they should be the same. This could change if we fix any matcher bugs while rewriting that portion of the search package, but we would not merge changes that make results worse, only better.

will the db be stateful?

It will be more stateful than it is today. Most information will live in the Blobs table and be content addressable. We won't be keeping the value digests in the DB (to save space), but it would be possible to digest all values in that table and compare the digests computed from one DB against another to get a rough sense of the changes. We cannot use the blob IDs themselves, since they cannot be considered stable between DB builds.
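The digest-comparison idea above could be sketched like this -- a hypothetical, order-independent set digest over blob values, not anything grype implements today:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// digestDB hashes every blob value and folds the sorted per-value digests
// into a single digest for the whole set. Sorting makes the result depend
// only on content, not on blob ID assignment (IDs are not stable across
// DB builds) or row order.
func digestDB(values []string) string {
	digests := make([]string, 0, len(values))
	for _, v := range values {
		sum := sha256.Sum256([]byte(v))
		digests = append(digests, hex.EncodeToString(sum[:]))
	}
	sort.Strings(digests)
	total := sha256.Sum256([]byte(strings.Join(digests, "\n")))
	return hex.EncodeToString(total[:])
}

func main() {
	a := []string{`{"id":"CVE-1"}`, `{"id":"CVE-2"}`}
	b := []string{`{"id":"CVE-2"}`, `{"id":"CVE-1"}`} // same content, different row order
	fmt.Println(digestDB(a) == digestDB(b))           // true
}
```

Two DB builds with identical blob content would produce the same digest even though their autoincrement IDs differ.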

how easy it will be to add new tables?

Since there would be new Handle tables for searching and new Blob schemas for the entries they point to, at any point during the v6 lifetime we can add these new elements without a breaking change or a paradigm shift. Take your EOL example: I imagine there would be a new EolBlob that represents a JSON payload with cpe, purl, version, date, etc., as well as a new EolHandle table that is indexed by cpe or purl. We might also add new entries into the Package table to join on (optionally, depending on what kind of data we pull in from that dataset).

will it effect grype's schema (v6 = grype 1.0.0)?

grype is still in v0, so we're still making breaking changes on minor releases. However, the v6 schema is a big step towards grype 1.0. We still need to determine whether we want to make the DB schema part of the grype version contract (e.g., should we allow breaking schema changes without bumping the major version of grype?) -- that hasn't been decided yet.