computable open-source version information

CVEProject / cve-schema

This repository is used for the development of the CVE JSON record format. Releases of the CVE JSON record format will also be published here. This repository is managed by the CVE Quality Working Group.

Creative Commons Zero v1.0 Universal

computable open-source version information #87

Closed rsc closed 3 years ago

rsc commented 3 years ago

Background

The OSV schema has been adopted by Go, OSV, Python, Rust, and UVI to describe vulnerabilities in open-source software. The OSV schema’s key advantage over the CVE format is that it identifies the specific affected packages and versions in a precise, computable way.

For example, suppose we wanted to check whether a particular software package, as described by an SBOM, made use of any open-source components with known vulnerabilities. An SBOM for a given package ecosystem would be a list of its packages and versions. A tool can test whether each SBOM entry is affected by a database entry written to the OSV schema, without any additional information (such as a version or commit graph, or access to the repository containing the source code for the open-source software). This is what we mean when we say the package and version identification is computable.

We propose that the new CVE JSON schema be changed to make its package and version identification computable too. This would make it possible for vulnerability-checking tools to check SBOMs against the CVE database as easily as they can currently check SBOMs against OSV-schema databases. Adjusting the CVE JSON schema would also allow OSV-schema databases to embed their information into CVE format, allowing all their vulnerability information to be pushed upstream to the CVE database and then propagated to any CVE-aware software, a net benefit for the entire software ecosystem.

This issue focuses on computable version identification. See issue #86 for computable package identification.

Computable version identification

After identifying that a particular package listed in an SBOM matches a package in a CVE database entry (#NNN), a vulnerability scanner must next identify whether the specific version in the SBOM is considered affected by the CVE. The entry must include self-contained information sufficient to make this decision algorithmically. The current schema does not satisfy this requirement (or else it is unclear how it does).

What is the algorithm for deciding if a version is considered affected? The current spec does not provide details on how to evaluate the rules. To start, it is unclear whether the “versions” list must be grouped by “versionGroup” before further processing, so we’ll suppose there is a single group in our examples. It is also unclear which logical operator to apply to the version entries. Issue #12 says that rules should be evaluated with AND, which makes it impossible to list individual versions. For example:

"versions": [
  {"versionAffected": "=", "versionValue": "1.0.0"},
  {"versionAffected": "=", "versionValue": "1.1.0"},
]

The explanation in #12 is that this means “version = 1.0.0 AND version = 1.1.0”, which doesn’t match any version at all.

According to the answer in #12, expressing multiple disjoint ranges of versions is also not possible. For example:

"versions": [
  {"versionAffected": ">=", "versionValue": "1.0.0"},
  {"versionAffected": "<", "versionValue": "1.2.0"},
  {"versionAffected": ">=", "versionValue": "1.5.0"},
  {"versionAffected": "<", "versionValue": "1.6.0"},
]

Here it seems clear the intended interpretation would be

(version >= 1.0.0 AND version < 1.2.0) OR (version >= 1.5.0 AND version < 1.6.0),

but there is no obvious way to encode this. Using ! operators would also not work. There is no boolean normal form with only one logical operator (that is, only AND, or only OR).
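
To make the problem concrete, here is a minimal Python sketch. It assumes the AND-only reading from #12 and assumes plain dotted numeric versions compared component-wise (the current schema does not actually define an ordering). The helper names `parse`, `matches_all`, and `matches_intended` are hypothetical and exist only for this illustration.

```python
import operator

# Assumption: versions are plain dotted numbers compared component-wise.
def parse(version):
    return tuple(int(part) for part in version.split("."))

OPS = {"=": operator.eq, "<": operator.lt, ">=": operator.ge}

def matches_all(version, rules):
    """AND-only reading described in #12: every rule must hold."""
    v = parse(version)
    return all(OPS[r["versionAffected"]](v, parse(r["versionValue"])) for r in rules)

rules = [
    {"versionAffected": ">=", "versionValue": "1.0.0"},
    {"versionAffected": "<", "versionValue": "1.2.0"},
    {"versionAffected": ">=", "versionValue": "1.5.0"},
    {"versionAffected": "<", "versionValue": "1.6.0"},
]

def matches_intended(version):
    """The intended meaning: an OR of two half-open ranges."""
    v = parse(version)
    return (parse("1.0.0") <= v < parse("1.2.0")) or (parse("1.5.0") <= v < parse("1.6.0"))

candidates = ["0.9.0", "1.0.0", "1.1.5", "1.3.0", "1.5.2", "1.6.0"]
and_only = [c for c in candidates if matches_all(c, rules)]   # nothing can be >= 1.5.0 and < 1.2.0
intended = [c for c in candidates if matches_intended(c)]
```

Under the AND-only reading no candidate matches, while the intended reading selects 1.0.0, 1.1.5, and 1.5.2.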

A second, related problem with the current schema is that even the definitions of operators like “>=” are not algorithmically precise. Clearly these are not string comparisons: 1.2.0 < 1.10.0. But neither are they simple element-wise comparisons: in packagers using Semver, 1.2.0 > 1.2.0-alpha. In Maven, even the alphabetic parts do not compare with strict regularity. In particular, this ordering applies:

"alpha" < "beta" < "milestone" < "rc" = "cr" < "snapshot" < "" = "final" = "ga" < "sp"

An operator like “>=” cannot be applied without reference to a particular version ordering algorithm, and the CVE schema omits that information.
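
A short Python sketch of why the ordering must be named explicitly. The `semver_key` function is a simplified rendering of Semver precedence written for this illustration (not a complete or official implementation), and `MAVEN_QUALIFIER_RANK` only encodes the qualifier order quoted above, not the full Maven comparison algorithm.

```python
# Plain string comparison gets multi-digit components wrong.
string_says_older = "1.2.0" < "1.10.0"

def numeric_key(version):
    """Element-wise numeric ordering for plain dotted versions."""
    return tuple(int(part) for part in version.split("."))

numeric_says_older = numeric_key("1.2.0") < numeric_key("1.10.0")

def semver_key(version):
    """Simplified Semver precedence for MAJOR.MINOR.PATCH[-PRERELEASE][+BUILD]."""
    version = version.split("+", 1)[0]            # build metadata does not affect precedence
    core, _, prerelease = version.partition("-")
    release = tuple(int(part) for part in core.split("."))
    if not prerelease:
        return (release, 1, ())                   # a release outranks its prereleases
    identifiers = tuple(
        (0, int(ident), "") if ident.isdigit() else (1, 0, ident)  # numeric sorts below alphanumeric
        for ident in prerelease.split(".")
    )
    return (release, 0, identifiers)

# Qualifier order quoted in the text; equal ranks mean the qualifiers compare equal.
MAVEN_QUALIFIER_RANK = {
    "alpha": 0, "beta": 1, "milestone": 2, "rc": 3, "cr": 3,
    "snapshot": 4, "": 5, "final": 5, "ga": 5, "sp": 6,
}
```

The same pair of version strings can compare differently under each key, which is why a bare “>=” is not enough.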

The different operator variants are also confusing. For example, is there any difference between these two?

"versions": [
  {"versionAffected": ">=", "versionValue": "1.0.0"},
  {"versionAffected": "<", "versionValue": "1.2.0"},
]

"versions": [
  {"versionAffected": ">=", "versionValue": "1.0.0"},
  {"versionAffected": "!>=", "versionValue": "1.2.0"},
]

Or is this one any different from those two?

"versions": [
  {"versionAffected": ">=", "versionValue": "1.0.0"},
  {"versionAffected": "<", "versionValue": "1.2.0"},
  {"versionAffected": "!>=", "versionValue": "1.3.0"},
]

The result of “is this version affected?” should be a boolean yes/no, or at worst yes/no/maybe, but the current operators allow yes/no/maybe/undocumented, with no guidance as to what CVEs should do. Should tools treat “no” differently from “undocumented”? Is it a best practice to document all the negative ranges too? Why?

The CVE schema needs to address these deficiencies so that tools have clear algorithms for deciding whether a particular version is affected by a particular CVE.

OSV’s solution

The OSV schema addresses all these ambiguities as follows, and we suggest that CVE adopt its basic ideas. This is not the only possible solution, but we believe it is a good one.

The OSV schema supports both an enumeration of specific affected versions and an enumeration of specific affected ranges. The set of affected versions is the OR of the entries in these lists - there is never an AND.

A range specifies a contiguous range of versions according to some defined version ordering. Today, those are “SEMVER” (preferred), “GIT”, and “ECOSYSTEM”. The “GIT” and “ECOSYSTEM” (meaning “packager-defined ordering”) range types are not directly understandable by general-purpose tools; such ranges are extra information understandable only by special-purpose tools. A particular entry is required to ensure that all affected versions are either listed in the explicit enumeration or in a Semver-type range, both of which can be processed by standard, packager-independent algorithms.

Each range is an object with three fields: type (the ordering), introduced, and fixed. The affected versions are those >= introduced and < fixed. If introduced or fixed is omitted, then that end of the range is left open.

For packagers that use Semver ordering, such as Go, NPM, and Rust, it suffices to specify only ranges:

"affects": {
  "ranges": [
    {"type": "SEMVER", "introduced": "1.0.0", "fixed": "1.14.14"},
    {"type": "SEMVER", "introduced": "1.15.0", "fixed": "1.15.17"}
  ]
}
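
The matching rule for such an entry can be sketched in a few lines of Python. This is an illustration only: it assumes plain MAJOR.MINOR.PATCH versions with no prerelease tags, and the function names `semver_tuple`, `in_range`, and `is_affected` are hypothetical.

```python
# Assumption: plain MAJOR.MINOR.PATCH versions, no prerelease or build tags.
def semver_tuple(version):
    return tuple(int(part) for part in version.split("."))

def in_range(version, rng):
    """Half-open range: introduced <= version < fixed; a missing end is open."""
    v = semver_tuple(version)
    introduced = rng.get("introduced")
    fixed = rng.get("fixed")
    if introduced is not None and v < semver_tuple(introduced):
        return False
    if fixed is not None and v >= semver_tuple(fixed):
        return False
    return True

def is_affected(version, affects):
    """A version is affected if ANY SEMVER range contains it (OR, never AND)."""
    return any(
        in_range(version, rng)
        for rng in affects.get("ranges", [])
        if rng["type"] == "SEMVER"
    )

affects = {
    "ranges": [
        {"type": "SEMVER", "introduced": "1.0.0", "fixed": "1.14.14"},
        {"type": "SEMVER", "introduced": "1.15.0", "fixed": "1.15.17"},
    ]
}
```

With this entry, 1.14.13 and 1.15.0 are affected, while 1.14.14 and 1.15.17 (the fixed versions themselves) are not.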

For packagers that use other orderings, a packager-specific range can be listed, but the packager’s own vulnerability database tooling must “compile out” the range into an explicit list as well, for consumption by general-purpose tools, as in this Python example:

"affects": {
  "ranges": [
    {
      "type": "GIT",
      "repo": "https://github.com/pikepdf/pikepdf",
      "fixed": "3f38f73218e5e782fe411ccbb3b44a793c0b343a"
    },
    {
      "type": "ECOSYSTEM",
      "introduced": "2.8.0",
      "fixed": "2.10.0"
    }
  ],
  "versions": [
    "2.8.0", "2.8.0.post1", "2.8.0.post2", "2.9.0", "2.9.1", "2.9.2"
  ]
}

(The “GIT” range has an additional field “repo” to specify the URL of the source repository containing the given commits.)

The “versions” list specifies the same versions as in the “ECOSYSTEM” range, just in a more accessible way. General-purpose tooling would ignore the “GIT” and “ECOSYSTEM” ranges, relying instead on the “versions” list in this case.
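
A minimal sketch of how a general-purpose tool might consume this entry, assuming it understands only the explicit list and SEMVER ranges. The function names are hypothetical, and the Semver handling assumes plain MAJOR.MINOR.PATCH strings.

```python
affects = {
    "ranges": [
        {"type": "GIT", "repo": "https://github.com/pikepdf/pikepdf",
         "fixed": "3f38f73218e5e782fe411ccbb3b44a793c0b343a"},
        {"type": "ECOSYSTEM", "introduced": "2.8.0", "fixed": "2.10.0"},
    ],
    "versions": ["2.8.0", "2.8.0.post1", "2.8.0.post2", "2.9.0", "2.9.1", "2.9.2"],
}

def semver_tuple(version):
    # Assumption: plain MAJOR.MINOR.PATCH, only used for SEMVER ranges.
    return tuple(int(part) for part in version.split("."))

def usable_ranges(affects):
    """GIT and ECOSYSTEM ranges need special-purpose knowledge, so skip them."""
    return [r for r in affects.get("ranges", []) if r["type"] == "SEMVER"]

def is_affected(version, affects):
    # The explicit enumeration is checked first; it needs no ordering at all.
    if version in affects.get("versions", []):
        return True
    for rng in usable_ranges(affects):
        v = semver_tuple(version)
        lower_ok = "introduced" not in rng or semver_tuple(rng["introduced"]) <= v
        upper_ok = "fixed" not in rng or v < semver_tuple(rng["fixed"])
        if lower_ok and upper_ok:
            return True
    return False
```

Here no SEMVER range is present, so the answer comes entirely from string membership in "versions", which is why "2.8.0.post1" can be matched without knowing how Python orders post-releases.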

Potential CVE adaptation

We propose to change the current version schema from:

"versions": [{
  "versionGroup": string,
  "versionValue": string,
  "versionAffected": string,
  "platforms": [string],
  "references": [...],
}],

to:

"versions": [{
  "list": [string],
  "range": {
    "type": string,  // semver, git, or packager
    "fixed": string,
    "introduced": string,
    "repo": string,  // for type git only
  },
  "unsure": bool,
  "platforms": [string],
  "references": [...],
}],

The only combining operator is OR, making the algorithm for matching much clearer. A particular version would be considered affected if it is matched by any of the entries in the overall “versions” object list. A version is matched by an entry if it appears directly in the “list” or if it is in the “range”. This structure allows non-standard ranges to include their version lists in the same object, which is an improvement over the OSV schema, and it allows a particular range or list to be qualified by a “platform” list as well.

The “unsure” entry allows a range or list to be marked as unsure, equivalent to using the current ?>= etc operators.

The current !>= etc operators are removed: to say that a version is unaffected, leave it unlisted.
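
As a rough sketch of the matching algorithm this implies, the following Python returns yes/no/maybe for a version. Several things here are assumptions made for illustration: only "semver" ranges over plain MAJOR.MINOR.PATCH strings are evaluated, a definite match takes precedence over an "unsure" match, and the sample data and function names are hypothetical.

```python
def parse(version):
    # Assumption: plain MAJOR.MINOR.PATCH strings.
    return tuple(int(part) for part in version.split("."))

def entry_matches(version, entry):
    """An entry matches if the version is in its list or inside its range."""
    if version in entry.get("list", []):
        return True
    rng = entry.get("range")
    if rng and rng.get("type") == "semver":
        v = parse(version)
        lo, hi = rng.get("introduced"), rng.get("fixed")
        return (lo is None or parse(lo) <= v) and (hi is None or v < parse(hi))
    return False  # git and packager ranges are not computable here

def status(version, versions):
    """OR over all entries; assumed precedence: yes > maybe > no."""
    result = "no"
    for entry in versions:
        if entry_matches(version, entry):
            if not entry.get("unsure", False):
                return "yes"
            result = "maybe"
    return result

# Hypothetical record, not taken from any real CVE.
versions = [
    {"range": {"type": "semver", "introduced": "1.0.0", "fixed": "1.2.0"}},
    {"list": ["1.5.0", "1.5.1"], "unsure": True},
]
```

For this record, 1.1.0 yields "yes", 1.5.1 yields "maybe", and 1.3.0 yields "no" simply because it is unlisted.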

chandanbn commented 3 years ago

+1 for GIT commit IDs as they help locate the vulnerable instances of code more precisely than versions.

Software changes represent a tree (directed acyclic graph) structure. Each commit results in new software - a node in this tree. Each fork results in a new branch. Some nodes get labeled as versions.

For any given node and any given vulnerability:

the node is affected (say colored red)
the node is not affected (say colored green)
it is not known if the node is affected or not (say colored gray)

It is possible that we have two more colors (but we can ignore them for now for simplicity)

likely affected (pale red)
likely fixed (pale green)

The problem we have: given a three-colored tree, we need to encode/serialize the graph so that it captures this information accurately, with fewer ambiguities, and makes it easy to determine whether any given version is affected.

For example:

2.8 → 2.9 ---→ 2.10 → 2.11 → 2.12
           ↳  3.0 → 3.1 ----→ 3.2 → 3.3
                          ↳  4.0 → 4.1 → 4.2 → 4.3

Let's say 3.0 was branched off sometime before 2.10 was released, and 4.0 was branched off before 3.2 was released.

If the JSON was like:

"affects": {
  "ranges": [
    {"type": "SEMVER", "introduced": "2.10", "fixed": "4.3"},
  ]
}

Though it is easy to compute that 4.0 thru 4.2 are vulnerable, how do you determine the vulnerability status of 3.0 thru 3.3 and 2.8 thru 2.12 (except 2.10)? Does semver capture the information about how one linear branch is related to another?

oliverchang commented 3 years ago

Extending on that above example, this can be expressed like so:

Assuming that:

in 2.x, 2.9 introduced the vulnerability and a fix is available in 2.10
in 3.x, 3.0 inherited the vulnerability and a fix is available in 3.2
in 4.x, 4.0 inherited the vulnerability and a fix is available in 4.3

"affects": {
  "ranges": [
    {"type": "ECOSYSTEM", "introduced": "2.9", "fixed": "2.10"},
    {"type": "ECOSYSTEM", "introduced": "3.0", "fixed": "3.2"},
    {"type": "ECOSYSTEM", "introduced": "4.0", "fixed": "4.3"}, 
  ]
}

These conditions are evaluated with OR. A version is affected if it falls into any of those ranges. Using this we should be able to describe any set of ranges unambiguously. Describing ranges this way also makes them easier for human users to understand if they want to know which versions to upgrade to when they're impacted.
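
A small Python sketch of that OR evaluation over the branched example. "ECOSYSTEM" means a packager-defined ordering, so the component-wise numeric comparison used here is an assumption for illustration, and the names `parse`, `is_affected`, and `released` are hypothetical.

```python
# Assumption: the ecosystem orders versions component-wise as numbers.
def parse(version):
    return tuple(int(part) for part in version.split("."))

ranges = [
    {"type": "ECOSYSTEM", "introduced": "2.9", "fixed": "2.10"},
    {"type": "ECOSYSTEM", "introduced": "3.0", "fixed": "3.2"},
    {"type": "ECOSYSTEM", "introduced": "4.0", "fixed": "4.3"},
]

def is_affected(version):
    """Affected if the version falls in ANY range (introduced <= v < fixed)."""
    v = parse(version)
    return any(parse(r["introduced"]) <= v < parse(r["fixed"]) for r in ranges)

# Every version shown in the branch diagram above.
released = ["2.8", "2.9", "2.10", "2.11", "2.12",
            "3.0", "3.1", "3.2", "3.3",
            "4.0", "4.1", "4.2", "4.3"]
affected = [v for v in released if is_affected(v)]
```

Each branch gets its own range, so no knowledge of how the branches relate to one another is needed to answer the question for a given version.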

rsc commented 3 years ago

Proposal

Replace "versions" in the current product object with these fields:

  "affectedVersions": [{
    "range": string,      // "semver", "git", "other"; optional, missing means not a range
    "version": string,    // specific version, or start of range; required
    "before": string,     // range ends just before this version; required when range is present
    "unspecified": bool,  // true if vulnerability status is unspecified (as opposed to asserted vulnerable); optional (default false)
    "repo": string,       // for type git (repository holding code); optional
  }],

  "testedVersions": [{
    "version": string,    // specific version; required
    "vulnerable": bool,   // required
  }],

  "platforms": [string],
  "references": [string],

Rationale

The discussion at the quality working group meeting brought up two important points:

Vendors may not want to commit to a specific “introduced” version. This is the reason for the ?, which I had tried to model as unsure. It's not really “unsure” so much as “undisclosed” or “unspecified.” (“We don't want to say.”)
Security researchers can only report results for specific versions that they have tested. In that context it is useful to say things like “1.1.0 is vulnerable; 1.5.0 is not.”

At the meeting, we spent a little while trying to figure out how to get all that into a single version object. Afterward, based on additional thought and discussion, @ochang and I propose that it would make sense to have two separate lists, with different uses and consumers.

First, there is the “affected versions” list, which is ideally an algorithmically precise description of an answer to the question “does version X contain this vulnerability?” Perhaps there are three possible answers — yes, no, perhaps — but it's still clear what the answer is for any given version X.

This list is consumed by programs that users run such as SBOM-based vulnerability scanners. In this list, if a version is not listed, the implication is “no, it is not affected.” That is, there is no need to enumerate all the unaffected versions.
```
"affectedVersions": [{
  "range": string,      // "semver", "git", "other"; optional, missing means not a range
  "version": string,    // specific version, or start of range; required
  "before": string,     // range ends just before this version; required when range is present
  "unspecified": bool,  // true if vulnerability status is unspecified (as opposed to asserted vulnerable); optional (default false)
  "repo": string,       // for type git (repository holding code); optional
}],
```
A version object {"version": V} describes the single version V, for use when vendors need to enumerate the exact list of affected versions one at a time, such as when their version numbering isn't one of the known computable types.

Otherwise, a version object {"range": ..., "version": V, "before": W} describes the half-open range of versions v such that V ≤ v and v < W, according to the precise ordering defined by the "range" setting. (Note that W is not included.)

For a range, both "version" and "before" are required, but using the value "*" for "version" removes the lower bound, and using "*" for before removes the upper bound:
- {"range": "semver", "version": "*", "before": "1.2.3"} describes all versions before 1.2.3.
- {"range": "semver", "version": "1.2.3", "before": "*"} describes all versions 1.2.3 or later.
- {"range": "semver", "version": "*", "before": "*"} describes all versions ever.

Most projects will ignore unspecified, in which case listed versions are affected and unlisted ones are taken to be unaffected. When a vendor wants to cast doubt on a version without specifically identifying it as vulnerable, they can use

{"version": "1.2", "unspecified": true}

or

{"range": "semver", "version": "1.2.3", "before": "1.4.5", "unspecified": true}

Again, versions not explicitly listed are implicitly unaffected. There is no explicit "unaffected" status to cause confusion with the implicit status of not being listed.
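
The rules for this list can be sketched in Python as follows. This is an illustration under stated assumptions: only "semver" ranges over plain dotted numeric versions are evaluated, an asserted match takes precedence over an "unspecified" one, and the function names and sample record are hypothetical.

```python
def parse(version):
    # Assumption: plain dotted numeric versions.
    return tuple(int(part) for part in version.split("."))

def entry_matches(version, entry):
    if "range" not in entry:
        return version == entry["version"]        # single enumerated version
    if entry["range"] != "semver":
        return False                              # "git" / "other" need outside knowledge
    v = parse(version)
    lo, hi = entry["version"], entry["before"]
    # "*" removes the corresponding bound of the half-open range.
    return (lo == "*" or parse(lo) <= v) and (hi == "*" or v < parse(hi))

def affected_status(version, affected_versions):
    """Returns "affected", "unspecified", or "unaffected" (the default for unlisted)."""
    result = "unaffected"
    for entry in affected_versions:
        if entry_matches(version, entry):
            if not entry.get("unspecified", False):
                return "affected"
            result = "unspecified"
    return result

# Hypothetical record combining the forms shown above.
record = [
    {"range": "semver", "version": "1.2.3", "before": "1.4.5", "unspecified": True},
    {"version": "2.0.0"},
]
all_before = [{"range": "semver", "version": "*", "before": "1.2.3"}]
```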

Second, there is a separate “tested versions” list, for recording the results of security research.

This list is consumed only by security researchers and vendors, not programs. In this list, if a version is not listed, the implication is “there are no test results for this version.” That's a very different statement than the first list.
```
"testedVersions": [{
  "version": string,    // specific version; required
  "vulnerable": bool,   // required
}]
```
Here the version is explicitly deemed either vulnerable or not — there is no unspecified. Versions that have not been tested, or for which tests were inconclusive, are omitted from the list.

Another possible use of this second list would be automated systems that run proof-of-concept or other tests against specific versions. Such systems could record their results in the "testedVersions" list as a way to confirm (or refute) the claimed "affectedVersions".
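
As a hedged sketch of such a cross-check, the following Python compares test results against the claimed ranges and reports disagreements. It assumes plain dotted numeric versions and "semver" ranges, treats entries marked "unspecified" as making no firm claim, and uses hypothetical names and sample data.

```python
def parse(version):
    # Assumption: plain dotted numeric versions.
    return tuple(int(part) for part in version.split("."))

def claimed_affected(version, affected_versions):
    """True if any firm (not "unspecified") entry covers the version."""
    for entry in affected_versions:
        if entry.get("unspecified", False):
            continue
        if "range" not in entry:
            if version == entry["version"]:
                return True
        elif entry["range"] == "semver":
            lo, hi = entry["version"], entry["before"]
            v = parse(version)
            if (lo == "*" or parse(lo) <= v) and (hi == "*" or v < parse(hi)):
                return True
    return False

def contradictions(affected_versions, tested_versions):
    """Tested versions whose result disagrees with the affectedVersions claim."""
    return [
        t["version"]
        for t in tested_versions
        if t["vulnerable"] != claimed_affected(t["version"], affected_versions)
    ]

# Hypothetical record and test results.
affected_versions = [{"range": "semver", "version": "1.0.0", "before": "1.5.0"}]
tested_versions = [
    {"version": "1.1.0", "vulnerable": True},    # confirms the claim
    {"version": "1.5.0", "vulnerable": False},   # confirms the fix
    {"version": "1.2.0", "vulnerable": False},   # refutes the claim
]
```

Untested versions never appear in the result, which matches the "unlisted means no test results" reading of this list.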

These two separate lists seem to separate out two distinct use cases nicely, making it possible to serve both well with separate mechanisms, where before it seemed impossible to serve them well with a single mechanism.

At the working group meeting it sounded like there was consensus to move "platforms" out of the version list and into the outer product object. I have moved "references" out as well, since it seemed even more likely to be version-independent than "platforms".

[Edit: Streamlined the two objects a bit.]

rsc commented 3 years ago

I noted above that I moved platforms out, as discussed, and also references, since the same rationales seemed to apply. I think maybe we should move repo out as well. The code will come from a single repo that will not vary from version to version. So repo could go into the outer product object too.

chandanbn commented 3 years ago

testedVersions seems to be ok for making "not affected" assertions per single version. How do we encode a range of not affected?

Consider an example: a vuln is introduced in 2.12 and fixed in 2.14, but for some reason (like a mistake in resolving code conflicts) 2.16 and 2.17 are vulnerable again, and it then gets fixed in 2.18. (Such things are rare but do happen.)

.. 1.0 → 1.1 → 1.2 → 1.3 ... ... 2.10 → 2.11 → 2.12* → 2.13* → 2.14 → 2.15 → 2.16* → 2.17* → 2.18

(* = affected)

{"range": "semver", "version": "2.12", "before": "2.14"}
{"range": "semver", "version": "2.16", "before": "2.18"}

How do we affirmatively say 2.14 and 2.15 are not affected as a range? Let's say 1.x was an unmaintained branch that was never evaluated. Since the bug was introduced in 2.12, the CNA wants to assert 1.x is unlikely to be affected.

Instead of testedVersions and unspecified can we do this with an optional rangeAffected ['affected' (default), 'unaffected', 'unspecified', 'likely', 'unlikely']?

{"range": "semver", "version": "1.0", "rangeAffected": "unlikely"}
{"range": "semver", "version": "2.12", "before": "2.14"}
{"range": "semver", "version": "2.14", "before": "2.16", "rangeAffected": "unaffected"}
{"range": "semver", "version": "2.16", "before": "2.18"}

(a consumer can consider likely to be same as affected for vulnerability management)

Another suggestion is to add a new value for range : "patch" - for products that use patching (like CVE-2017-4905 )

product: "ESXi"
versions:
 { "range": "patch", "version": "6.5", "before": "ESXi650-201703410-SG"}  
 { "range": "patch", "version": "6.0 U3", "before": "ESXi600-201703401-SG"}  
 { "range": "patch", "version": "6.0 U2", "before": "ESXi600-201703403-SG"} 
...

cganas commented 3 years ago

+1 for a rangeAffected type quantifier. This would allow the schema to simplify affectedVersions and testedVersions into a common versions property. This ensures parity in affected and unaffected version expression without duplication of the field.

This change would ultimately open up the version property for more additions in subsequent minor versions if desired, such as likely and unlikely, without strictly breaking compatibility.

tcullum-rh commented 3 years ago

Hi @rsc . Thanks for providing that proposal!

A few things that come to mind if we were to adopt this, from a naive perspective (on purpose):

1.) How would I specify a range if the vuln isn't fixed yet and/or plans to fix are not known? Currently, I'd be required to provide a version and a before, but what if I don't know what before is since it doesn't yet exist?

2.) The unspecified boolean is still a bit muddy to me. This doesn't necessarily mean that I feel that it doesn't belong, but I think if we go this route, it'll be very important to document exactly what the semantics are behind it and provide some example cases as well. Do you have some specific scenarios in mind with how this would be used? If I'm a vendor, and I say "We're registering CVE-0000-00000 for project X, versions 1.0 before 1.9 are affected" and "versions 2.0 before 2.7 are affected, unspecified," what am I saying and how is it useful?

3.) I understand the use-case for testedVersions as described and in my experience, it is very common for researchers to identify that they've tested just 1 version, get e.g. an ASAN dump and then make an upstream bug on an open source project as one example. We as a vendor don't immediately know all of which versions are affected from that info alone and the researcher is not making any assertions outside of the reported version.

I guess my only question is (and perhaps it's just a question of naming): Is it semantically sound to separate testedVersions from affectedVersions? Other than testing, how are claims for affectedVersions being made? Are we implying that affectedVersions have not been tested if they appear in one list and not the other? Again, I get the underlying use-case, but that's what I understand when I consume both lists. Is affectedVersions only to be used exclusively for reports from security researchers?

Put another way: suppose I'm a security engineer at a vendor, and I'm assigning a CVE for package foo. I've tested version 1.5 and found it to be vulnerable by actually reproducing the flaw, but I'm told by upstream that it affects all versions before 1.5 as well. Would I set affectedVersions to e.g. {"range": "semver", "version": "1.0", "before": "1.6"} and testedVersions to {"version": "1.5", "vulnerable": true}, because I actually tested/reproduced on 1.5 but I'm told it affects those prior versions as well, which I have not tested for whatever reason (such as not supported, not shipped, etc.)? What about platforms here? As in, researcher tested on platform X but vendor reports affected versions on platform Y? Just trying to confirm that I understand the usage.

Thanks again for doing this, and I'd re-iterate the importance of documenting the implications you mentioned such as:

Versions that have not been tested, or for which tests were inconclusive, are omitted from the list.

In this list, if a version is not listed, the implication is “no, it is not affected.” That is, there is no need to enumerate all the unaffected versions.

etc... Some areas of the schema are widely up to user interpretation for usage, and others it seems beneficial for the community to have some conformity on, so just want to ensure we make those areas well known, as this data is only useful when interpreted properly.

Lastly, I think you meant to tag @oliverchang ! :)

rsc commented 3 years ago

@cganas thanks for the feedback.

+1 for a rangeAffected type quantifier. This would allow the schema to simplify affectedVersions and testedVersions into a common versions property. This ensures parity in affected and unaffected version expression without duplication of the field.

A common versions property has the problem of not having a clear meaning for versions not explicitly listed. The two different use cases have two different natural semantics:

When a vendor issues a CVE, they presumably don't want to list every unaffected version. In that case, it makes sense to have unlisted = unaffected.
When a researcher issues a CVE, they are not claiming that unlisted versions are unaffected, only that their research has not shown them to be affected. In that case, it makes sense to have unlisted = untested.

Merging these different natural semantics into a single field makes the meaning of unlisted contradictory and unclear.

It seems like a significant step forward in clarity to separate the two uses.

rsc commented 3 years ago

@tcullum-rh thanks for the feedback

1) The idea was to use "before": "*" to explicitly indicate "there is no upper end to this range". And then if a fix was issued later you'd update the record, of course.

2) I am not at all attached to "unspecified" as the name for this middle-ground, nor do I really claim to understand the use case. What I thought I heard at the meeting was that a vendor wants a way to tell users "act as though these are vulnerable" without actually claiming (or admitting?) that they are in fact vulnerable. Suggestions welcome.

3) I agree with what happens in your scenario. Generally, I think the answers come down to what is consuming these fields. affectedVersions is for programs reporting vulnerabilities to users, doing automated upgrades, etc. In that case, the goal is to list the ranges for which action should be taken (along with perhaps the qualifier on the confidence that action really is needed, from 2). testedVersions is for researchers to document what they've tested and doesn't feed into the same automated systems. A security researcher at a vendor is probably focused on the first use case, in which it's probably enough to just list affectedVersions and not bother with testedVersions at all, even though some testing has been done of course.

Thanks for pointing out the username snafu. Sorry @oliverchang!

rsc commented 3 years ago

@chandanbn thanks for the feedback.

testedVersions seems to be ok for making "not affected" assertions per single version. How do we encode a range of not affected?

I think it would be fine to encode ranges there, although a researcher without access to the source code repo may have difficulty making such broad assertions. We could do it the same way as in the affectedVersions list.

Instead of testedVersions and unspecified can we do this with an optional rangeAffected ['affected' (default), 'unaffected', 'unspecified', 'likely', 'unlikely']?

This runs back into the issue I was trying to solve with the split, which I mentioned in https://github.com/CVEProject/cve-schema/issues/87#issuecomment-894647308 as well.

Specifically, once there is an explicit status that has the same meaning as not listing a version at all, then it becomes unclear whether you are supposed to list things explicitly or not. Consider:

affectedVersions: [{"version": "1.2.3"}]

testedVersions: [{"version": "1.2.3", "vulnerable": true}]

What does each say about 1.4.5?

The idea was that the affectedVersions line says (by not listing it) that 1.4.5 is unaffected, and similarly the testedVersions line says (by not listing it) that 1.4.5 is untested, which is a different statement.

If there is a single field, then unlisted can have only one meaning.

If unlisted means unaffected, then the security researcher has to write something like:

versions: [
    {"version": "*", "before": "1.2.3", "status": "untested"},
    {"version": "1.2.3", "status": "affected"},
    {"version": "1.2.4", "before": "*", "status": "untested"},
]

when all they really want to say is "there's a vulnerability in 1.2.3".

If unlisted means status unknown, then the vendor issuing instructions to users needs to write

versions: [
    {"version": "*", "before": "1.2.3", "status": "unaffected"},
    {"version": "1.2.3", "status": "affected"},
    {"version": "1.2.4", "before": "*", "status": "unaffected"},
]

when all they really want to say is "only 1.2.3 is affected".

It seems like inevitably people are going to write the 1-line version when they "should" be writing the 3-line versions.

The two different fields allow two different defaults, which should make the authoring of these more natural and less prone to error, as well as clearer in meaning.

chandanbn commented 3 years ago

IMHO we are solving two problems here:

Capture information about affected versions (so it is easy for humans and tools to encode)
Provide guidance to interpret the record (so a tool can determine if any given version is affected or not).

When software versioning is linear, listing the affected range is sufficient. A tool should interpret versions outside of the range as 'unaffected'. This proposal is perfectly adequate and intuitive to use.

The difficulty comes when software has multiple concurrently maintained branches (e.g., Linux, OpenSSL). Ranges that span multiple branches may not make sense. Often the CVE assigner does not make statements about older branches: they may not be listed in a CVE, yet are likely affected. Without this additional context (like EOL status), a tool can misreport an older version as unaffected. That is dangerous because people may have a vulnerability they should care about, but tools may fail to warn them.

Take https://www.linuxkernelcves.com/cves/CVE-2021-3655

versionGroup: fixed version
4.14: 4.14.240
4.19: 4.19.198
5.10: 5.10.51
5.12: 5.12.18
5.13: 5.13.3
5.4: 5.4.133

Since 4.15.1 isn't listed there should a tool report it as unaffected?

My suggestion to solve the info capture problem:

Split the entire version tree into linear range segments. Each range is either affected, unaffected, or unknown.
Group ranges that form linear segments themselves (represents a maintained branch or a fork). Identify them with a versionGroup. Most open-source software with linear versioning will simply have one versionGroup (and can be omitted from the record)
Within each versionGroup list the affected ranges.

To interpret the records:

Any version that falls in an affected range is affected. Any version that falls in an explicitly unaffected range is unaffected.
a. If a versionGroup is not defined: anything outside of the range is unaffected.
b. If a versionGroup is defined: anything not listed in the scope of the versionGroup is unaffected.
A version not covered by any range or versionGroup should be interpreted as "unknown" or "likely affected" (e.g., EOL branches), except when the newest listed versionGroup indicates a fix. In that case, it should be interpreted as "likely unaffected" (e.g., future branches that did not exist at the time of CVE assignment).

At the minimum when there is only one range of affected versions, this is sufficient:

versions: [
   { version: '*', before : '5.14-rc1' }
]

When there are branches with multiple ranges, this should be sufficient:

versions: [
   { versionGroup: 4.14, start: 4.14.0, before: 4.14.240 }
   { versionGroup: 4.19, start: 4.19.0, before: 4.19.198 }
   { versionGroup: 5.10, start: 5.10.0, before: 5.10.51 }
]

A few optional entries reinforce the facts and would help tooling make accurate determinations.

   { start: 16.0.0, status: 'not-affected' }

How about:

versions: [
 {
   "range": [ semver, git, patch, other ] // optional, missing means not a range
   "versionGroup": string //  (optional) represents a version branch, group, or a major version (e.g. 10.0, 3.1.*) where these ranges are meaningful.
   "version": string,    // specific version, or start of range; required
   "before": string,     // range ends just before this version; required when range is present
   "status": [ affected (default), unaffected, undefined, likely-affected, unlikely-affected] // optional, consider 'affected' if absent
 }
]

oliverchang commented 3 years ago

My concern with versionGroup as-is is that it relies on the consumer of such entries to know how to map version to a versionGroup. There could be many different ways to do so, depending on the versioning scheme or ecosystem.

We may want something like this to describe a "versionGroup", instead of just a "string".

{ versionGroup: { start: 4.14.0, before: 4.15.0 }, version: 4.14.0, before: 4.14.240 }

This does make the entries a bit difficult to read as a human if they're inline (because there are two entries in each), so perhaps it could be indirect by adding a new field to define versionGroups, and have the individual ranges reference that (as per your examples).

"versionGroups": {
  "4.14": {
    "start": "4.14.0",
    "before": "4.15.0",
  }
}

"versions": [ { versionGroup: 4.14, version: 4.14.0, before: 4.14.240 } ]

On interpreting these entries: I think different consumers will want some flexibility depending on risk / noise appetite, as it's ultimately up to the consumer how to deal with incomplete data.

For example, they could assume (or know) the data is high quality/complete, ignore versionGroup altogether, and assume anything that's unlisted is strictly unaffected (rather than unspecified / unknown).

My understanding is that grouping ranges by versionGroup creates some implicit "unspecified" ranges (i.e. any unspecified groups of versions are implied to be "unspecified"). So, a consumer could also do as you suggested: where unspecified (implicit or explicit) is assumed to be likely vulnerable.

Using this as an example again:

versions: [
   { versionGroup: 4.14, version: 4.14.0, before: 4.14.240 }
   { versionGroup: 4.19, version: 4.19.0, before: 4.19.198 }
   { versionGroup: 5.10, version: 5.10.0, before: 5.10.51 }
]

Testing 4.14.241, this matches group 4.14, does not match any ranges there. This is unambiguously unaffected.

Testing 4.15.0, No groups matched, which means 4.15.0 is unspecified. This is up to the consumer how to interpret it.

Testing 5.11.0, No groups matched, so it's unspecified, but it's higher than any listed before (i.e. "5.10.51"), so a consumer may interpret this as unaffected.

Ignoring versionGroup completely also has the same effect as treating "unspecified" ranges as "unaffected".
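
The test cases above can be replayed with a short sketch. It assumes versions are plain dotted integers and that each versionGroup carries explicit start/before bounds, as suggested earlier in this comment. The bounds for 4.19 and 5.10, and the names `parse`, `VERSION_GROUPS`, `RANGES`, and `classify`, are illustrative only and not part of any schema.

```python
def parse(v):
    """Parse a dotted numeric version such as '4.14.240' into a tuple of ints."""
    return tuple(int(p) for p in v.split("."))

# Assumed group bounds (only the 4.14 bounds appear in the discussion).
VERSION_GROUPS = {
    "4.14": ("4.14.0", "4.15.0"),
    "4.19": ("4.19.0", "4.20.0"),
    "5.10": ("5.10.0", "5.11.0"),
}

RANGES = [
    {"versionGroup": "4.14", "version": "4.14.0", "before": "4.14.240"},
    {"versionGroup": "4.19", "version": "4.19.0", "before": "4.19.198"},
    {"versionGroup": "5.10", "version": "5.10.0", "before": "5.10.51"},
]

def classify(version):
    """Return 'affected', 'unaffected', or 'unspecified' for one version."""
    v = parse(version)
    for name, (start, before) in VERSION_GROUPS.items():
        if parse(start) <= v < parse(before):
            # Inside a known group: the ranges for that group are authoritative.
            for r in RANGES:
                if (r["versionGroup"] == name
                        and parse(r["version"]) <= v < parse(r["before"])):
                    return "affected"
            return "unaffected"
    # No group matched: the record says nothing about this version.
    return "unspecified"
```

Under these assumptions, `classify("4.15.0")` and `classify("5.11.0")` both come back as "unspecified"; how to treat that result is left to the consumer.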

Does my understanding seem correct?

In any case, this doesn't change the meaning of {version, before} within a versionGroup -- because a version that doesn't match any (non-unspecified) ranges within a group still unambiguously means "unaffected". So I don't know if it answers whether we need both affectedVersions and testedVersions for the reasons @rsc outlined in https://github.com/CVEProject/cve-schema/issues/87#issuecomment-894649276 ?

iamamoose commented 3 years ago

Chandan proposes " { versionGroup: 4.14, start: 4.14.0, before: 4.14.240 } " and this matches my experience handling vulnerability metadata for OpenSSL and various Apache projects (where they are not semver).

For OpenSSL we combined having a 'fixed version' (for a given major version) along with listing all the known affected versions individually: https://www.openssl.org/news/vulnerabilities.xml

<affects base="1.1.1" version="1.1.1e"/>
<affects base="1.1.1" version="1.1.1f"/>
<fixed base="1.1.1" version="1.1.1g" date="20200421">
<git hash="eb563247aef3e83dda7679c43f9649270462e5b1"/>
</fixed>

which would become " { versionGroup: 1.1.1, start: 1.1.1e, before: 1.1.1g } "

<affects base="1.1.1" version="1.1.1a"/>
<affects base="1.1.1" version="1.1.1b"/>
<affects base="1.1.1" version="1.1.1c"/>
<affects base="1.1.1" version="1.1.1d"/>
<affects base="1.0.2" version="1.0.2"/>
<affects base="1.0.2" version="1.0.2a"/>
<affects base="1.0.2" version="1.0.2b"/>
...
<affects base="1.0.2" version="1.0.2t"/>
<fixed base="1.1.1" version="1.1.1e" date="20191206">
<git hash="419102400a2811582a7a3d4a4e317d72e5ce0a8f"/>
</fixed>
<fixed base="1.0.2" version="1.0.2u" date="20191220">
<git hash="f1c5eea8a817075d31e43f5876993c6710238c98"/>
</fixed>

which would become " { versionGroup: 1.1.1, start: 1.1.1, before: 1.1.1e } , { versionGroup: 1.0.2, start: 1.0.2, before: 1.0.2u } , "

Problem 1: quite often the OSS project doesn't have resources to make sure we know "earliest affected version" (for example it might be too hard to determine what old things are affected particularly if things got refactored). So does the lack of 1.0.2 in that first example mean it's not vulnerable (which it does) or that we no longer look at how 1.0.2 is affected?

Problem 2: So if there is an old EOL branch it's quite likely the OSS project won't even look if that one was vulnerable. So how about the OpenSSL 0.9.8 version? As the upstream we don't tell you. But other consumers of OpenSSL who patched it after upstream stopped (like long life distro branches, Red Hat etc), probably did that work to figure out all the affected EOL versions too.

Second example which is similar, before I switched ASF httpd to JSON 4.0....

view-source:https://web.archive.org/web/20200416103646/http://httpd.apache.org/security/vulnerabilities-httpd.xml

<fixed base="2.4" version="2.4.27" date="20170711"/>
<fixed base="2.2" version="2.2.34" date="20170711"/>
...
<affects prod="httpd" version="2.4.1"/>
...
<affects prod="httpd" version="2.2.0"/>

So that would become " { versionGroup: 2.2, start: 2.2.0, before: 2.2.34 } , { versionGroup: 2.4, start: 2.4.1, before: 2.4.27 } , "

But for ASF when we hadn't verified but it looked plausible....

<maybeaffects prod="httpd" version="2.0.49"/>

(Although for the JSON format I just lazy converted those into 'affects')

(We also had the occasional "won't fix" where "2.2. is affected, we didn't fix it in 2.2" and the occasional "2.2. is affected, it's fixed by an available patch/svn head, but not in any released version")

Problem 3: Distro versions will vary. You could normally just say this is out of scope, but it's likely most of the users of say OpenSSL will be using a distro packaged version. And they backport security fixes. It's why at Red Hat we introduced OVAL for all our errata so you could map a given Red Hat RPM version of (Apache HTTP Server, OpenSSL, anything) to CVE.

chandanbn commented 3 years ago

Ignoring versionGroup completely also has the same effect as treating "unspecified" ranges as "unaffected".

Does my understanding seem correct?

As you said, if the data set is complete, we don't need versionGroup. A tool can easily say anything unlisted is unaffected. When the data is incomplete (and it will often be), telling consumers/tools to assume the unlisted is unaffected is dangerous.

Take CVE-2021-33909 for example: it was introduced by commit 058504edd02667eef8fac9be27ab3ea74332e9b4 in Linux Kernel 3.16. It was fixed by commit 8cae8cd89f05f6de223d63e6d15e31c8ba9cf53b in a v5.14-rc branch.

Whoever requested the CVE at the time of assignment may have said it affected Linux Kernel from 3.16 to before 5.13.4, which was likely the only information available at the time. That is sufficient to get a CVE - we should not be waiting for all the information to be available.

Now that vulnerability seems to have been fixed in each of the actively maintained Linux kernel branches, each with a different commit ID, e.g.:

4.14 --> before: 3c07d1335d17ae0411101024de438dbc3734e992
4.19 --> before: 6de9f0bf7cacc772a618699f9ed5c9f6fca58a1d
5.13 --> before: 71de462034c69525a5049fbdf3903c5833cbce04

The entry in OSV seems to have picked only one affected range with a fix commit id for just one branch, 4.14. So the list of versions listed as affected is not telling the whole truth. For example, it does not list 5.13.3 as affected. If one were to take anything not listed as unaffected, then a tool consuming that data would wrongly (and dangerously) say 5.13.3 is unaffected, which is not true here.

I believe we all agree:

Getting a completely accurate vuln to software mapping is hard in some cases. What tools and humans generate can be incomplete or change over time. CVE assignment/ publishing record should not wait for this.
Without a complete data set plus the information about branching and if ranges span branches, it is impossible for a tool to make the affected/not-affected determinations.
Capturing machine readable information about branching seems out of scope for CVE. (Question: Does semvers have a convention for how branches are versioned?)

Given the above:

We try to make it easier for people to capture this information (even if partial) in a consistent, intuitive, and uniform way.
Provide ways to capture assertive not-affected statements since many CNAs state that in the CVE descriptions.
Provide a way to limit the scope of assertions (versionGroup) so datasets are at least complete for some areas.
Provide heuristics for tools to make sense of partial information so they can still make safer affected/likely-affected/not-affected determinations.

oliverchang commented 3 years ago

The entry in OSV seems to have picked only one affected range with a fix commit id for just one branch 4.14. So the list of versions listed as affected is not telling the whole truth. For eg., it does not list 5.13.3 as affected. If one were to take anything not listed as unaffected, then a tool consuming that data would wrongly (and dangerously) say 5.13.3 is unaffected which is not true here.

Thanks for flagging this example! This was actually an intentional decision by the providers of this data to track different branches in different vulnerability IDs. For example, for the 5.13 branch, this is tracked by https://osv.dev/vulnerability/UVI-2021-1001182. There are other variations for different branches, and with open source we have the ability to be precise/complete with tooling to detect cherry picks across branches etc.

But yes, I understand the concern with incomplete data in general!

Capturing machine readable information about branching seems out of scope for CVE. (Question: Does semvers have a convention for how branches are versioned?)

I don't believe semver (or most versioning schemes) enforces any conventions around branch versioning. If we provide clear rules on how to match a version to a group by saying it's a string prefix (i.e. "versionGroup": "2.4."), perhaps that will be sufficient to avoid having to capture explicit branch information?

Given the above:

We try to make it easier for people to capture this information (even if partial) in a consistent, intuitive, and uniform way.

Provide ways to capture assertive not-affected statements since many CNAs state that in the CVE descriptions.

Provide a way to limit the scope of assertions (versionGroup) so datasets are at least complete for some areas.

Provide heuristics for tools to make sense of partial information so they can still make safer affected/likely-affected/not-affected determinations.

What you proposed with versionGroups seems like it should address most of these, but I think it adds a fair bit of complexity and edge cases for processors to handle.

Perhaps another flatter alternative, and one that tries to make the two cases (complete vs incomplete data) more explicit would be:

"versions": [
 {
   "range": string,
   "version": string,    // specific version, or start of range; required
   "before": string,     // range ends just before this version; required when range is present
   "status": string // optional can be "affected" (default) / "unaffected". 
 }
]

"versionsInfo": {
   "complete": bool,  // true or false based on if the provider/CNA believes the versions are comprehensive. 
   "knownVersionPrefixes": [ string ] // required if complete == false
 }

Semantics

When a version is not included in the list of version ranges, it means that the version is

"unspecified", if versionsInfo.complete is false.
"unaffected", if versionsInfo.complete is true. A "status": "unaffected" entry is redundant in this case.

status: "unaffected" and status: "affected" ranges cannot overlap in any way.

When versionsInfo.complete is false, versionsInfo.knownVersionPrefixes must be specified with at least one prefix.

@chandanbn you also had "undefined, likely-affected, unlikely-affected" in your status, but I think these aren't needed because:

undefined is implied by lack of existence (if versionsInfo.complete is false)
likely-affected, unlikely-affected depends on the context of evaluating these conditions and should be an output of the algorithm instead (see below).

An algorithm to interpret these results

An algorithm can give four possible results about an input version: "affected", "unaffected", "likely-affected", "likely-unaffected".

If versionsInfo.complete is true, checking if a version is "affected" just entails checking if the version is included in any provided version ranges (with status "affected"). Otherwise it's "unaffected".

If versionsInfo.complete is false, a version is still checked against all the provided version ranges. If it matches a range, then it should be either "affected" or "unaffected" based on the range's status.

Otherwise, it's "unspecified".

If the version is unspecified at this point, then tooling can interpret it like so:

If the version matches a listed version prefix in versionsInfo.knownVersionPrefixes, then it's "unaffected".
If the version does not match any versionsInfo.knownVersionPrefixes, and it's greater than or equal to max(before) in all ranges, then it's "likely-unaffected", because it likely indicates a version that came in a later branch.
Otherwise, the version should be "likely-affected".
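
A minimal sketch of the evaluation described above, assuming every entry in versions is a range with plain dotted-integer versions and that prefixes are matched by simple string prefixing. The function names and the sample record are hypothetical.

```python
def parse(v):
    """Parse a dotted numeric version into a tuple of ints (assumed format)."""
    return tuple(int(p) for p in v.split("."))

def evaluate(version, versions, versions_info):
    """Return affected / unaffected / likely-affected / likely-unaffected."""
    v = parse(version)

    # Step 1: an explicit range always wins, using its own status.
    for r in versions:
        if parse(r["version"]) <= v < parse(r["before"]):
            return r.get("status", "affected")

    # Step 2: with complete data, anything unlisted is unaffected.
    if versions_info["complete"]:
        return "unaffected"

    # Step 3: the version is "unspecified"; apply the heuristics.
    if any(version.startswith(p) for p in versions_info["knownVersionPrefixes"]):
        return "unaffected"
    if v >= max(parse(r["before"]) for r in versions):
        return "likely-unaffected"
    return "likely-affected"

# Hypothetical record used only to exercise the function.
VERSIONS = [
    {"version": "4.14.0", "before": "4.14.240"},
    {"version": "5.10.0", "before": "5.10.51"},
]
INFO = {"complete": False, "knownVersionPrefixes": ["4.14.", "5.10."]}
```

With this sample record, 4.19.5 falls outside every known prefix and below the highest `before`, so the sketch reports it as "likely-affected", while 5.11.0 is reported as "likely-unaffected".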

@rsc @chandanbn what do you think? I think if we do it this way, we can also stick with a single versions list.

chandanbn commented 3 years ago

@oliverchang I like an indicator of completeness (versionsInfo.complete).

versionsInfo.knownVersionPrefixes seems like an aggregation of versionGroups. Not sure if we are achieving anything by separating them out to a different field.

Having some guidance on how to record a versionGroup name should also help tooling. Prefix matching can be tough unless there is an odd looking period at the end (2.4 will match 2.41.3, so it should be either recorded as 2.4. or 2.4.*). Prefix/glob matching may not work when a product does patching instead of semver:

product: 'Windows'
versions: [
  versionGroup: '10', before: 'patch-6' 
  versionGroup: '11', before: 'patch-2' 
]
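
The prefix pitfall mentioned above (2.4 also matching 2.41.3) is easy to reproduce. The sketch below contrasts a naive string-prefix test with a component-wise comparison; both helper names are hypothetical, and the component-wise variant is just one possible rule, not something the schema defines.

```python
def matches_prefix(version, prefix):
    """Naive string-prefix test, as a consumer might first implement it."""
    return version.startswith(prefix)

def matches_components(version, group):
    """Compare whole dot-separated components instead of raw characters.

    Trailing '.' or '.*' on the group name is ignored, so '2.4', '2.4.'
    and '2.4.*' are all treated as the same group.
    """
    wanted = group.rstrip(".*").split(".")
    return version.split(".")[:len(wanted)] == wanted
```

Neither variant helps with the patch-level example above, where the group ('10') and the boundary ('patch-6') are not part of one dotted version string.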

oliverchang commented 3 years ago

versionsInfo.knownVersionPrefixes seems like an aggregation of versionGroups. Not sure if we are achieving anything by separating them out to a different field.

I think it simplifies the evaluation algorithm and prevents some edge cases when dealing with open ranges within a group.

e.g.

{"versionGroup": "4.14", before: "*"}
{"versionGroup": "4.15", before: "*"}

The interpretation here would be, everything in 4.14 and 4.15 is affected.

In the case this describes an incomplete set of versions, suppose we have "4.16.1". It should be "likely-unaffected" because it's newer than all versions, but there are no actual versions to compare it to in the two ranges (they're both "*"). There would have to be a way to compare "4.16.1" to an actual group ("4.15"), which seems difficult to do in a generalisable way.

It also adds complexity to evaluating these rules even if this describes a complete set of versions.

Having some guidance on how to record a versionGroup name should also help tooling. Prefix matching can be tough unless there is an odd looking period at the end (2.4 will match 2.41.3, so it should be either recorded as 2.4. or 2.4.*). Prefix/glob matching may not work when a product does patching instead of semver.

Sure, but I think since versionGroup/Prefix is essential to determining if a version is affected, it needs to be unambiguously computable by tooling. I think we will need either prefix (or pattern matching/regex) for that.

Re patching, perhaps another way would be to just have:

{version: '10', before: 'patch-6', "type": "patch"}
{version: '11', before: 'patch-2', "type": "patch"}

? That way, versionGroup/Prefix can have consistent automatable rules.

rsc commented 3 years ago

@chandanbn thanks for the example of the Linux kernel vulnerability. It looks like that bug may go back all the way to 2.6.12 and no one has taken the time to figure out exactly which versions are affected, which is a great case to try to encode.

@oliverchang and I spoke for a while and didn't come up with an obvious win yet. We'll circle back early next week.

rsc commented 3 years ago

This issue is about making version information computable, meaning that there is a clear algorithm IsVersionAffected that takes as input a CVE record and a specific version and answers the question “is this version affected by this CVE?”

There are two concerns: (1) defining something precise enough for an algorithm to implement, and (2) defining something clear enough that people writing these records - and also the people implementing the algorithm - get it right.

There are many, many ways to do (1) but relatively fewer ways to do (2).

We already have the problem of needing to define specific version types to make even a less-than comparison work. A versionGroup adds another kind of definition on top of that. Also, version groups assume a particular development model that may or may not hold. For example if v4 and v5 are being developed independently, then you might want to say that it is fixed in v4.19.2 onward within v4 (including v4.20 but not including v5) and then separately also fixed in v5 starting at v5.13.4.

It seems like it would be better to have fewer concepts if we can, which is to say leave versionGroup out if we can.

I think we should separate out point-wise assertions from ranges, because pointwise assertions don't require understanding the relative ordering of versions. Suppose we did this:

versionList: [{
    version: specific version
    status: unknown / affected / unaffected
}]
versionRanges: [{
    type: string
    initialStatus: unknown / affected / unaffected (optional; default unknown)
    statusChanges: [{
        start: version where status changes
        status: unknown / affected / unaffected
    }]
}]

This would replace both the affectedVersions and testedVersions in my previous attempt.

If a version appears explicitly in the version list, then the answer is the given status. That's the easy part.

Otherwise, we consult the ranges. Each range specifies the version type (semver, git, linux, etc) and an optional initial status and then a "timeline" ("versionline"?) of where the status changes. For the Linux kernel bug we could use:

versionRanges: [
    {
        type: linux
        initialStatus: unaffected
        statusChanges: [
            {start: v3.16, status: affected}
            {start: v4.19.198, status: unaffected}
            {start: v4.20, status: affected}
            {start: v5.13.4, status: unaffected}
        ]
    }
]

This effectively encodes this picture of the version timeline:

  |  unaffected at start of timeline
  |
  | 
  o  v3.16 changes to affected
  X
  X
  X
  o  v4.19.198 changes to unaffected
  |
  |
  |
  o  v4.20 changes back to affected
  X
  X
  X
  o  v5.13.4 changes back to unaffected
  |
  |
  |  unaffected for rest of timeline

Normally you'd have only one versionRange for a given type. This particular issue might add a second range of type "git" to list the specific commit hashes.

The algorithm is to find the versionRange for the type of version you are holding and then do:

status = initialStatus
for c in statusChanges
    if version >= c.start
        status = c.status
return status

This seems pretty clear for both readers and programmers.
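
For concreteness, here is a runnable sketch of that loop applied to the Linux example. It assumes the "linux" type can be compared as dotted integers after stripping a leading "v", and it sorts the entries defensively; `parse` and `is_version_affected` are hypothetical names.

```python
def parse(v):
    """Parse 'v4.19.198' or '4.20' into a tuple of ints (assumed 'linux' ordering)."""
    return tuple(int(p) for p in v.lstrip("v").split("."))

def is_version_affected(version_range, version):
    """Walk the status timeline and return the last status that applies."""
    status = version_range.get("initialStatus", "unknown")
    changes = sorted(version_range["statusChanges"],
                     key=lambda c: parse(c["start"]))
    for c in changes:
        if parse(version) >= parse(c["start"]):
            status = c["status"]
    return status

LINUX_RANGE = {
    "type": "linux",
    "initialStatus": "unaffected",
    "statusChanges": [
        {"start": "v3.16", "status": "affected"},
        {"start": "v4.19.198", "status": "unaffected"},
        {"start": "v4.20", "status": "affected"},
        {"start": "v5.13.4", "status": "unaffected"},
    ],
}
```

Because tuples compare component by component, (4, 20) sorts after (4, 19, 198) and before (4, 20, 0), which matches the timeline picture above for release versions.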

I think this encodes the ranges clearly and without the duplication that's needed for a list of [start,before) spans (where each one's before is usually the next one's start).

It also explicitly allows status "unknown" (and makes that the default), and we could add status "likely" or "probable" if necessary.

Thoughts?

chandanbn commented 3 years ago

@rsc Wouldn't this be essentially restricting the use of existing versionAffected to '>=', '!>='? If that restriction yields less ambiguous and more machinable records then reduction in expressibility is ok.

if version >= c.start

Isn't the comparison here still the version-tree (directed acyclic graph) based comparison?

For git, one must query the SCM to find whether one commit hash is before or after another commit hash. Since we capture the git repo URL, I feel this is computable.

For semvers or anything else, I see a few requirements:

the list has to be first sorted on start versions (easy).
should have at least one entry for the start of every branch, if the previous branch had a fix. This has to be first version of that branch (hard, because not everyone may recollect the first version in a branch). In the example, 4.20.0-rc1 is likely the first start value. Otherwise the algorithm will say 4.20.0-rc1 is unaffected. 4.20.0-rc1 is greater than 4.19.198 but less than 4.20 using semver comparison.
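
The pre-release concern can be demonstrated with a small sketch. It uses a simplified semver ordering key in which a pre-release sorts before its release; pre-release tags are compared as plain strings, which is a simplification of the precedence rules on semver.org. All names here are hypothetical.

```python
def semver_key(v):
    """Simplified ordering key: (core numbers, is_release, pre-release tag)."""
    core, _, pre = v.partition("-")
    return (tuple(int(p) for p in core.split(".")), pre == "", pre)

def status_for(version, initial_status, status_changes):
    """Apply the status timeline using the simplified semver ordering."""
    status = initial_status
    for c in sorted(status_changes, key=lambda c: semver_key(c["start"])):
        if semver_key(version) >= semver_key(c["start"]):
            status = c["status"]
    return status

CHANGES = [
    {"start": "3.16.0", "status": "affected"},
    {"start": "4.19.198", "status": "unaffected"},
    {"start": "4.20.0", "status": "affected"},
    {"start": "5.13.4", "status": "unaffected"},
]
# Same timeline with an explicit entry for the first version of the branch.
WITH_RC = CHANGES + [{"start": "4.20.0-rc1", "status": "affected"}]
```

With `CHANGES`, 4.20.0-rc1 inherits "unaffected" from the 4.19.198 entry; with `WITH_RC` it is reported as "affected".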

BTW, for the Linux kernel example above only the seven fixed branches seem to be tracked. The sum total of Affected versions (aggregated from those 7 ids in OSV) would miss any version from an unmaintained Linux kernel branch (such as 5.12.10). However using the suggested record format and the algorithm querying the SCM (git repo) on git commit ids, one would in theory correctly identify 5.12.10 as affected.

rsc commented 3 years ago

@rsc Wouldn't this be essentially restricting the use of existing versionAffected to '>=', '!>='? If that restriction yields less ambiguous and more machinable records then reduction in expressibility is ok.

I suppose it's restricting the use to purely a sequence of '>=', with the rule that later entries override earlier ones. And yes, I think that that restriction makes the records easier to interpret and probably also easier to write.

Isn't the comparison here still the version-tree (directed acyclic graph) based comparison?

Yes, the comparison has to be defined by the 'type' entry in the range object. If the type is 'semver' then https://semver.org defines ordering. If the type is 'git' then ordering can only be checked with respect to the actual repo. And we can define other numeric types (I assumed a 'linux' type above) as needed. We might want to define a 'dotted' type that is only for dot-separated numbers, with the obvious meaning. (All the subtlety about semver etc happens when you get to variations like 1.2-3 or 1.2rc5.)

the list has to be first sorted on start versions (easy).

Agreed.

should have at least one entry for the start of every branch, if the previous branch had a fix, and this has to be first version of that branch (hard, because not everyone may recollect the first version in a branch).

Agreed. And that really is a concern, but we could potentially define that in the semver ordering you can write 4.20 (no third number) to mean anything starting with 4.20, including prereleases.

BTW, for the Linux kernel example above only the seven fixed branches seem to be tracked. The sum total of Affected versions (aggregated from those 7 ids in OSV) would miss any version from an unmaintained Linux kernel branch (such as 5.12.10).

Yes, I agree with that. I don't think the 7 different IDs are a good approach. It actually makes it almost impossible to say what is and is not affected. @oliverchang is going to talk to the UVI team about why they chose that approach. We should strive for a single ID in CVE.

However using the suggested record format and the algorithm querying the SCM (git repo) on git commit ids, one would in theory correctly identify 5.12.10 as affected.

Yes, and one of the things we hope OSV will be able to contribute to the CVE ecosystem once data is in this format is suggesting updates where the git commits indicate that the numeric version ranges can be made more precise.

rsc commented 3 years ago

Regarding "sorted on start versions (easy)":

I hope that CVE records will be written with sorted lists anyway, perhaps with automation to keep them sorted, but I agree that clients should be expected to sort too.

(Technically speaking it is not necessary for the client to sort, only to find the status line with the largest version <= the version being checked. That's O(n) instead of O(n log n). But I think it is fine to say that clients should behave as if they sorted the list and leave not sorting as an optimization.)
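
A sketch of that single-pass variant, assuming dotted-integer versions: instead of sorting, it keeps the entry with the largest start that does not exceed the version being checked. The names are hypothetical.

```python
def parse(v):
    """Parse 'v4.19.198' into a tuple of ints (assumed numeric ordering)."""
    return tuple(int(p) for p in v.lstrip("v").split("."))

def status_without_sorting(initial_status, status_changes, version):
    """O(n) lookup: pick the change with the largest start <= version."""
    target = parse(version)
    best = None  # (parsed start, status) of the best candidate so far
    for c in status_changes:
        start = parse(c["start"])
        if start <= target and (best is None or start > best[0]):
            best = (start, c["status"])
    return best[1] if best else initial_status

# Deliberately unsorted input to show that order does not matter.
UNSORTED = [
    {"start": "v5.13.4", "status": "unaffected"},
    {"start": "v3.16", "status": "affected"},
    {"start": "v4.20", "status": "affected"},
    {"start": "v4.19.198", "status": "unaffected"},
]
```

The result is the same as sorting first and taking the last applicable entry, provided no two entries share the same start.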

Most version numbering systems have a clear linear ordering: v1.2.3 before v1.2.4 before v1.3.0 before v2.0.0. Sorting is indeed easy there.

For a Git commit graph, all we can do is sort by topological order (parents before children). That's still easy, it's just important to recognize it as not quite normal sorting. The algorithm and the data format still make sense for this kind of directed acyclic graph. For example the Git commit ranges for CVE-2021-33909 would be written:

type: git
repo: https://url
initialStatus: unaffected
statusChanges: [
    {status: affected, start: 058504edd02667eef8fac9be27ab3ea74332e9b4}
    {status: unaffected, start: 3533e50cbee8ff086bfa04176ac42a01ee3db37d}
    {status: unaffected, start: c5157b3e775dac31d51b11f993a06a84dc11fc8c}
    {status: unaffected, start: 3c07d1335d17ae0411101024de438dbc3734e992}
    {status: unaffected, start: 6de9f0bf7cacc772a618699f9ed5c9f6fca58a1d}
    {status: unaffected, start: c1dafbb26164f43f2bb70bee9e5c4e1cad228ca7}
    {status: unaffected, start: 174c34d9cda1b5818419b8f5a332ced10755e52f}
    {status: unaffected, start: 71de462034c69525a5049fbdf3903c5833cbce04}
]

This turns out to be a clear improvement over the original ranges, because you don't have to say the commit that introduced the bug 7 times.
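
To illustrate how the same loop works on a commit graph, here is a sketch over a toy in-memory history, where "version >= start" is read as "start is the commit itself or one of its ancestors". The commit names and graph are invented for illustration; against a real repository the ancestor test could be delegated to something like `git merge-base --is-ancestor`.

```python
# Hypothetical toy history: each commit maps to its list of parents.
PARENTS = {
    "root": [],
    "old": ["root"],           # branch cut before the bug was introduced
    "intro": ["root"],         # commit that introduces the bug
    "main1": ["intro"],
    "fix_main": ["main1"],     # fix on the main line
    "main2": ["fix_main"],
    "stable1": ["intro"],      # stable branch cut after the bug
    "fix_stable": ["stable1"], # backported fix
    "stable2": ["fix_stable"],
}

def is_ancestor(ancestor, commit):
    """True if `ancestor` equals `commit` or is reachable via parent links."""
    stack, seen = [commit], set()
    while stack:
        c = stack.pop()
        if c == ancestor:
            return True
        if c not in seen:
            seen.add(c)
            stack.extend(PARENTS[c])
    return False

def git_status(commit, initial_status, status_changes):
    """Timeline walk; entries must be listed parents-before-children."""
    status = initial_status
    for change in status_changes:
        if is_ancestor(change["start"], commit):
            status = change["status"]
    return status

CHANGES = [
    {"status": "affected", "start": "intro"},
    {"status": "unaffected", "start": "fix_main"},
    {"status": "unaffected", "start": "fix_stable"},
]
```

The introducing commit appears once, and each branch contributes only its own fix commit.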

ElectricNroff commented 3 years ago

Maybe the best approach is to have multiple options for expressing version information, depending (in part) on whether the product has a support policy (explicit or implied). The type of information submitted to the CVE Program tends to have a bifurcation depending on whether a support policy exists, even when the existence of a support policy is not mentioned within the vulnerability announcement itself.

Although CVE is not really "about" prescriptive information from vendors, it may be more likely for vendors to participate if the information displayed in CVE Records, and the information available to CVE-based tools, is closely aligned to what the vendor provides directly to customers, either within vulnerability announcements or during customer-support interactions. In other words, the approach potentially helps with CVE adoption.

The hope is to develop the best practical algorithm within the context of what data providers have traditionally been willing to submit to the CVE Program. It should avoid soliciting extra information such as "{start: v4.20, status: affected}" which, in practice, is very rare to see from program participants. For example, many people who rely on the 4.19.* longterm-supported Linux kernel series are unaware of whether 4.20.x ever existed (or whether 5.0 came right after a 4.19.x version). Similarly, if a vulnerability announcement mentions a 3.4.x fix and a 3.6.x fix, does that mean that 3.5.x is "affected" and potentially important, or does it mean that odd minor-version numbers are never visible outside of the development staff?

CVE Records are for vulnerabilities in released software. For purposes of CVE, it is not necessary to state which commits are associated with the vulnerability lifecycle, or to express whether any specific pre-release software came before or after a released version.

Here is a very rough outline of how the schema could accept four different major types of version specification.

There is a support policy, and semver is used. The information should be expressed as a series of assessedSemverRegexp items.

Semantics:

If the consumer's product version does not match any of the assessedSemverRegexp regular expressions, then the output of the algorithm is the word Unsupported. This means that the vendor is recommending against use of that version. For vulnerability management purposes, this may be treated the same as the word Affected.

Otherwise, if one regular expression is matched, and assessmentPending is found, then the output of the algorithm is the word Unknown.
Otherwise, if one regular expression is matched, and the consumer's product version is greater than or equal to the fixedStartingFrom value, then the output of the algorithm is the word Fixed.
Otherwise, if one regular expression is matched, and the consumer's product version is within any specified otherUnaffected range, then the output of the algorithm is the word Fixed.
Otherwise, the output of the algorithm is the word Affected.

Note: otherUnaffected is optional. Although producers are free to choose their own use cases, the envisioned primary use case is a situation where the vulnerability was introduced in a very recent version. Thus, there are expected to be many customer deployments that are completely safe (e.g., not affected by any CVE or any vulnerability that was silently fixed by the vendor), and therefore it's a waste of customer effort to trigger updates. In one example below, only 4.9.359 was affected. Commercial software vendors typically only express the version numbers of new versions that have fixed a vulnerability. From the perspective of many commercial software vendors, a vulnerability announcement has two purposes: to protect customers from attacks, and to lower support costs by reducing the variety of versions deployed in the field.

example with only one assessedSemverRegexp item

{"assessedSemverRegexp": ".", "fixedStartingFrom": "20.1.34"}

example with multiple assessedSemverRegexp items

{"assessedSemverRegexp": "^5\.", "fixedStartingFrom": "5.0.0"}
{"assessedSemverRegexp": "^4\.14\.", "fixedStartingFrom": "4.14.250", "otherUnaffected": [{"semverBegin": "4.14.0", "semverEnd": "4.14.0"}, {"semverBegin": "4.14.50", "semverEnd": "4.14.89"}]}
{"assessedSemverRegexp": "^4\.9\.", "fixedStartingFrom": "4.9.360", "otherUnaffected": [{"semverBegin": "4.9.0", "semverEnd": "4.9.358"}]}
{"assessedSemverRegexp": "^4\.4\.", "assessmentPending": true}
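
The semantics for this first case could be sketched as follows, using the multi-item example above. The sketch assumes plain numeric x.y.z versions (no pre-release tags), treats semverBegin/semverEnd as inclusive bounds, and uses the first item whose regular expression matches; the function names are hypothetical.

```python
import re

def parse(v):
    """Parse a numeric x.y.z version into a tuple of ints (assumed format)."""
    return tuple(int(p) for p in v.split("."))

def assess(version, items):
    """Return Unsupported / Unknown / Fixed / Affected for one version."""
    v = parse(version)
    for item in items:
        if not re.search(item["assessedSemverRegexp"], version):
            continue
        if item.get("assessmentPending"):
            return "Unknown"
        if v >= parse(item["fixedStartingFrom"]):
            return "Fixed"
        for r in item.get("otherUnaffected", []):
            if parse(r["semverBegin"]) <= v <= parse(r["semverEnd"]):
                return "Fixed"
        return "Affected"
    # No regular expression matched: the vendor does not assess this series.
    return "Unsupported"

ITEMS = [
    {"assessedSemverRegexp": r"^5\.", "fixedStartingFrom": "5.0.0"},
    {"assessedSemverRegexp": r"^4\.14\.", "fixedStartingFrom": "4.14.250",
     "otherUnaffected": [{"semverBegin": "4.14.0", "semverEnd": "4.14.0"},
                         {"semverBegin": "4.14.50", "semverEnd": "4.14.89"}]},
    {"assessedSemverRegexp": r"^4\.9\.", "fixedStartingFrom": "4.9.360",
     "otherUnaffected": [{"semverBegin": "4.9.0", "semverEnd": "4.9.358"}]},
    {"assessedSemverRegexp": r"^4\.4\.", "assessmentPending": True},
]
```

Under these assumptions 4.9.359 is the only version in the 4.9 series reported as Affected.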

There is a support policy, but semver is not used.

Semantics: if the customer's product version does not equal any of the assessedBaseVersion values, then the output of the algorithm is the word Unsupported. For vulnerability management, this may be treated the same as the word Affected. Otherwise, if the customer's product version equals one of the updateOptions values, or equals one of the otherUnaffected values, then the output of the algorithm is the word Fixed. Otherwise, the output of the algorithm is the word Affected. Clearly, vendors who don't (or can't) provide updateOptions values will trigger many false positives (if the CVE List is the sole data source for vulnerability assessment).

This is primarily for vendors who submit CVE Records that state a set of product versions, each of which may be vulnerable depending on whether an update action has occurred (e.g., installing a service pack, fix pack, hotfix, patch, etc.). In many cases, the CVE Record does not fully describe the update action (possibly because that action is dynamically chosen based on details of a customer environment). Thus, updateOptions (a set of update actions, any of which is sufficient to fix the vulnerability) can be specified, but is optional.

example in which updateOptions is not provided

{"assessedBaseVersion": "2.0"}
{"assessedBaseVersion": "3.0"}
{"assessedBaseVersion": "3.5"}

examples in which updateOptions is provided

{"assessedBaseVersion": "3.0", "updateOptions": ["3.0 HF17", "3.0 SP1 HF6"], "otherUnaffected": ["3.0 HF1", "3.0 HF2", "3.0 HF3"]}

{"assessedBaseVersion": "10", "updateOptions": ["October 2021 monthly updates", "23456"]}

There is no known support policy. The data provider simply specifies what test cases were considered, and what happened. In general, a "test case" can be any mechanism (e.g., observing runtime behavior, or assessing the source code or executable code) that has the possibility of identifying an affected version.

Semantics

If the consumer's product version was tested and found to be affected, then the output of the algorithm is the word Affected. If the consumer's product version was tested and found to be not affected, then the output of the algorithm is the word Fixed. Otherwise, the output of the algorithm is the word Unknown.

A. examples that may be typical of automated testing

{"semverTestCases": [{"semverBegin": "3.0.0", "semverEnd": "3.15.12"}, {"semverBegin": "4.0.0", "semverEnd": "4.3.8"}], "affected": ["3.3.1", "3.3.2", "3.3.3", "3.3.4"]}
{"semverTestCases": [{"semverBegin": "1.0.0", "semverEnd": "22.3.1"}], "affected": ["5.6.2"]}

B. examples that may be typical of manual testing

{"semverTestCases": [{"semverBegin": "4.0.6", "semverEnd": "4.0.6"}, {"semverBegin": "5.0.3", "semverEnd": "5.0.3"}], "affected": ["4.0.6", "5.0.3"]}
{"semverTestCases": [{"semverBegin": "4.0.6", "semverEnd": "4.0.6"}, {"semverBegin": "5.0.0", "semverEnd": "5.0.2"}], "affected": ["4.0.6"]}
{"miscTestCases": ["Zeta", "January 2024", "LMNOP"], "affected": ["Zeta", "January 2024", "LMNOP"]}
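The test-case semantics above could be sketched in Python roughly as follows. This is only an illustration, not part of the proposal: the `assess` and `parse_semver` names are made up here, semverBegin and semverEnd are assumed to be inclusive (the manual-testing examples use identical begin and end values), and pre-release tags are not handled.

```python
def parse_semver(text):
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints (pre-release tags not handled)."""
    return tuple(int(part) for part in text.split("."))


def assess(record, version):
    """Return 'Affected', 'Fixed', or 'Unknown' for one consumer version string."""
    # Opaque (non-semver) test cases are matched by exact string.
    tested = version in record.get("miscTestCases", [])
    try:
        key = parse_semver(version)
    except ValueError:
        key = None  # not a semver, so it cannot fall inside a semver test range
    if key is not None:
        for case in record.get("semverTestCases", []):
            # Assumption: both ends of a test range are inclusive.
            if parse_semver(case["semverBegin"]) <= key <= parse_semver(case["semverEnd"]):
                tested = True
    if not tested:
        return "Unknown"
    return "Affected" if version in record.get("affected", []) else "Fixed"


automated = {
    "semverTestCases": [
        {"semverBegin": "3.0.0", "semverEnd": "3.15.12"},
        {"semverBegin": "4.0.0", "semverEnd": "4.3.8"},
    ],
    "affected": ["3.3.1", "3.3.2", "3.3.3", "3.3.4"],
}
manual = {
    "miscTestCases": ["Zeta", "January 2024", "LMNOP"],
    "affected": ["Zeta", "January 2024", "LMNOP"],
}

print(assess(automated, "3.3.2"))  # Affected
print(assess(automated, "3.9.0"))  # Fixed
print(assess(automated, "5.0.0"))  # Unknown
print(assess(manual, "Zeta"))      # Affected
```

Note that a version outside every tested range yields Unknown rather than Fixed, which is the distinguishing feature of this scenario.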

A data provider is not offering complete information, and mainly wishes to comment about the existence of one (typically) highest-numbered fixed version. Often, the data provider wishes to alert the public that a vulnerability was found and knows that one version number, but the data provider does not have the resources to convey information about the range of affected versions. Also, they are not going to wait until the fix is backported to an older version series. For example, this scenario can occur when the data provider is producing a CVE Record on the basis of reading one changelog or Release Notes document, and has no other information about the vulnerability lifecycle. (The data provider may have a rough idea that they can comment on, e.g., a vulnerability in a feature that was added recently.) It is implied that the most recent released version before the fixed version is one vulnerable version, but the data provider is not required to know or convey that version number in the specificAffected field.

Semantics

If the consumer's product version is a semver on the unaffectedSemverList, or a later semver, or a version on an unaffectedList, then the output of the algorithm is the word Fixed. Otherwise, if the consumer's product version is in the specificAffected field, then the output of the algorithm is the word Affected. Otherwise the output of the algorithm is the word Unknown (possibly accompanied by a comment).

{"unaffectedSemverList": ["5.12.16"], "specificAffected": [], "commentOnAffected": "at least one earlier version"}
{"unaffectedSemverList": ["5.12.16"], "specificAffected": [], "commentOnAffected": "likely to be few earlier versions"}
{"unaffectedSemverList": ["5.12.16"], "specificAffected": [], "commentOnAffected": "likely to be many earlier versions"}
{"unaffectedList": ["Phi"], "specificAffected": ["Upsilon", "Tau"], "commentOnAffected": "likely to be few earlier versions"}
{"unaffectedList": ["Phi"], "specificAffected": ["Upsilon", "Tau"], "commentOnAffected": "likely to be many earlier versions"}
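A minimal Python sketch of these semantics, under stated assumptions: the `assess` and `parse_semver` names are invented for illustration, "a later semver" is read as "greater than or equal to any entry on unaffectedSemverList", opaque names are matched by exact string, and pre-release tags are ignored.

```python
def parse_semver(text):
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints (pre-release tags not handled)."""
    return tuple(int(part) for part in text.split("."))


def assess(record, version):
    """Return 'Fixed', 'Affected', or 'Unknown' for one consumer version string."""
    if version in record.get("unaffectedList", []):
        return "Fixed"
    try:
        key = parse_semver(version)
    except ValueError:
        key = None  # opaque version name; only exact matches apply
    if key is not None:
        for fixed in record.get("unaffectedSemverList", []):
            # The listed fixed version and every later semver count as Fixed.
            if key >= parse_semver(fixed):
                return "Fixed"
    if version in record.get("specificAffected", []):
        return "Affected"
    # A caller could surface record.get("commentOnAffected") alongside this result.
    return "Unknown"


semver_record = {
    "unaffectedSemverList": ["5.12.16"],
    "specificAffected": [],
    "commentOnAffected": "at least one earlier version",
}
opaque_record = {
    "unaffectedList": ["Phi"],
    "specificAffected": ["Upsilon", "Tau"],
    "commentOnAffected": "likely to be few earlier versions",
}

print(assess(semver_record, "5.13.0"))   # Fixed
print(assess(semver_record, "5.12.15"))  # Unknown
print(assess(opaque_record, "Tau"))      # Affected
```

The second call shows the cost of this scenario: the release just before the fix is reported as Unknown unless the data provider lists it in specificAffected.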

rsc commented 3 years ago

The hope is to develop the best practical algorithm within the context of what data providers have traditionally been willing to submit to the CVE Program.

For what it's worth, this seems self-defeating to me. Yes, we have to be able to cope with what vendors provide, but for vulnerability management to scale industry-wide, we also need to encourage more precise data than the current English text.

I think the ideas of comments and suggested upgrades are interesting, but those could be added to the proposed object in a separate discussion. (This is definitely a benefit of an object.)

Finally, speaking from experience, regular expressions are not a good answer: they are far too easy to embed subtle bugs in and too hard to scrutinize for those bugs. We should probably avoid them here.

zmanion commented 3 years ago

A few comments, some of which have already been discussed but I didn't see a clear decision:

A vulnerability in an upstream dependency may or may not be inherited or transitive to the software that imports the dependency. If the dependency is fully imported, then in a strict sense, yes, the vulnerability is also imported. The vulnerability may or may not be exposed or exploitable, depending on how the dependency is used in context. Or part of the dependency might be imported, with or without the vulnerability. So "vulnerable upstream dependency identified" is useful but can't be assumed to mean the subject at hand is affected.
I like the idea of the separate lists for "Tested" and "Affected". I think it was made clear that a version or range not listed in tested has clear meaning: that version or range was not tested. I'm not sure I'd trust a claim of "tested and found not affected" but at least the semantics can be clear. "Not affected" is trickier. What does the absence of a version or range in the "Affected" list mean? I'd like to assume vendors/suppliers/projects thoroughly investigate all (currently supported) versions, and with that assumption I could interpret "not listed as affected" as "not affected." But the assumption of comprehensive testing may be flawed.

Another approach to the "Tested" list is to just stick with affected/not affected but identify the subject of the claim. Researcher can state that "version 1.1 is affected" and supplier/vendor/project can state "version 1.1 is not affected" and I can parse out that there's a disagreement and I need to go investigate. This avoids giving the vendor/project/supplier ultimate authority in the claim, in that researcher testing is inferior to vendor statements (this might often be true, but not always, to a non-trivial degree).

If comprehensive testing (i.e., not listed as affected == not affected) is not assumed, then a way to convey "Not affected" is useful. In this model, an unlisted version implies nothing; there needs to be an explicit statement of affected or not.

And another list, "Supported" (and possibly unsupported).

As a consumer of this information, I'd like to know who is making the claim, what version/ranges are affected, what are not, what is unknown, and what is unsupported (or wontfix).

rsc commented 3 years ago

The tested/affected separation was partly to have two different default statuses (untested/unknown for tested, unaffected for affected), but that ended up more confusing than helpful. Instead in the latest suggestions there is always an explicit status, which can be unaffected/affected/unknown. We could potentially think about adding an explicit unsupported, since that seems to be the most common reason for unknown.

I believe the information about who is making the claim is supposed to be from 'requester' elsewhere in the record, and then there is the adpContainer for extra statements by others. If additional clarity is needed around authorship, it seems like that should be a separate issue discussion from version details.

rsc commented 3 years ago

Hello all. I took away from the discussion at the last QWG meeting that:

Generally people seem to appreciate the timeline form.
People want a way to make clear explicitly when a particular timeline ends, such as the 3.0 branch ending at 3.∞.

As I noted before, the trick is to balance (1) defining something precise enough for an algorithm to implement, and (2) defining something clear enough that people writing these records - and also the people implementing the algorithm - get it right.

New schema

Here is a new potential schema incorporating that feedback and that I hope is still a reasonable balance of (1) and (2):

versions: [{
    version: $version
    status: $status  // unknown, affected, unaffected; unsupported?

    range: string (‘semver’, ‘git’, ..., to define meaning of <)
    repo: string (optional for range ‘git’)
    limit: $versionLimit (this range stops just before limit; can use * for “infinity” aka "maxuint")
    changes: [{
        at: version where status changes
        status: ...
    }]
}]

An object in the versions list can be either:

a simple {version: V, status: S}, which indicates the status of the single version V.
a range {version: V, range: R, limit: L, status: S, changes: C}, which indicates the status of the half-open interval [V, L) (that is, V is included but L is not). The range starts with V having status S and then changes over time according to the events listed in C.

The algorithm for deciding the status of a particular version V is then:

for entry in versions
    if entry.limit is not present and v == entry.version
        return entry.status
    if entry.limit is present and entry.version <= v and v < entry.limit
        status = entry.status
        for change in entry.changes
            if v >= change.at
                status = change.status
        return status

return “unknown”

The rest of this comment gives worked examples for the cases in Chandan’s presentation as well as a Git-based case that UVI wants to be able to encode that was part of the motivation for the previous iteration of the schema.

Single branch

versions: [
    {
        version: 1.1, limit: 1.*, range: semver,
        status: affected, 
        changes: [
            {at: 1.6, status: unaffected}
        ]
    }
]
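As a rough sketch (not a reference implementation), the algorithm and the single-branch record above could be exercised in Python as follows. The `parse` helper and its treatment of `*` as positive infinity are assumptions made for illustration. The range test is read as entry.version <= v < entry.limit, matching the half-open interval [V, L) described earlier. Pre-release tags are not handled.

```python
import math


def parse(text):
    """Turn '1.6', '1.*' or '*' into a comparable 3-tuple; '*' sorts above any number."""
    parts = text.split(".")
    key = []
    for part in parts:
        if part == "*":
            break
        key.append(int(part))
    fill = math.inf if "*" in parts else 0
    return tuple(key + [fill] * (3 - len(key)))


def status_of(versions, v):
    """Walk the versions list and return the status that applies to version v."""
    key = parse(v)
    for entry in versions:
        if "limit" not in entry:
            # Simple entry: matches exactly one version.
            if key == parse(entry["version"]):
                return entry["status"]
            continue
        if parse(entry["version"]) <= key < parse(entry["limit"]):
            status = entry["status"]
            # Later changes override earlier ones as v moves along the timeline.
            for change in entry.get("changes", []):
                if key >= parse(change["at"]):
                    status = change["status"]
            return status
    return "unknown"


single_branch = [{
    "version": "1.1", "limit": "1.*", "range": "semver",
    "status": "affected",
    "changes": [{"at": "1.6", "status": "unaffected"}],
}]

print(status_of(single_branch, "1.3"))   # affected
print(status_of(single_branch, "1.10"))  # unaffected
print(status_of(single_branch, "2.0"))   # unknown
```

Mapping `*` to infinity keeps the comparison a plain tuple comparison, so `1.*` acts as an upper bound for every 1.x version without special cases in the loop.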

Single branch, two transitions

versions: [
    {
        version: 1.1, limit: 1.*, range: semver,
        status: unknown, 
        changes: [
            {at: 1.4, status: affected},
            {at: 1.6, status: unaffected}
        ]
    }
]

Notes:

If it were desirable in this instance to exclude 1.10 from having a determination, then limit: 1.* would become limit: 1.9.

Three branches (variant 1)

versions: [
    {
        version: 3.0, limit: 3.*, range: semver,
        status: affected, 
        changes: [{at: 3.4, status: unaffected}]
    },
    {
        version: 4.0, limit: 4.*, range: semver,
        status: affected, 
    },
    {
        version: 5.0, limit: *, range: semver,
        status: affected, 
        changes: [{at: 5.2, status: unaffected}]
    }
]

Notes:

In this case, the 4.0 branch is being marked explicitly affected. If the middle version object were removed, then 4.0 would fall back to “unknown”, the default for anything not listed.
If the 4.0 branch was out of support, then we might consider an “unsupported” status that could be used for this branch instead.
In this case, the 5.0 version range object has a limit of “*”, so it applies to all higher versions, including 6.0, 7.0, and so on. If instead the CVE wanted to claim only that 5 was fixed and not make any statements about 6, it could use “5.*” as the limit.

Three branches (variant 2)

versions: [
    {
        version: 3.0, limit: 3.*, range: semver,
        status: unaffected, 
        changes: [
            {at: 3.3, status: affected},
            {at: 3.5, status: unaffected}
        ]
    },
    {
        version: 4.0, limit: 4.*, range: semver,
        status: unaffected, 
    },
    {
        version: 5.0, limit: *, range: semver,
        status: unaffected, 
        changes: [
            {at: 5.2, status: affected},
            {at: 5.4, status: unaffected}
        ]
    }
]

Opaque versions with patch lines

versions: [
    {
        version: 3.0, limit: 3.0-*, range: patch,
        status: unaffected, 
        changes: [
            {at: 3.0-patch-C, status: affected},
            {at: 3.0-patch-E, status: unaffected}
        ]
    },
    {
        version: 4.0, limit: 4.0-*, range: semver,
        status: unaffected, 
    },
    {
        version: 5.0, limit: 5.0-*, range: semver,
        status: unaffected, 
        changes: [
            {at: 5.0-patch-A, status: affected},
            {at: 5.0-patch-C, status: unaffected}
        ]
    }
]

Notes:

The version string used in a ‘version’ or ‘limit’ or ‘at’ field should uniquely identify a particular version, so that it is easy to copy these to other places and not lose important context. So, for example, I’ve used “3.0-patch-C” instead of just “patch-C”.
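Because the range type is what defines the meaning of "<", a non-semver scheme needs its own ordering. The sketch below shows one hypothetical ordering for the "patch" style used in the first entry above. The `patch_key` function and the rule that patch labels sort alphabetically are assumptions for illustration; a real vendor scheme would have to specify its own ordering.

```python
def patch_key(text):
    """Order '3.0' < '3.0-patch-A' < '3.0-patch-B' < ... < '3.0-*'."""
    base, _, suffix = text.partition("-")
    numbers = tuple(int(part) for part in base.split("."))
    if suffix == "":
        return (numbers, 0, "")   # the unpatched base release sorts first
    if suffix == "*":
        return (numbers, 2, "")   # wildcard sorts after every patch of this base
    return (numbers, 1, suffix)   # e.g. 'patch-C', compared alphabetically


def status_of(versions, v):
    """Same timeline walk as the algorithm above, using patch_key for comparisons."""
    key = patch_key(v)
    for entry in versions:
        if "limit" not in entry:
            if key == patch_key(entry["version"]):
                return entry["status"]
            continue
        if patch_key(entry["version"]) <= key < patch_key(entry["limit"]):
            status = entry["status"]
            for change in entry.get("changes", []):
                if key >= patch_key(change["at"]):
                    status = change["status"]
            return status
    return "unknown"


patch_line = [{
    "version": "3.0", "limit": "3.0-*", "range": "patch",
    "status": "unaffected",
    "changes": [
        {"at": "3.0-patch-C", "status": "affected"},
        {"at": "3.0-patch-E", "status": "unaffected"},
    ],
}]

print(status_of(patch_line, "3.0-patch-B"))  # unaffected
print(status_of(patch_line, "3.0-patch-D"))  # affected
print(status_of(patch_line, "3.0-patch-E"))  # unaffected
```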

Git versions

We also want this to work well for version control revision information. Here is a simplified version of the Linux bug:

The bug was introduced in commit 1234, which was first released in v3.16. It was later fixed twice, in 4567 which landed in v4.19.198 and in 6789 which landed in v5.13.4.

We can represent this situation with:

versions: [
    {
        version: 3.0, limit: 3.*, range: linux,
        status: affected, 
    },
    {
        version: 4.19, limit: 4.19.*, range: linux,
        status: affected,
        changes: [{at: 4.19.198, status: unaffected}]
    },
    {
        version: 5.13, limit: 5.13.*, range: linux,
        status: affected,
        changes: [{at: 5.13.4, status: unaffected}]
    },
    {
        version: 1234, range: git,
        repo: https://github.com/torvalds/linux, 
        status: affected,
        changes: [
            {at: 4567, status: unaffected},
            {at: 6789, status: unaffected}
        ]
    }
]

The last version object describes the precise git commit ranges. Anything after hash 1234 is affected, except that commits starting at 4567 and at 6789 (on different branches) are unaffected. This makes clear that future extensions of the v4 and v5 branch are unaffected, while commit 7890 is still affected. This encoding is the way most vulnerabilities with a single introduction but multiple branched fixes would encode the version control graph.
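For a git range, "v >= X" can be read as "X is v itself or an ancestor of v". The sketch below applies that reading to the last version object, using a toy commit graph. The graph shape and the commits 1000, 4000, 4600, 5000 and 6800 are invented for illustration; only 1234, 4567, 6789 and 7890 come from the description above. It also assumes that a git range without a limit is unbounded, as the prose describes.

```python
# Hypothetical commit graph: each commit maps to its list of parents.
PARENTS = {
    "1000": [],         # before the bug was introduced
    "1234": ["1000"],   # introduces the bug
    "4000": ["1234"],   # v4 branch, not yet fixed
    "4567": ["4000"],   # fix on the v4 branch
    "4600": ["4567"],   # later v4 commit
    "5000": ["1234"],   # v5 branch, not yet fixed
    "6789": ["5000"],   # fix on the v5 branch
    "6800": ["6789"],   # later v5 commit
    "7890": ["1234"],   # another branch that never received a fix
}


def is_ancestor_or_same(ancestor, commit):
    """True if `ancestor` is `commit` or can be reached from it by following parents."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return False


def git_status(entry, commit):
    """Status of `commit` under one unbounded git range entry."""
    if not is_ancestor_or_same(entry["version"], commit):
        return "unknown"  # the commit is not in the range that starts at entry.version
    status = entry["status"]
    for change in entry.get("changes", []):
        if is_ancestor_or_same(change["at"], commit):
            status = change["status"]
    return status


linux_entry = {
    "version": "1234", "range": "git", "status": "affected",
    "changes": [
        {"at": "4567", "status": "unaffected"},
        {"at": "6789", "status": "unaffected"},
    ],
}

print(git_status(linux_entry, "4600"))  # unaffected
print(git_status(linux_entry, "7890"))  # affected
```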

For the specific case of Linux, the UVI project wants to treat vulnerabilities on different kernel version branches as completely different vulnerabilities, as a matter of policy, essentially treating different kernel versions as different products. (Although I think this is a mistake in this case, perhaps there are other contexts where it makes sense, so it’s worth examining how to do it.)

The obvious encoding is to write this in the vulnerability entry for the v4 “product”:

versions: [
    {
        version: 1234, range: git,
        status: affected,
        changes: [{at: 4567, status: unaffected}]
    }
]

And this for the vulnerability entry for the v5 “product”:

versions: [
    {
        version: 1234, range: git,
        status: affected,
        changes: [{at: 6789, status: unaffected}]
    }
]

The problem with this pair of vulnerability entries is that according to the v4 entry, 6789 is affected, and according to the v5 entry, 4567 is affected. So every kernel commit after 1234 is going to appear to be affected by at least one of these entries. Again, that’s the right default behavior: in the complete version in the previous example, we definitely want to identify 7890, on an unfixed branch, as affected. The problem here is that v5 appears to be an “unfixed branch” for the v4 vulnerability, and vice versa.

We can fix this problem by using limit (just like above) to limit the effect to a single branch. In this case, a limit L for a git range would mean the range only applies to commits that are on the branch leading to L (meaning they are ancestors of L). This is the same “only before” meaning of limit as in the semver limits.

That is, we can write:

versions: [
    {version: 1234, limit: 4567, range: git, status: affected},
    {version: 4567, range: git, status: unaffected},
]

and

versions: [
    {version: 1234, limit: 6789, range: git, status: affected},
    {version: 6789, range: git, status: unaffected},
]

This form has the downside of not making clear that 7890 and other off-v4, off-v5 commits are affected, which is why I think UVI’s policy is a mistake. But if that is the policy someone needs to encode, then the new limit field provides a way to do that.
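The effect of limit on a git range can be sketched on a small toy graph. Everything about the graph except the commit names 1234, 4567, 6789 and 7890 is an assumption for illustration, as is the choice to treat a git entry without a limit as an unbounded range.

```python
# Hypothetical commit graph: each commit maps to its list of parents.
PARENTS = {
    "1234": [],
    "4000": ["1234"],   # v4 branch, before the fix
    "4567": ["4000"],   # fix on the v4 branch
    "4600": ["4567"],   # later v4 commit
    "6789": ["1234"],   # fix on the v5 branch
    "7890": ["1234"],   # unfixed side branch
}


def is_ancestor_or_same(ancestor, commit):
    """True if `ancestor` is `commit` or can be reached from it by following parents."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return False


def git_status(versions, commit):
    """First matching entry wins; a limit restricts an entry to strict ancestors of the limit."""
    for entry in versions:
        if not is_ancestor_or_same(entry["version"], commit):
            continue
        limit = entry.get("limit")
        if limit is not None:
            on_branch = commit != limit and is_ancestor_or_same(commit, limit)
            if not on_branch:
                continue
        return entry["status"]
    return "unknown"


v4_record = [
    {"version": "1234", "limit": "4567", "range": "git", "status": "affected"},
    {"version": "4567", "range": "git", "status": "unaffected"},
]

print(git_status(v4_record, "4000"))  # affected
print(git_status(v4_record, "4600"))  # unaffected
print(git_status(v4_record, "6789"))  # unknown
print(git_status(v4_record, "7890"))  # unknown
```

The last two results show both sides of the trade-off: the v5 fix commit is no longer reported as affected by the v4 entry, but the unfixed side branch also drops to unknown.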

rsc commented 3 years ago

I have posted the schema pull request for reference, but discussion is probably better here than on the PR.

ElectricNroff commented 3 years ago

One concern about this timeline event model is that there's a race condition involving relevant anonymous events. This is perhaps hard to explain, so I've started with examples. I've also suggested a small change that can fix the problem in, at least, some realistic situations. The change is to stop hardcoding 'return "unknown"' at the end of the algorithm, and let the author of the CVE Record choose to return whatever valid status they want. I feel that this will make data entry easier and less error-prone, and probably increase the number of data providers willing to provide computable information.

Currently, a typical versions key can have:

versions: [
     {
        version: 3.0, limit: 3.*, range: semver,
        status: affected,
        changes: [{at: 3.4, status: unaffected}]
     },
     {
        version: 4.0, limit: 4.*, range: semver,
        status: affected,
     },
     {
         version: 5.0, limit: *, range: semver,
         status: affected,
         changes: [{at: 5.2, status: unaffected}]
     }
]

The small change is to put the array of entries inside an object:

versions: { default: myDefault, entries: 
[
     {
        version: 3.0, limit: 3.*, range: semver,
        status: affected,
        changes: [{at: 3.4, status: unaffected}]
     },
     {
        version: 4.0, limit: 4.*, range: semver,
        status: affected,
     },
     {
         version: 5.0, limit: *, range: semver,
         status: affected,
         changes: [{at: 5.2, status: unaffected}]
     }
]
}

Also, the bottom of the algorithm changes from:

return "unknown"

to:

if versions.default is present
    return versions.default
else
    return "unknown"

For example, consider the following realistic scenario. A vulnerability is being announced although no fix is yet shipping. The data provider knows the exact status of every version that has ever existed. Specifically, the vulnerability announcement states that 2.8.0 and later 2.8.x versions are affected, 3.0.0 and later 3.0.x versions are affected, and no others are (or will be) affected. It also states that a fix will be available later, and will be shipped with a version number of either 3.1.0 or 4.0.0 (those are the only two possibilities; it just depends on whether there will be an incompatible API change). Furthermore, it states that no more 2.x versions will be shipped (that series ended at 2.8.x) and no more 3.0.x versions will be shipped. Finally, it states that the fix (in either 3.1.0 or 4.0.0) will be effective going forward, because the entire problematic code component is being removed.

Apparently this could be expressed as:

versions: [
     {
        version: 0.0.0, limit: *, range: semver,
        status: unaffected,
        changes: [{at: 2.8.0, status: affected}, {at: 3.1.0, status: unaffected}]
     }
]

(or in less compact ways that have the same downsides). To express this, it was necessary to refer to two versions that may or may not be real (0.0.0 and 3.1.0). The algorithm always produces correct results. However, the CVE Record data is hard for a human to produce (they need to reason about the algorithm before ultimately deciding that those unconfirmed version numbers - 0.0.0 and 3.1.0 - are the best way forward). The CVE Record data is also potentially misleading to later human readers, who might think it implies that 3.1.0 was released even if the developers had actually decided to go with 4.0.0 instead of 3.1.0. Also, the SemVer specification is ambiguous about whether there is a reasonable way (such as 0.0.0) to express a lower bound (it says "The simplest thing to do is start your initial development release at 0.1.0" and 0.0.0-alpha is also a valid choice).

With the proposed change, the data provider can simply write:

versions: { default: unaffected, entries: 
[
     {
        version: 2.8.0, limit: 2.8.*, range: semver,
        status: affected,
     },
     {
        version: 3.0.0, limit: 3.0.*, range: semver,
        status: affected,
     }
]
}

Here, regardless of whether the fix is shipped in 3.1.0 or 4.0.0, the data provider has no need to ever update the CVE Record. The CVE Record only refers to real versions. It is simple to reason that this is a correct data representation for the algorithm.
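A minimal Python sketch of the algorithm with the proposed default key, applied to the record above. The `parse` helper and its handling of `*` as positive infinity are assumptions for illustration, and pre-release tags are not handled.

```python
import math


def parse(text):
    """Turn '2.8.0', '2.8.*' or '*' into a comparable 3-tuple; '*' sorts above any number."""
    parts = text.split(".")
    key = []
    for part in parts:
        if part == "*":
            break
        key.append(int(part))
    fill = math.inf if "*" in parts else 0
    return tuple(key + [fill] * (3 - len(key)))


def status_of(record, v):
    """Timeline algorithm whose final fallback is record['default'] when present."""
    key = parse(v)
    for entry in record["entries"]:
        if "limit" not in entry:
            if key == parse(entry["version"]):
                return entry["status"]
            continue
        if parse(entry["version"]) <= key < parse(entry["limit"]):
            status = entry["status"]
            for change in entry.get("changes", []):
                if key >= parse(change["at"]):
                    status = change["status"]
            return status
    # The proposed change: fall back to the author's default, else "unknown".
    return record.get("default", "unknown")


record = {"default": "unaffected", "entries": [
    {"version": "2.8.0", "limit": "2.8.*", "range": "semver", "status": "affected"},
    {"version": "3.0.0", "limit": "3.0.*", "range": "semver", "status": "affected"},
]}

print(status_of(record, "2.8.3"))  # affected
print(status_of(record, "3.1.0"))  # unaffected
print(status_of(record, "4.0.0"))  # unaffected
```

Whichever of 3.1.0 or 4.0.0 ends up shipping, both evaluate to unaffected without the record naming either one.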

To align this with the terminology introduced at the beginning of this comment:

"a fix will be available later, and will be shipped with a version number of either 3.1.0 or 4.0.0" is an anonymous event. The existence of this event is clearly relevant to the end of the 3.x affected series, but we don't yet know whether it's going to be named an "at 3.1.0 event" or named an "at 4.0.0" event.
It is, of course, completely normal for a CNA to publish a CVE Record before the fix is shipped, and for end users to begin to do vulnerability assessment on the basis of that CVE Record.
Now, one might argue that the anonymous event isn't relevant to these end users. Neither 3.1.0 nor 4.0.0 exists yet, and thus changes: [{at: 2.8.0, status: affected}] is sufficient for vulnerability assessment. Anyone running 2.8.0 or any later version is vulnerable at this point in the release cycle.
This, however, has a race condition between the software release process and the CVE Record update process. The people publishing a release (and the customers updating to that new release) might be much more diligent than the person maintaining the CVE Record, with the result that thousands of customers will get false positives starting from the day that the new release is published. Thus, it's a bad idea to ever have changes: [{at: 2.8.0, status: affected}] at the end of the timeline.
The CVE Record author can force an "unknown" result for everything after 3.0.x, but that's really not much better than the false positive. End users want a result of "unaffected" as soon as they update to the fixed version.
Of course, the CVE Record author can work around this by guessing 3.0.1 as the name of the anonymous event, but that's confusing both on the producer side and on the consumer side. And, conceptually, that guessing is useful to nobody. The vulnerability-assessment facts are completely known in advance: only 2.8.x and 3.0.x versions are vulnerable.

This proposed "default" key also has important use cases for other status values (not only for "unaffected"). If the working group decides to add "unsupported" to the valid status values, then any data provider could choose "unsupported" as their default in any CVE Record, in contexts where other data providers may have relied on "unknown" instead. (For example, the data provider implicitly relied on "unknown" for version 2.0 in the example at the top of this comment.)

pombredanne commented 3 years ago

I was not part of the discussion, so this may feel off topic; my comments below may be entirely obvious to you; if so, please ignore this!

I came to appreciate that version ranges can only ever be an approximation; and that a complete enumeration of all affected versions is the only correct statement; this was based on insightful comments by @oliverchang and @rsc made elsewhere.

IMHO there is no such thing as a "computable version identification" that works in all cases.

One possible exception may be crypto-bound closed version ranges like commit hashes. In all other cases I can fathom, affected and unaffected versions can be inserted in a range after the fact; a range may be resolved correctly as intended today; it may be incorrect tomorrow when new versions may be squeezed in the range even with semver: we are mere humans releasing software and we may deviate at times from whatever clean version range scheme we say we are using.

Because of this --for a vulnerability database that I co-maintain-- we are evolving our vulnerability data structures to store:

a concrete enumeration/list of known affected versions
potential affected version range(s) and a versioning scheme such as "semver"

Both are optional, and the enumeration is the only thing that is certain.

The ranges are hints for tools and humans to re-evaluate and update the concrete affected versions such as when there are new releases of the package or product at hand. And when this re-evaluation or review happens this can lead to:

the update of enumerated versions
the update of the version ranges proper

In practice, when there is a new version that is in an affected range and not yet enumerated, this means that the version MAY BE affected, short of other info. Until tested (by tools or by humans, fuzzing, code analysis or else) that's the best that can be said; and when tested, a version becomes enumerated.

I am suggesting using a similar approach and stop trying to make version ranges first class concepts. Rather:

use by default a simple enumeration of all affected versions, e.g. a simple array of strings.
also store version ranges as needed as hints to tools and humans to assist them with the processing of versions that are not yet enumerated

In this approach, it is OK to have no enumeration when we do not know yet (and for the shy vendor that does not want to disclose).

When users are reviewing vulnerabilities in their list of (package|products)/versions, they can get two bits of information:

a list of concrete vulnerable versions and the version(s) where the fix is applied
a list of potential vulnerable versions not in the first list but resolved in the range(s)

If the ranges are treated as hints (and not mixed with the concrete resolved list of versions), it is still important to get their updated grammar and syntax right, but this could become a lesser issue as this would NOT be the primary, default way to get versions... but just a hint.
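The enumeration-plus-hints approach might look like the following sketch. The advisory layout and the field names `affected_versions`, `range_hints`, `from` and `before` are hypothetical, chosen only to illustrate the two tiers of answer; only a "semver" scheme is handled.

```python
def parse_semver(text):
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints (pre-release tags not handled)."""
    return tuple(int(part) for part in text.split("."))


def review(advisory, version):
    """Classify a version as enumerated, merely inside a hinted range, or not listed."""
    if version in advisory.get("affected_versions", []):
        return "known affected"
    key = parse_semver(version)
    for hint in advisory.get("range_hints", []):
        if hint["scheme"] != "semver":
            continue  # other schemes would need their own comparison
        if parse_semver(hint["from"]) <= key < parse_semver(hint["before"]):
            return "may be affected"  # in range but not yet tested or enumerated
    return "not listed"


# Hypothetical advisory: two releases confirmed, one range kept as a hint.
advisory = {
    "affected_versions": ["2.8.0", "2.8.1"],
    "range_hints": [{"scheme": "semver", "from": "2.8.0", "before": "2.9.0"}],
}

print(review(advisory, "2.8.1"))  # known affected
print(review(advisory, "2.8.2"))  # may be affected
print(review(advisory, "3.0.0"))  # not listed
```

A new release such as 2.8.2 surfaces as "may be affected" until someone tests it and moves it into the enumeration.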

chandanbn commented 3 years ago

@pombredanne you are right. The aim here is to capture the hints in a way that is less ambiguous for tools and humans. There should be less room for misinterpretation with fewer false negatives and false positives.

For open-source projects with a public git repo, commit hashes, and tagged versions, an automated service can help generate (and refresh) a list of concrete vulnerable versions.

rsc commented 3 years ago

@pombredanne and @chandanbn, for what it's worth, I disagree that ranges are only human hints and can never be treated as precise by computers. It's true that you have to be careful to make them precise, and in particular you need to say what the numbering system is (versionType here) and have that system be well-defined. If it's not, then yes, the best you can do is an enumeration, perhaps sanity checked by a version control range.

In Go in particular (which uses semver numbering), it is possible to generate a semver version corresponding to each commit to a repo. It would not make sense to require a CVE to enumerate every single commit when a simple (and much shorter) range can be specified instead. But we could still have git ranges and semver ranges and cross-check the meaning of the semver ranges against the git ranges.

The required enumeration is also problematic for commercial software when a vendor wants to say "fixed in 5.2" and not enumerate all the prior versions that were affected. A range makes that easy to express. There may be no complete enumeration.

I agree that it can be a fine approach to do both the enumeration and the ranges and have some kind of automation to cross-check them - or a semver range and a git range, again cross-checked. That works especially well for open source. But I don't believe that approach can be required of every situation. (One thing I've come to appreciate from all these discussions is the sheer breadth of situations that CVE must be able to capture.)

rsc commented 3 years ago

@ElectricNroff, if the vendor has guaranteed all those things, I don't see a problem with the as-yet-nonexistent version 3.1.0 in:

versions: [
     {version: 0, limit: 3.0.*, range: semver, status: affected},
     {version: 3.1.0, limit: *, range: semver, status: unaffected}
]

Generally speaking, predicting the future is hard. Instead of layering additional ways to set down predictions about the future, it seems much better to make it easy for vendors to update their CVE records as new facts become known. After all, it is also true that customers may pressure the vendor to issue a fix in the 3.0 branch after all. No amount of encoding the future can account for actual changes to the expected future. Instead, we should make it easy for vendors to amend their CVE records. So it also seems fine if the vendor chooses to issue a CVE with:

versions: [
     {version: 0, limit: *, range: semver, status: affected}
]

and then amend the record later when fixes come out.

rsc commented 3 years ago

Changes in latest PR, based on Tuesday meeting discussion:

dropped mention of unsupported: that is a separate axis and can change over time in a way that affected/unaffected does not.
replaced 'limit' with two fields 'lessThan' and 'lessThanOrEqual', so that a range can choose whether to be half-open or not.
added 'defaultStatus' in the product object, to address @ElectricNroff's concern as well as the use case of saying 'all versions are unaffected except for the following enumeration of affected versions (not using any ranges at all)'.

rsc commented 3 years ago

Latest commit message summary:

The shorthand version of this schema is:

defaultStatus: $status
versions: [{
    version: $version
    status: $status  // unknown, affected, unaffected

    versionType: string (‘semver’, ‘git’, ..., to define meaning of <)
    repo: string (optional, intended for versionType ‘git’)
    lessThan/lessThanOrEqual: $version (can use * for “infinity” aka "maxuint")
    changes: [{
        at: version where status changes
        status: ...
    }]
}]

An object in the versions list can be either:

a simple {version: V, status: S}, which indicates the status of the single version V.
a range {version: V, versionType: T, lessThan: L OR lessThanOrEqual: LE, status: S, changes: C}, which indicates the status of the half-open interval [V, L) or closed interval [V, LE]. The range starts with V having status S and then changes over time according to the events listed in C.

The algorithm for deciding the status of a particular version V is then:

for entry in product.versions {
    if entry.lessThan is not present and entry.lessThanOrEqual is not present and v == entry.version {
        return entry.status
    }
    if (entry.lessThan is present and entry.version <= v and v < entry.lessThan) or
       (entry.lessThanOrEqual is present and entry.version <= v and v <= entry.lessThanOrEqual) {
        status = entry.status
        for change in entry.changes {
            if change.at <= v {
                status = change.status
            }
        }
        return status
    }
}
return product.defaultStatus

Fixes #87. Fixes #12. Fixes #77.
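For readers who want to experiment with the pseudocode above, here is a rough Python translation. It is a sketch under assumptions: the `parse` helper (semver-like only, `*` treated as positive infinity, no pre-release tags) is invented for illustration, the sample product data is hypothetical, and a missing defaultStatus is assumed to mean "unknown".

```python
import math


def parse(text):
    """Turn '3.4', '3.*' or '*' into a comparable 3-tuple; '*' sorts above any number."""
    parts = text.split(".")
    key = []
    for part in parts:
        if part == "*":
            break
        key.append(int(part))
    fill = math.inf if "*" in parts else 0
    return tuple(key + [fill] * (3 - len(key)))


def status_of(product, v):
    """Return the status of version v for one product object."""
    key = parse(v)
    for entry in product.get("versions", []):
        start = parse(entry["version"])
        if "lessThan" in entry:
            in_range = start <= key < parse(entry["lessThan"])          # [V, L)
        elif "lessThanOrEqual" in entry:
            in_range = start <= key <= parse(entry["lessThanOrEqual"])  # [V, LE]
        else:
            if key == start:  # simple single-version entry
                return entry["status"]
            continue
        if in_range:
            status = entry["status"]
            for change in entry.get("changes", []):
                if parse(change["at"]) <= key:
                    status = change["status"]
            return status
    # Assumption: an absent defaultStatus behaves like "unknown".
    return product.get("defaultStatus", "unknown")


# Hypothetical product data exercising all three entry shapes.
product = {
    "defaultStatus": "unknown",
    "versions": [
        {"version": "1.5.2", "lessThanOrEqual": "1.8", "versionType": "semver",
         "status": "affected"},
        {"version": "3.0", "lessThan": "3.*", "versionType": "semver",
         "status": "affected", "changes": [{"at": "3.4", "status": "unaffected"}]},
        {"version": "9.9", "status": "unaffected"},
    ],
}

print(status_of(product, "1.8"))  # affected
print(status_of(product, "3.4"))  # unaffected
print(status_of(product, "2.0"))  # unknown
```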

rsc commented 3 years ago

I also added 'custom' as a versionType that is not directly computable without further information. That will be necessary for upconverting the JSON 4.0 data.

chandanbn commented 3 years ago

If we are adding lessThan and lessThanOrEqual to allow up-converting <=, do we need a versionAfter to allow up-converting >? I feel we are complicating the structure for backwards compatibility.

rsc commented 3 years ago

I think it is probably important to rename limit to lessThan for clarity. I don't have a strong opinion on adding lessThanOrEqual or not: I will defer to you and others who understand how much weight to give up-converting issues.

I do observe that > is significantly less common in the 4.0 data than <=.

% cd cvelist
% git grep -E -h '"(version_)?affected"' | 
    sed 's/version_//; s/[  ][  ]*/ /g; s/,//' | 
    sort | 
    uniq -c | 
    sort -nr
12965  "affected": "<"
10495  "affected": "="
2608  "affected": "<="
1104  "affected": ">="
 298  "affected": "!>="
 211  "affected": "!=>"
 149  "affected": "!"
  98  "affected": "?>"
  82  "affected": "!<"
  42  "affected": "?"
  32  "affected": ""
  26  "affected": "?<="
  21  "affected": ">"
  11  "affected": "!>"
   9  "affected": "undefined"
   8  "affected": "?<"
   4  "affected": "=>"
   3  "affected": "2021.1.7316"
   3  "affected": "2021.1.7149"
   3  "affected": "2020.6.5146"
   3  "affected": "!<="
   2  "affected": "1.09"
   1  "affected": "?>="
   1  "affected": "=6.3.x"
   1  "affected": "<=7.1.3.1"
   1  "affected": "2020.6.4671"
   1  "affected": "2018.9.17"
   1  "affected": "10.16.3"
   1  "affected": "0.9"
   1  "affected": "!=<"
%

I spot-checked the "?>" entries and all the ones I looked at were Jenkins plugins that used the form:

                                        {
                                            "version_value": "1.8",
                                            "version_affected": "<="
                                        },
                                        {
                                            "version_value": "1.5.2",
                                            "version_affected": ">="
                                        },
                                        {
                                            "version_value": "1.8",
                                            "version_affected": "?>"
                                        }

The ?> could be dropped here since unknown would be the default anyway after saying affected in the range [1.5.2, 1.8] (using lessThanOrEqual).

I also looked at the > entries and many of them appear to be bugs. For example CVE-2021-0253 says

                                    {
                                        "platform": "NFX Series",
                                        "version_affected": ">",
                                        "version_name": "19.4",
                                        "version_value": "19.4R3"
                                    },

but https://kb.juniper.net/InfoCenter/index?page=content&id=JSA11146&actp=METADATA says clearly "19.4R3 and above", so this should be ">=".

So it does not seem like the case for versionAfter is anywhere near as strong as lessThanOrEqual.

chandanbn commented 3 years ago

Thank you for the stats! The numbers for >, !>, ?> are small enough that they can be flagged for up-conversion by hand. We don't need a versionAfter. The numbers for <= (lessThanOrEqual) are significant, though smaller than the counts for < and =. If they are coming from a few CNAs (and if they can fix it at the source), then we can consider it deprecated and slated for removal in the future.
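
To make the split between mechanical and by-hand up-conversion concrete, here is a rough Go sketch. The `Range` type, the `upconvert` function, and the choice of '0' as the starting version are assumptions made for illustration; they are not part of the schema or of any real converter. Only =, <, and <= are handled. A >= entry usually needs to be paired with an upper bound (as in the Jenkins example above), which this sketch does not attempt, so it is reported as unhandled along with >, !>, ?> and the rest.

```go
package main

import "fmt"

// Range is a hypothetical in-memory form of one JSON 5 "versions" entry.
type Range struct {
	Version, LessThan, LessThanOrEqual, Status string
}

// upconvert maps one JSON 4 (version_affected, version_value) pair to a
// JSON 5 style range. handled is false when the operator should be left
// for pairing logic or manual review. Using "0" as the start of the
// range is an assumption.
func upconvert(op, value string) (r Range, handled bool) {
	switch op {
	case "=":
		return Range{Version: value, Status: "affected"}, true
	case "<":
		return Range{Version: "0", LessThan: value, Status: "affected"}, true
	case "<=":
		return Range{Version: "0", LessThanOrEqual: value, Status: "affected"}, true
	}
	return Range{}, false
}

func main() {
	pairs := [][2]string{{"<", "2.0"}, {"<=", "1.8"}, {"=", "1.09"}, {">", "19.4R3"}, {"?>", "1.8"}}
	for _, p := range pairs {
		if r, ok := upconvert(p[0], p[1]); ok {
			fmt.Printf("%-2s %-7s -> %+v\n", p[0], p[1], r)
		} else {
			fmt.Printf("%-2s %-7s -> needs review\n", p[0], p[1])
		}
	}
}
```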

ElectricNroff commented 3 years ago

In JSON 4, "version_affected": "<=" implies that, somewhere on the timeline after version_value, an event occurs such that the status is no longer asserted to be "affected" - and "unaffected" and "unknown" are both plausible post-event statuses. Here, "the timeline" is used to mean any of the mechanisms for entering version data, e.g., changes, version, or lessThan. The argument for lessThanOrEqual in JSON 5 is:

- there are thousands of affected CVE records
- without lessThanOrEqual, upconversion has two anomalies:
  - all that we know about the event's position on the timeline is that it's greater than version_value; we don't know what version_value+1 means for every versionType.
  - within the context of the CVE Record data alone, the post-event status is ambiguous
- there may be no volunteers who can determine all of the correct post-event statuses before the deadline (November 2021)
- if upconversion always chooses "unaffected" or always chooses "unknown" for the post-event status, then it destroys data that another entity may be relying on, because they use a different method to estimate what <= means

For this last point, another entity (e.g., a commercial vulnerability-assessment product) may currently be relying on https://github.com/CVEProject/cvelist to deliver computable data to its own constituents, e.g., with a more complex algorithm such as:

        switch assigner {
        case "contact@wpscan.com":
                fmt.Println("the post-event status is unknown")
        case "cna@mongodb.com",
                "psirt@adobe.com",
                "psirt@paloaltonetworks.com",
                "security@tibco.com":
                fmt.Println("the post-event status is unaffected")
        default:
                fmt.Println("the event's meaning is unspecified")
        }

If upconversion always maps <= to the same post-event status, then it's impossible for that entity (using only the JSON 5 document set) to deliver the data quality that they previously delivered. Also, having them continue to use the JSON 4 document set forever isn't a good solution because, starting sometime in 2022, the JSON 4 document set will reach end-of-life.

Examples:

https://github.com/CVEProject/cvelist/blob/master/2021/24xxx/CVE-2021-24474.json and https://github.com/CVEProject/cvelist/blob/master/2021/24xxx/CVE-2021-24142.json might imply that contact@wpscan.com always uses <= to mean the vulnerability wasn't (yet) fixed, and always uses < to mean the vulnerability was fixed
https://github.com/CVEProject/cvelist/blob/master/2021/20xxx/CVE-2021-20328.json might imply that cna@mongodb.com always uses <= to mean the vulnerability was fixed in the next version
https://github.com/CVEProject/cvelist/blob/master/2021/28xxx/CVE-2021-28546.json might imply that psirt@adobe.com always uses <= to mean the vulnerability was fixed in the next version
https://github.com/CVEProject/cvelist/blob/master/2021/3xxx/CVE-2021-3033.json might imply that psirt@paloaltonetworks.com always uses <= to mean the vulnerability was fixed in the next version
https://github.com/CVEProject/cvelist/blob/master/2021/28xxx/CVE-2021-28817.json might imply that security@tibco.com always uses <= to mean the vulnerability was fixed in the next version

The situation may be less consistent when:

- a CNA (except for contact@wpscan.com) produces <= data about products that it doesn't directly control, e.g.,
  - cve@cert.org.tw
  - cve@rapid7.com
  - ics-cert@hq.dhs.gov
  - info@cert.vde.com
  - vulnerabilitylab@whitesourcesoftware.com
- the CNA has many persons who are producing <= data for subparts of the CNA's scope (e.g., security@apache.org)

tcullum-rh commented 3 years ago

FWIW, I think that the content/diagrams in the introductory slides above should at least be referenced somewhere in the docs for the version array or in whatever User Guide we eventually create. The visualizations are very important to aid in understanding what is being done here, and understanding is important to proper usage.

I generated some docs using json-schema-for-humans, which generates HTML docs based off of those descriptions. I'm still not confident that the majority of CNAs will understand the implications behind all of that from those schema descriptions alone.

chandanbn commented 3 years ago

@ElectricNroff Summarizing your concern: there are many CVE entries that simply say something like "CVE affects versions before v1, before v2, and before v3" and nothing else (no version group, no starting points, no affirmative not-affected statements). In those cases:

defaultStatus: unaffected
versions: [{
    version: '0' // do we need a first-ever indicator? empty string, 0 or * ?
    status: affected
    versionType: semver if versions match semver pattern, else custom.
    lessThan: '*'
    changes: [{
        at: v1    status: unaffected
        at: v2    status: unaffected
        at: v3    status: unaffected
    }]
}]

alternatively:

defaultStatus: unaffected
versions: [{
    version: '0'
    lessThan: v1
    status: affected
    versionType: semver if versions match semver pattern, else custom.
},{
    version: '0'
    lessThan: v2
    status: affected
    versionType: semver if versions match semver pattern, else custom.
},{
    version: '0'
    lessThan: v3
    status: affected
    versionType: semver if versions match semver pattern, else custom.
}]

The entries were not computable in v4, and they will not be computable in v5. IMHO that is acceptable, as this bug/pull request is not about making previously uncomputable information computable. The CNAs now have better ways to express the same information.

Update: defaultStatus is set to unaffected. That gives the expected results.

ElectricNroff commented 3 years ago

JSON 4 data that says "before" (aka the < comparison) isn't one of the hardest cases. JSON 4 data that says <= (sometimes expressed as "through v#.#.#") is a hard one. Also, I don't think either of your options for "before" would typically be used. Adjacent entries on an "at" timeline should have different statuses. Also, multiple entries of version zero and the same status can be replaced by the one entry with the highest limit (i.e., the v3 one). If the available data is that versions before 1.7.3, before 2.3.9, and before 3.2.1 are affected, then there are three upconversion options that may be reasonable choices:

use defaultStatus = unknown; and guess that 1, 2, and 3 are the starting points of different affected ranges, with lessThan values of 1.7.3, 2.3.9, and 3.2.1 respectively; also, list 3.2.1 through infinity as an unaffected range
use defaultStatus = unknown; list 1.7.3 and 2.3.9 as simple version numbers with the unaffected status; and list 3.2.1 through infinity as an unaffected range
use defaultStatus = unknown; and list 1.7.3, 2.3.9, and 3.2.1 as simple version numbers with the unaffected status

Of course, only the third option can be error-free. The third option can often work well for CVE consumers who use the CVE Record data very soon after it's published (e.g., before the vendor has an opportunity to release 3.2.2). This scenario applies to CNAs who will continue to use that < data pattern in their JSON 4 documents that are published after CVE Services 2.0 has launched.

rsc commented 3 years ago

@chandanbn I think you meant 'defaultStatus: unaffected' throughout https://github.com/CVEProject/cve-schema/issues/87#issuecomment-906584822

rsc commented 3 years ago

@ElectricNroff, with both lessThan and lessThanOrEqual as options, along with the defaultStatus we added at your earlier suggestion, it looks to me like essentially all the JSON 4 data can be encoded faithfully. There is a question of what to do with entries that don't explicitly say "version X and above are unaffected", but that's a question for the converter: whatever the answer should be, it can be encoded precisely and clearly.

I can't quite tell: is your last comment arguing in favor of lessThanOrEqual, or are you saying that something else is needed as well?

ElectricNroff commented 3 years ago

I feel that the current design (e.g., with defaultStatus, lessThan, and lessThanOrEqual) is adequate, but that (when reasonably achievable) the upconverter should avoid adding explicit assertions that weren't present in the JSON 4 data.

For example, from the perspective of the algorithm used by the CVE Program, these two (which could be chosen for <= 3.2.1 in JSON 4 data) are exactly equivalent:

defaultStatus: unknown
...
versions: [
     {
        version: 0, lessThanOrEqual: 3.2.1, versionType: semver,
        status: affected
     }
]

defaultStatus: unknown
...
versions: [
     {
        version: 0, lessThanOrEqual: 3.2.1, versionType: semver,
        status: affected
     },
     {
        version: 3.2.2, lessThan: *, versionType: semver,
        status: unknown
     }
]
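
The equivalence of these two encodings can be checked with a small Go sketch of the lookup. This is a simplification under stated assumptions: the `Range`, `Product`, `compare`, and `status` names are made up for illustration, `changes` handling is omitted, and `compare` only orders plain dotted numeric versions rather than implementing full semver.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Range and Product are hypothetical in-memory forms of the JSON 5 fields.
type Range struct {
	Version, LessThan, LessThanOrEqual, Status string
}

type Product struct {
	DefaultStatus string
	Versions      []Range
}

// compare orders plain dotted numeric versions such as "3.2.1".
// It is a stand-in for a real semver comparison (no pre-release tags).
func compare(a, b string) int {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		var x, y int
		if i < len(as) {
			x, _ = strconv.Atoi(as[i])
		}
		if i < len(bs) {
			y, _ = strconv.Atoi(bs[i])
		}
		if x < y {
			return -1
		}
		if x > y {
			return 1
		}
	}
	return 0
}

// status returns the status of the first range containing v,
// or the product's defaultStatus when no range matches.
func status(p Product, v string) string {
	for _, r := range p.Versions {
		if compare(v, r.Version) < 0 {
			continue // v is below the start of this range
		}
		var in bool
		switch {
		case r.LessThan == "*":
			in = true // open-ended range
		case r.LessThan != "":
			in = compare(v, r.LessThan) < 0
		case r.LessThanOrEqual != "":
			in = compare(v, r.LessThanOrEqual) <= 0
		default:
			in = compare(v, r.Version) == 0 // single version
		}
		if in {
			return r.Status
		}
	}
	return p.DefaultStatus
}

// first and second mirror the two encodings shown above.
var first = Product{DefaultStatus: "unknown", Versions: []Range{
	{Version: "0", LessThanOrEqual: "3.2.1", Status: "affected"},
}}

var second = Product{DefaultStatus: "unknown", Versions: []Range{
	{Version: "0", LessThanOrEqual: "3.2.1", Status: "affected"},
	{Version: "3.2.2", LessThan: "*", Status: "unknown"},
}}

func main() {
	for _, v := range []string{"1.0.0", "3.2.1", "3.2.2", "4.0.0"} {
		fmt.Println(v, status(first, v), status(second, v))
	}
}
```

For every version tried, both encodings return the same status. The difference is only in how the answer is reached: the first falls through to defaultStatus, while the second matches an explicit range.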

The reason that the first one is preferable is that a different entity (e.g., a commercial vulnerability-assessment product) may have the resources to develop their own algorithm that replaces:

return product.defaultStatus

with something like:

if ((version array has a length of 1 and contains lessThanOrEqual) and cveMetadataPublished.assigner == theAdobeUuid) {
    return unaffected
}
return product.defaultStatus

if their customers demand that (and if Adobe was unwilling to change the data).

In other words, immediately before the "return product.defaultStatus" line is a hook point that third parties can use to insert their own code. In an actual use case, the third party would have to start from the algorithm pseudocode and implement a modified version on their own. The CVE Program isn't planning to package the algorithm as a standalone software product (and, even if it did, the product wouldn't ship with a supported extension framework).
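
As a rough illustration of that hook point, a third party's replacement for the final fallthrough step might look like the Go sketch below. All names are hypothetical, and `adobeAssigner` is a placeholder, since the actual assigner UUID is not given in this thread. The function is only meant to be called when no entry in the version array matched the version being checked.

```go
package main

import "fmt"

// Range and Product are hypothetical in-memory forms of the JSON 5 fields.
type Range struct {
	Version, LessThan, LessThanOrEqual, Status string
}

type Product struct {
	DefaultStatus string
	Versions      []Range
}

// adobeAssigner is a placeholder value, not a real assigner UUID.
const adobeAssigner = "hypothetical-adobe-uuid"

// fallbackStatus replaces the plain "return product.defaultStatus" step
// with a third party's own rule for one assigner.
func fallbackStatus(p Product, assigner string) string {
	if len(p.Versions) == 1 && p.Versions[0].LessThanOrEqual != "" && assigner == adobeAssigner {
		return "unaffected"
	}
	return p.DefaultStatus
}

func main() {
	p := Product{
		DefaultStatus: "unknown",
		Versions:      []Range{{Version: "0", LessThanOrEqual: "3.2.1", Status: "affected"}},
	}
	fmt.Println(fallbackStatus(p, adobeAssigner))         // unaffected
	fmt.Println(fallbackStatus(p, "some-other-assigner")) // unknown
}
```

With the second encoding above, a version such as 3.2.2 would match the explicit "unknown" range and never reach this step, which is why the upconverter should avoid adding that range.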