AppThreat / vulnerability-db

Vulnerability database and package search for sources such as Linux, OSV, NVD, GitHub and npm. Powered by sqlite, CVE 5.0, purl, and vers.
MIT License
93 stars 22 forks source link

Simplify `vers` ranges #132

Open nscuro opened 5 months ago

nscuro commented 5 months ago

The cve_index table currently contains many vers ranges with redundant constraints.

image

As per vers specification:

These pairs of contiguous constraints with these comparators are redundant and invalid (ignoring any != since they can show up anywhere):

  • =, < or <= followed by < or <=: this is the same as < or <=
  • > or >= followed by =, > or >=: this is the same as > or >=

The algorithm to simplify is documented in the spec: https://github.com/package-url/purl-spec/blob/version-range-spec/VERSION-RANGE-SPEC.rst#version-constraints-simplification

I believe the univers library supports simplification as well.

Looking at some of the constraints, it makes me wonder if perhaps the current representation is a result of attempting to split ranges into pairs of 2 (as demonstrated in the spec)?

In that case, care needs to be taken to not break the meaning of range by erroneously omitting the corresponding lower or upper bounds. Sadly, an algorithm for splitting is not part of the spec, but @sahibamittal implemented one in versatile recently (tests for verification here).

prabhu commented 5 months ago

@nscuro we had the opposite problem with normalization and implementing a version comparison logic that can work with both normalized and non-normalized versions of vers. In the end, we kept things simple to simply duplicate the constraints, so that each vers is guaranteed to be correct.

nscuro commented 5 months ago

Thanks for the clarification.

Is the problem that univers doesn't support all required versioning schemes for the package types that the database pulls in?

Or is it that even those supported by univers are breaking due to garbage data / unconventional versions in the vuln feeds?

Trying to understand the source of the problem, because if vulnerability-db has these issues while building the database, consumers of it will have the same problems eventually. A majority of functionality in vers is bound to the requirement that versions can be compared and sorted.

prabhu commented 5 months ago

@nscuro I have not tried univers yet. The issue is that some of our sources result in a list of tuples where each tuple is min, max, min_excluding, and max_excluding. We have not come up with a logic to convert such a list of tuples to a single normalized vers where even the list could be unsorted. Other than multiple rows what problem do you foresee with this implementation?

nscuro commented 5 months ago

We have not come up with a logic to convert such a list of tuples to a single normalized vers where even the list could be unsorted.

Ideally this would be done by univers. In the simplest case, you should be able to simply "dump" all constraints into a single vers range, and then let univers take care of sorting and simplification. We have a few examples of conversions from GHSA, NVD, and OSV notations here.

Do you have some examples of tuples handy where this would not be feasible?

Other than multiple rows what problem do you foresee with this implementation?

Less a problem, more of an idealistic view that the vulnerability database should provide normalized data instead of having each client perform normalization. After all, it's required to evaluate a range. If the database is already unable to perform this task, the chance of consumers failing to do it is high.

prabhu commented 5 months ago

@nscuro, different sources could be describing the same component, but offering slightly different versions. It will require database UPDATE calls which will be quite slow, since the indexes are added only at the end of the creation process.

prabhu commented 5 months ago

@nscuro, Here is an example I came across where normalization is quite difficult since the version ranges across the sources do not agree.

vers-normalize

nscuro commented 5 months ago

In that case it might make sense to track the source for each range, so we can attribute wrong ranges to where they came from. And then potentially apply manual corrections for those sources.

How do you handle that data right now in depscan? If you were to consider all rows, you'd be getting quite a few FPs at the moment.

prabhu commented 5 months ago

@nscuro, yup. This is the idea behind an attribute called matching_vers in search.py. But, of course, it is up to the consuming application to decide on the logic.

depscan currently use the old VDB5 database and then filters the results based on CVE IDs alone. Perhaps, the vulnerabilities array could be used to present the data from all the sources?