DependencyTrack / hyades

Incubating project for decoupling responsibilities from Dependency-Track's monolithic API server into separate, scalable services.
https://dependencytrack.github.io/hyades/latest
Apache License 2.0

Consider using VDB6 as a data source #1155

Open prabhu opened 5 months ago

prabhu commented 5 months ago

AppThreat vulnerability-db is an MIT-licensed database used by tools such as depscan for scanning. VDB6 is now available as a downloadable SQLite database. This data would help DT support containers, Linux OS packages, and some C/C++ components with purl-based searches.

The easiest way to download the databases is with the ORAS CLI tool.

oras pull ghcr.io/appthreat/vdbxz:v6
tar -xvf data.vdb6.tar.xz
tar -xvf data.index.vdb6.tar.xz

Use any sqlite browser tool to inspect and query the databases.
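If no SQLite browser is at hand, the same inspection can be scripted. The following is a minimal sketch using Python's built-in sqlite3 module; it assumes nothing about the VDB6 schema and simply lists whatever tables and columns a given file contains. The function name list_tables is made up for illustration.

```python
import sqlite3


def list_tables(db_path):
    """Return {table_name: [column names]} for the SQLite file at db_path."""
    con = sqlite3.connect(db_path)
    try:
        names = [row[0] for row in con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        return {name: [col[1] for col in con.execute(f'PRAGMA table_info("{name}")')]
                for name in names}
    finally:
        con.close()


if __name__ == "__main__":
    import sys
    # Pass the paths of the extracted database files as arguments.
    for path in sys.argv[1:]:
        print(path, list_tables(path))
```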

Proposed integration

Possible challenges

nscuro commented 5 months ago

Thanks for the suggestion @prabhu, will definitely have a look!

If we end up pulling in a pre-compiled / curated database, what would be great to have is the possibility of only fetching deltas. As in: "Only give me data that changed since I last checked". Having to pull in an entire blob of 1-N GB for only minor changes in the dataset will be expensive both on the network and on the processing side of things.

I know this is a tricky problem which may not work when distributing the data as SQLite. Did you look into this aspect before, by chance?

prabhu commented 5 months ago

@nscuro Thank you so much for looking into this.

Note the entire compressed database is only 188MB.

total 188M
-rw-r--r-- 1 prabhu prabhu  45M Mar 22 10:05 data.index.vdb6.tar.xz
-rw-r--r-- 1 prabhu prabhu 144M Mar 22 10:05 data.vdb6.tar.xz

Regarding the delta database, the larger database has a source_data_hash column that could be used for this in the future. I am happy to collaborate and improve this.
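As a rough illustration of how a per-row hash could drive a delta, the sketch below compares the hash stored for each key in a previous snapshot with the one in a new snapshot. Only the column name source_data_hash comes from this thread; the key column cve_id, the assumption that it is unique per row, and the function name hash_delta are hypothetical and not the actual VDB6 schema.

```python
import sqlite3


def hash_delta(old_con, new_con, table="cve_data", key="cve_id"):
    """Classify keys as added, changed or removed by comparing source_data_hash.

    Assumes `key` identifies a row uniquely (an assumption, not VDB6's real layout).
    """
    query = f"SELECT {key}, source_data_hash FROM {table}"
    old = dict(old_con.execute(query))
    new = dict(new_con.execute(query))
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    changed = sorted(k for k in new if k in old and new[k] != old[k])
    return added, changed, removed
```

This only needs the key and hash columns from both snapshots, so the comparison itself is cheap, but both full snapshots still have to be downloaded first.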

nscuro commented 5 months ago

Apologies for the delay. The compression definitely is a good thing here, thanks for pointing it out!

Regarding the source_data_hash, what would be even more helpful would be an updated_at column. This way, when syncing with our internal database, we can drastically reduce the number of records we have to enumerate over. We could then do a SELECT ... WHERE updated_at > :lastSync.
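A minimal sketch of what such an incremental sync could look like, assuming a hypothetical updated_at column that stores ISO-8601 UTC timestamps as text (so that string comparison matches chronological order). The table layout, the cve_id column, and both function names are illustrative only; VDB6 does not expose updated_at at the time of this discussion.

```python
import sqlite3


def rows_changed_since(con, last_sync):
    """Fetch only the rows touched after the previous sync, oldest first."""
    return con.execute(
        "SELECT cve_id, updated_at FROM cve_data "
        "WHERE updated_at > :lastSync ORDER BY updated_at",
        {"lastSync": last_sync},
    ).fetchall()


def next_watermark(rows, previous):
    """The new :lastSync value is the newest updated_at seen, if any."""
    return rows[-1][1] if rows else previous
```

Persisting the value returned by next_watermark after each successful run is the "state-keeping" part on the consumer side.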

Would that be a viable thing for VDB to add? I reckon it would require some sort of state-keeping between successive builds of the DB...

nscuro commented 4 months ago

Responding to my own question above, I think the point

Fork the vdb repo and publish vdb6 artefacts under DT org

from the issue description kinda covers that already. Essentially we can do the state-keeping and enrichment with updated_at ourselves, in our fork.

prabhu commented 4 months ago

@nscuro I will look into the updated timestamp to see if there is a way to expose it as a column. At this point, I am not sure if all the sources correctly update this timestamp and there are sources with no timestamps too, and hence went with the hash of the metadata.

Shall we explore alternatives to syncing the database like having a temp table for VDB6 or searching the sqlite directly for any hits from the index database?

prabhu commented 4 months ago

Another option is to use sqldiff to find the differing rows, but I have not tried this command yet.

Update:

Download sqldiff from here - https://www.sqlite.org/download.html

To quickly get a summary of the differences:

sqldiff --summary --table cve_data data.vdb6 data.vdb6.bak

cve_data: 1901635 changes, 0 inserts, 18060 deletes, 195753 unchanged

To create SQL update statements for only the changed rows (this took a few minutes for me):

sqldiff --table cve_data data.vdb6 data.vdb6.bak > out.sql

nscuro commented 4 months ago

sqldiff definitely looks closer to what we'd need.
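For context, sqldiff emits SQL that transforms the first database into the second, so a consumer holding an old snapshot would want the old file as the first argument and the new one as the second. A sketch of applying such a patch atomically with Python's sqlite3 module follows. The function name apply_patch is made up, and any patch text fed to it in testing would be hand-written in the general shape of sqldiff output, not real output.

```python
import sqlite3


def apply_patch(con, patch_sql):
    """Run a sqldiff-style SQL script inside a single transaction.

    If any statement fails, the whole patch is rolled back so the local
    copy never ends up half-updated.
    """
    try:
        con.executescript("BEGIN;\n" + patch_sql + "\nCOMMIT;")
    except sqlite3.Error:
        if con.in_transaction:
            con.rollback()
        raise
```

Usage would be along the lines of apply_patch(sqlite3.connect("local-copy.db"), open("out.sql").read()), where local-copy.db is a placeholder name.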

At this point, I am not sure if all the sources correctly update this timestamp and there are sources with no timestamps too, and hence went with the hash of the metadata.

While the created/updated timestamps of the upstream sources are nice to have, for our use case we are more interested in when VDB6 updated a given entry. Say we fix a bug in how vers ranges are assembled or in how certain fields from upstream sources are parsed, or we apply manual corrections. Essentially we need to know when either the upstream data or the VDB6 logic changed.

prabhu commented 4 months ago

@nscuro, can this be achieved by fixing the version in the pipeline here?

nscuro commented 4 months ago

Side note, the selection of ORAS clients is rather sparse right now. The library proposed in the issue description might work, but would pull in Kotlin as an additional dependency. It's also fairly new with only a single maintainer.

Considering we won't need the full capabilities of ORAS, we should implement the "pull" functionality ourselves, without adding new dependencies. In the end it's just an HTTP API. Spec is here: https://github.com/opencontainers/distribution-spec/blob/main/spec.md#pull
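To give a feel for how small the "pull" part is, here is a dependency-free sketch in Python (the real implementation would presumably live in the Java code base). It uses the two endpoints from the distribution spec, GET /v2/<name>/manifests/<reference> and GET /v2/<name>/blobs/<digest>, and relies on ORAS storing each file name in the org.opencontainers.image.title layer annotation. Obtaining the bearer token is registry-specific and left out; all function names are made up, and the sketch has not been run against a live registry.

```python
import hashlib
import json
import urllib.request

OCI_MANIFEST = "application/vnd.oci.image.manifest.v1+json"
TITLE = "org.opencontainers.image.title"


def manifest_url(registry, name, reference):
    return f"https://{registry}/v2/{name}/manifests/{reference}"


def blob_url(registry, name, digest):
    return f"https://{registry}/v2/{name}/blobs/{digest}"


def layer_files(manifest):
    """Map each layer's file name (falling back to its digest) to its digest."""
    return {layer.get("annotations", {}).get(TITLE, layer["digest"]): layer["digest"]
            for layer in manifest["layers"]}


def verify(digest, data):
    """Check content against a digest of the form '<algorithm>:<hex>'."""
    algo, _, expected = digest.partition(":")
    return hashlib.new(algo, data).hexdigest() == expected


def _get(url, token, accept=None):
    headers = {"Authorization": f"Bearer {token}"}
    if accept:
        headers["Accept"] = accept
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return response.read()


def pull(registry, name, reference, token):
    """Download all layers of an artifact into memory as {file name: bytes}.

    A real implementation should stream blobs to disk instead.
    """
    manifest = json.loads(_get(manifest_url(registry, name, reference), token, OCI_MANIFEST))
    files = {}
    for filename, digest in layer_files(manifest).items():
        data = _get(blob_url(registry, name, digest), token)
        if not verify(digest, data):
            raise ValueError(f"digest mismatch for {filename}")
        files[filename] = data
    return files
```

For the artifact mentioned in the issue description, the call would look like pull("ghcr.io", "appthreat/vdbxz", "v6", token). Comparing the manifest digest against the one from the previous run would also be a cheap way to detect whether anything changed at all.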

sahibamittal commented 4 months ago

Some of the observations I found:

prabhu commented 4 months ago

@sahibamittal, Thank you. Re (1), I am not a fan of EPSS, so I am unlikely to ever add support for it. For (2), we can enhance this code to accept a comma-separated list of osv keys and create a new osv_url_dict which will get used subsequently.

nscuro commented 4 months ago

Inclusion of EPSS is something we could add as additional enrichment on our side.

@prabhu Any thoughts on resolving alias relationships? We did some research on this a while back, and found that alias data from some sources is wild west (mostly OSV), however data from GHSA is usually reliable. I'd assume the same to be true for Linux distro feeds.

Alias resolution is something that is easiest when all relevant data is present, so VDB6 is in a great position to make this happen as a post-build enrichment.
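One possible shape for such a post-build step, sketched in Python: treat each alias claim as an edge between two identifiers, keep only edges asserted by sources considered reliable, and merge the rest with a union-find. The default trusted set, the (source, id, id) input format, and the function name resolve_aliases are assumptions for illustration, not anything VDB6 does today.

```python
def resolve_aliases(pairs, trusted=frozenset({"GHSA"})):
    """Group identifiers into alias sets using only claims from trusted sources.

    `pairs` is an iterable of (source, id_a, id_b) tuples.
    Returns a sorted list of sorted alias groups with at least two members.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for source, a, b in pairs:
        if source in trusted:
            parent[find(a)] = find(b)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)
```

Claims from untrusted sources are simply ignored here; a more careful variant could keep them as low-confidence hints instead of dropping them.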

prabhu commented 4 months ago

@nscuro, interesting idea! Aliases are currently set in the description section for some sources. VDB tries to resolve the CVE id if available to reduce duplicates. But definitely an idea for a future enhancement.

prabhu commented 3 months ago

@nscuro, now that CVE 5.1 is released with support for purl, I am thinking of prioritizing VDB 6.1, which will use the 5.1 schema with a couple of breaking changes. Additionally, we can support the vulnrichment repo (auto-upgraded to 5.1 format).

Are you ok with parking this issue and revisiting it around September 2024?

nscuro commented 3 months ago

@prabhu Most certainly.