aboutcode-org / vulnerablecode

A free and open vulnerabilities database and the packages they impact. And the tools to aggregate and correlate these vulnerabilities. Sponsored by NLnet https://nlnet.nl/project/vulnerabilitydatabase/ for https://www.aboutcode.org/ Chat at https://gitter.im/aboutcode-org/vulnerablecode Docs at https://vulnerablecode.readthedocs.org/
https://public.vulnerablecode.io
Apache License 2.0
519 stars, 190 forks

Improve VCIO bulk API package lookup performance #1561

Open pombredanne opened 3 weeks ago

pombredanne commented 3 weeks ago

From https://github.com/aboutcode-org/dejacode/issues/94#issuecomment-2298445423 by @tdruez

Could you tell me the PURL types from the list that are not supported (no data available) by VCIO? Excluding those will reduce the number of "useless" requests to the API. ['gem', 'autotools', 'sourceforge', 'bitbucket', 'rpm', 'gitlab', 'cran', 'windows-program', 'docker', 'bower', 'nuget', 'generic', 'cargo', 'npm', 'deb', 'golang', 'maven', 'composer', 'pypi', 'hackage', 'unknown', 'rubygems', 'about', 'github']

Well, for example, we have ±300,000 sourceforge PURLs in the nexB Dataspace; doing lookups for those is a total waste of time and resources.

More context: For ±133,000 packages in the nexB Dataspace, it currently takes about 1h and 2,674 HTTP requests made to the VCIO API.

The result is only 1,235 vulnerabilities fetched and created. Seems like there's a lot of wasted time and resources with our current approach.
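
To put those figures in perspective, here is a quick back-of-the-envelope calculation. It only uses the approximate numbers quoted above; the variable names are illustrative and nothing here is measured independently.

```python
# Approximate figures quoted in the comment above.
packages = 133_000          # packages in the nexB Dataspace
requests_made = 2_674       # HTTP requests sent to the VCIO API
elapsed_seconds = 60 * 60   # "about 1h"
vulnerabilities = 1_235     # vulnerabilities fetched and created

# Derived ratios: batch size, latency per request, and yield.
packages_per_request = packages / requests_made
seconds_per_request = elapsed_seconds / requests_made
vulnerabilities_per_1000_packages = 1000 * vulnerabilities / packages

print(f"{packages_per_request:.1f} packages per request")
print(f"{seconds_per_request:.2f} s per request")
print(f"{vulnerabilities_per_1000_packages:.1f} vulnerabilities per 1,000 packages")
```

In other words, each request carries roughly 50 packages and takes a bit over a second, while fewer than 10 vulnerabilities come back per 1,000 packages looked up.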

I suggest these progressive steps:

tdruez commented 3 weeks ago

From https://github.com/aboutcode-org/dejacode/issues/94#issuecomment-2298761954

@pombredanne Thanks, this sounds like it will require some work to make this happen.

In the short term, could VCIO expose a new "action" on the package endpoint to get this list of supported types? (It should be a very small and fast query.) On the DejaCode side, the process could start by fetching the available types, limit the QuerySet to those, and drastically reduce the number of queries.

>>> unique_types = Package.objects.values_list("type", flat=True).distinct()
>>> unique_types
<PackageQuerySet ['about', 'cargo', 'cocoapods', 'composer', 'deb', 'github', ...
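
A minimal sketch of the client-side filtering idea, in plain Python rather than Django so it runs on its own. The helper names `purl_type` and `filter_supported` are hypothetical, and the set of supported types is an assumption for illustration: the real list would come from the proposed VCIO "action", which does not exist yet.

```python
def purl_type(purl: str) -> str:
    """Return the lowercased type of a PURL, e.g. 'pkg:pypi/django@4.2' -> 'pypi'."""
    if not purl.startswith("pkg:"):
        raise ValueError(f"Not a PURL: {purl!r}")
    # Drop the scheme and any leading slashes, then keep the first path segment.
    remainder = purl[len("pkg:"):].lstrip("/")
    type_, _, _ = remainder.partition("/")
    return type_.lower()


def filter_supported(purls, supported_types):
    """Keep only the PURLs whose type is in supported_types."""
    supported = {t.lower() for t in supported_types}
    return [p for p in purls if purl_type(p) in supported]


# Assumed list of supported types; the real one would be fetched from VCIO.
supported_types = {"pypi", "npm", "maven"}

purls = [
    "pkg:pypi/django@4.2",
    "pkg:sourceforge/some-project@1.0",
    "pkg:npm/lodash@4.17.21",
]
to_look_up = filter_supported(purls, supported_types)
print(to_look_up)
```

With a filter like this applied before batching, PURL types with no data on the VCIO side (such as the sourceforge entry here) would never be sent to the API.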

tdruez commented 3 weeks ago

Another example that takes over a minute to load: https://public.vulnerablecode.io/api/vulnerabilities?vulnerability_id=VCID-j2zf-12g6-aaag

pombredanne commented 1 day ago

We need to change the API data we return entirely, in a new endpoint that does not provide all the package details in a vulnerability. We care about packages first, and less about vulnerabilities, so when querying by vulnerability, we should not serialize so much package data.
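
As a rough illustration of what "not serializing so much package data" could look like, the sketch below trims a vulnerability record down to its identifier, summary, and a flat list of PURLs. The field names (`vulnerability_id`, `summary`, `affected_packages`, `purl`) and the helper `slim_vulnerability` are assumptions made for this example, not the actual VCIO schema or code.

```python
def slim_vulnerability(vulnerability: dict) -> dict:
    """Return a reduced view of a vulnerability with package PURLs only."""
    return {
        "vulnerability_id": vulnerability["vulnerability_id"],
        "summary": vulnerability.get("summary", ""),
        # Keep one string per package instead of the full nested package record.
        "affected_purls": [
            package["purl"]
            for package in vulnerability.get("affected_packages", [])
        ],
    }


# Hypothetical full record, with nested package details.
full_record = {
    "vulnerability_id": "VCID-xxxx-xxxx-xxxx",
    "summary": "Example summary",
    "affected_packages": [
        {"purl": "pkg:pypi/example@1.0", "type": "pypi", "name": "example",
         "version": "1.0", "qualifiers": {}, "subpath": ""},
        {"purl": "pkg:pypi/example@1.1", "type": "pypi", "name": "example",
         "version": "1.1", "qualifiers": {}, "subpath": ""},
    ],
}

print(slim_vulnerability(full_record))
```

A client that needs the full package details could then fetch them separately by PURL, keeping vulnerability responses small.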

TG1999 commented 1 day ago

Related: https://github.com/aboutcode-org/vulnerablecode/issues/1572