aboutcode-org / vulnerablecode

A free and open vulnerabilities database and the packages they impact. And the tools to aggregate and correlate these vulnerabilities. Sponsored by NLnet https://nlnet.nl/project/vulnerabilitydatabase/ for https://www.aboutcode.org/ Chat at https://gitter.im/aboutcode-org/vulnerablecode Docs at https://vulnerablecode.readthedocs.org/
https://public.vulnerablecode.io
Apache License 2.0

Improve VCIO bulk API package lookup performance #1561

Open pombredanne opened 3 months ago

pombredanne commented 3 months ago

From https://github.com/aboutcode-org/dejacode/issues/94#issuecomment-2298445423 by @tdruez

Could you tell me the PURL types from the list that are not supported (no data available) by VCIO? Excluding those will reduce the number of "useless" requests to the API. ['gem', 'autotools', 'sourceforge', 'bitbucket', 'rpm', 'gitlab', 'cran', 'windows-program', 'docker', 'bower', 'nuget', 'generic', 'cargo', 'npm', 'deb', 'golang', 'maven', 'composer', 'pypi', 'hackage', 'unknown', 'rubygems', 'about', 'github']

Well, for example, we have ±300,000 sourceforge PURLs in the nexB Dataspace; doing lookups for those is a total waste of time and resources.

More context: For ±133,000 packages in the nexB Dataspace, it currently takes about 1h and 2,674 HTTP requests to the VCIO API.

The result is only 1,235 vulnerabilities fetched and created. Seems like there's a lot of wasted time and resources with our current approach.
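For scale, the figures above work out to roughly 50 packages per request and a little over one second per request. A quick back-of-the-envelope check, using only the approximate numbers quoted above (the variable names are just for illustration):

```python
# Approximate figures quoted above for the nexB Dataspace run.
packages = 133_000
requests_made = 2_674
duration_seconds = 60 * 60  # "about 1h"
vulnerabilities = 1_235

# Average batch size, latency per request, and lookups per vulnerability found.
packages_per_request = packages / requests_made
seconds_per_request = duration_seconds / requests_made
packages_per_vulnerability = packages / vulnerabilities

print(f"{packages_per_request:.1f} packages per request")
print(f"{seconds_per_request:.2f} seconds per request")
print(f"{packages_per_vulnerability:.0f} packages looked up per vulnerability found")
```

These are averages only; the run time is quoted as "about 1h", so the per-request latency is an estimate.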

I suggest these progressive steps:

tdruez commented 3 months ago

From https://github.com/aboutcode-org/dejacode/issues/94#issuecomment-2298761954

@pombredanne Thanks, this sounds like it will require some work to make this happen.

In the short term, could VCIO expose a new "action" on the package endpoint to get this list of supported types? (It should be a very small and fast query.) On the DejaCode side, the process could start by fetching the available types to get a QuerySet limited to those and drastically reduce the number of queries.

>>> unique_types = Package.objects.values_list("type", flat=True).distinct()
>>> unique_types
<PackageQuerySet ['about', 'cargo', 'cocoapods', 'composer', 'deb', 'github', ...
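On the client side, the filtering step could look something like the sketch below. It is a minimal illustration in plain Python, not DejaCode's or VCIO's actual code: `purl_type`, `filter_supported`, and the sample values are hypothetical, and the supported types are hard-coded here because the endpoint that would return them does not exist yet.

```python
def purl_type(purl: str) -> str:
    """Return the lowercased type of a PURL such as 'pkg:pypi/django@4.2'."""
    scheme, _, remainder = purl.partition(":")
    if scheme.lower() != "pkg" or not remainder:
        raise ValueError(f"Not a PURL: {purl!r}")
    # The type is the first path segment; tolerate optional leading slashes.
    return remainder.lstrip("/").split("/", 1)[0].lower()


def filter_supported(purls, supported_types):
    """Keep only the PURLs whose type is in supported_types."""
    supported = {t.lower() for t in supported_types}
    return [p for p in purls if purl_type(p) in supported]


# Hypothetical values: in practice supported_types would come from VCIO.
supported_types = ["cargo", "deb", "pypi"]
purls = [
    "pkg:pypi/django@4.2",
    "pkg:sourceforge/some-project@1.0",
    "pkg:deb/debian/curl@7.88.1",
]
to_lookup = filter_supported(purls, supported_types)
print(to_lookup)
```

Only the PURLs that survive the filter would then be sent to the bulk API, so types with no data never generate a request.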

tdruez commented 3 months ago

Another example that takes over a minute to load: https://public.vulnerablecode.io/api/vulnerabilities?vulnerability_id=VCID-j2zf-12g6-aaag

pombredanne commented 2 months ago

We need to change what the API returns entirely, in a new endpoint that does not include all the package details in a vulnerability. We care about packages first, and less about vulnerabilities, so when querying by vulnerability, we should not serialize so much package data.
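As a rough illustration of the idea, a leaner vulnerability payload might reduce each package to its PURL string instead of a fully serialized object. The sketch below uses plain dataclasses; the class names, field names, and the identifier are all hypothetical and do not reflect VulnerableCode's actual models or serializers.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    purl: str
    # Stand-in for the many other package fields a full serializer would emit.
    details: dict = field(default_factory=dict)


@dataclass
class Vulnerability:
    vulnerability_id: str
    summary: str
    affected_packages: list = field(default_factory=list)


def serialize_vulnerability_lean(vulnerability: Vulnerability) -> dict:
    """Serialize a vulnerability with packages reduced to their PURL strings."""
    return {
        "vulnerability_id": vulnerability.vulnerability_id,
        "summary": vulnerability.summary,
        "affected_packages": [p.purl for p in vulnerability.affected_packages],
    }


vuln = Vulnerability(
    vulnerability_id="VCID-0000-0000-0000",  # made-up identifier
    summary="Example summary",
    affected_packages=[Package("pkg:pypi/example@1.0", {"qualifiers": {}})],
)
payload = serialize_vulnerability_lean(vuln)
print(payload)
```

A client that needs full package details could then fetch them separately by PURL, keeping the vulnerability response small.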

TG1999 commented 2 months ago

This is a related issue to restructure the API:

pombredanne commented 2 months ago

See a first PR to improve the results: