aboutcode-org / purldb

Tools to create and expose a database of purls (Package URLs). This project is sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase/ and nexB for https://www.aboutcode.org/ Chat is at https://gitter.im/aboutcode-org/discuss
https://purldb.readthedocs.io/

Overhaul needed for Package scan requests and indexing #49

Closed JonoYang closed 7 months ago

JonoYang commented 1 year ago

I've set up purldb and scancode.io locally, where I run make run_visit and make run_map to visit and map Maven packages. I then run make request_scans and make process_scans to get information on the Resources in the packages we visit and map. I've noticed that the scan requests we send off to scancode.io are for multiple versions of the same package. This causes a few problems:

For the first two issues, we need to come up with a new way to group and index fingerprints. A starting idea would be to come up with something a bit more general. Currently, we create directory fingerprints for every package we map. If two packages we index are the same package but different versions, then we may store the same fingerprints twice. We could instead index fingerprints to a package in general, rather than to a specific package version.
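As a rough illustration of indexing fingerprints to a package in general, here is a minimal sketch. The `FingerprintIndex` class and `split_version` helper are hypothetical names, not purldb code, the purl parsing is deliberately simplified, and the purls in the usage note are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_version(purl: str) -> Tuple[str, str]:
    """Return (versionless purl, version) for a purl string.

    Simplified on purpose: qualifiers and subpath are dropped, and the
    version is assumed to follow the first "@" in the last path segment.
    """
    base = purl.split("#", 1)[0].split("?", 1)[0]
    head, sep, tail = base.rpartition("/")
    name, _, version = tail.partition("@")
    return f"{head}{sep}{name}", version


class FingerprintIndex:
    """Hypothetical index: fingerprint -> package -> versions seen."""

    def __init__(self) -> None:
        self._by_fingerprint = defaultdict(lambda: defaultdict(set))

    def add(self, fingerprint: str, purl: str) -> None:
        package, version = split_version(purl)
        # The fingerprint is stored once per package; versions are only
        # recorded as a set next to it.
        self._by_fingerprint[fingerprint][package].add(version)

    def lookup(self, fingerprint: str) -> Dict[str, List[str]]:
        matches = self._by_fingerprint.get(fingerprint, {})
        return {pkg: sorted(versions) for pkg, versions in matches.items()}
```

With this shape, adding the same fingerprint for `pkg:maven/yom/yom@1.0-alpha-1` and for a second version of the same package yields a single entry under `pkg:maven/yom/yom` with both versions listed.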

For the second issue, we will have to flip the current scan queue request model. purldb will have a queue of packages that it wants scanned and it will be up to scancode.io to poll purldb to see what needs to be scanned. scancode.io would poll purldb, get the package that needs to be scanned, scan and fingerprint it, then send the results back to purldb. This issue is tracked at https://github.com/nexB/purldb/issues/14
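The flipped model could look roughly like the sketch below. This is an in-memory stand-in only: `ScanQueue`, `next_job`, `submit`, and `worker_poll_once` are hypothetical names, and in practice the polling would go over HTTP between scancode.io and purldb rather than through direct method calls.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ScanQueue:
    """Hypothetical purldb-side queue of packages waiting to be scanned."""

    pending: deque = field(default_factory=deque)
    in_progress: set = field(default_factory=set)
    results: dict = field(default_factory=dict)

    def enqueue(self, purl: str) -> None:
        # Skip packages that are already queued, being scanned, or done.
        known = purl in self.pending or purl in self.in_progress or purl in self.results
        if not known:
            self.pending.append(purl)

    def next_job(self) -> Optional[str]:
        """Called by the scanner when it polls for work."""
        if not self.pending:
            return None
        purl = self.pending.popleft()
        self.in_progress.add(purl)
        return purl

    def submit(self, purl: str, scan_result: dict) -> None:
        """Called by the scanner to send results back."""
        if purl not in self.in_progress:
            raise KeyError(f"no scan in progress for {purl}")
        self.in_progress.discard(purl)
        self.results[purl] = scan_result


def worker_poll_once(queue: ScanQueue, scan: Callable[[str], dict]) -> bool:
    """One polling cycle on the scanner side; returns False if idle."""
    purl = queue.next_job()
    if purl is None:
        return False
    queue.submit(purl, scan(purl))
    return True
```

The point of the design is that purldb only records what it wants scanned, while the scanner decides when to pick work up.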

JonoYang commented 1 year ago

These are my notes for an on-demand queue for requesting package information. This queue would take in a package URL, figure out which handler works for that package URL, and then look at the upstream repo for package information.

This is a queue that allows us to:

How does (should) it work:

How should we start:

https://repo1.maven.org/maven2/yom/yom/1.0-alpha-1/

https://repo1.maven.org/maven2/<namespace>/<name>/<version>/<name>-<version>-<classifier>.jar
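A small sketch of filling in that URL pattern. The function name `maven_jar_url` is hypothetical. It assumes the standard Maven repository layout, where dots in the namespace (group id) become path separators, and it only builds the URL without checking that the file exists upstream.

```python
from typing import Optional

MAVEN_BASE = "https://repo1.maven.org/maven2"


def maven_jar_url(
    namespace: str,
    name: str,
    version: str,
    classifier: Optional[str] = None,
    base: str = MAVEN_BASE,
) -> str:
    """Build a JAR download URL from Maven coordinates."""
    # Standard Maven layout: "org.example" is stored under "org/example".
    group_path = namespace.replace(".", "/")
    # The classifier (for example "sources") is optional.
    suffix = f"-{classifier}" if classifier else ""
    return f"{base}/{group_path}/{name}/{version}/{name}-{version}{suffix}.jar"
```

For example, `maven_jar_url("yom", "yom", "1.0-alpha-1", "sources")` produces the URL where the sources jar would be expected under the directory shown above.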

We would first create a new Package in the packagedb with the initial information from maven, then we would download the sources jar of the package (if available), create a Package and index it. We would also have to do some sort of summarization (maybe use the license clarity score plugin?) on the scanned sources jar in order to get a copyright and license.
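To make the "queue figures out which handler works for the package URL" idea concrete, here is a hedged sketch of a handler registry keyed by purl type. Every name in it (`HANDLERS`, `register`, `process`, `handle_maven`) is an assumption for illustration, and the Maven handler only returns the planned steps as data instead of doing real downloads or scans.

```python
from typing import Callable, Dict

Handler = Callable[[str], dict]

# Hypothetical registry: purl type -> handler function.
HANDLERS: Dict[str, Handler] = {}


def register(purl_type: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a handler for one purl type."""
    def decorator(func: Handler) -> Handler:
        HANDLERS[purl_type] = func
        return func
    return decorator


def get_purl_type(purl: str) -> str:
    """Extract the type from a purl such as "pkg:maven/yom/yom@1.0"."""
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a package URL: {purl}")
    return purl[len("pkg:"):].split("/", 1)[0].lower()


@register("maven")
def handle_maven(purl: str) -> dict:
    # Placeholder: lists the steps from the notes instead of running them.
    return {
        "purl": purl,
        "steps": [
            "create Package from upstream Maven metadata",
            "download sources jar if available",
            "index the sources jar",
            "summarize license and copyright",
        ],
    }


def process(purl: str) -> dict:
    """Dispatch one queued purl to the handler for its type."""
    handler = HANDLERS.get(get_purl_type(purl))
    if handler is None:
        raise LookupError(f"no handler for {purl}")
    return handler(purl)
```

Supporting another ecosystem would then mean registering one more handler, without touching the queue itself.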

DennisClark commented 1 year ago

@JonoYang all this looks great! Question please: is the "ResourceURI" the same concept as the "Inferred URL" that we can see in scan results? or does it include other cases? (I guess I am asking for a somewhat precise definition of a ResourceURI.)

JonoYang commented 1 year ago

@DennisClark

ResourceURI is different than inferred URLs that are generated from purls.

ResourceURI is a model that represents a Resource from the internet that you can download. The values of these URIs are usually the download URLs for Packages. In special cases, URIs could point to an upstream repo's index, like https://repo1.maven.org/maven2/.index/nexus-maven-repository-index.gz or https://replicate.npmjs.com/registry/_changes?include_docs=true&limit=1000&since=0. These are called seed URIs.

In purldb, we have visitors and mappers that work on these ResourceURIs. Visitors visit the seed URIs and create ResourceURIs for the packages they find listed in the repo index. Mappers take the ResourceURIs created by visitors and create new entries in the PackageDB for each ResourceURI.
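Very roughly, the visitor and mapper split could be pictured like this. The dataclasses and functions below are simplified stand-ins, not the actual purldb models, and the `listing` argument is an assumption that stands in for a parsed repo index.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ResourceURI:
    """Simplified stand-in for the ResourceURI model."""
    uri: str
    is_seed: bool = False


@dataclass(frozen=True)
class Package:
    """Simplified stand-in for a PackageDB entry."""
    download_url: str


def visit(seed: ResourceURI, listing: Iterable[str]) -> List[ResourceURI]:
    """Visitor: turn the packages listed by a seed URI into ResourceURIs."""
    if not seed.is_seed:
        raise ValueError("visitors start from a seed URI")
    return [ResourceURI(uri=download_url) for download_url in listing]


def map_uris(resource_uris: Iterable[ResourceURI]) -> List[Package]:
    """Mapper: create one Package entry per non-seed ResourceURI."""
    return [Package(download_url=r.uri) for r in resource_uris if not r.is_seed]
```

A run would start from a seed such as the Maven index URL, pass the download URLs found in it to `visit`, and hand the resulting ResourceURIs to `map_uris`.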

DennisClark commented 1 year ago

@JonoYang thanks, that all makes sense!

JonoYang commented 7 months ago

We have updated the scan queue in https://github.com/nexB/purldb/issues/285