`inspect_packages` pipeline takes a long time to run

JonoYang commented 1 month ago

I am running the inspect_packages pipeline on a very large codebase that is an npm package. The pipeline spends a very long time in the scan_for_application_packages step, specifically in the package assembly portion of package scanning (https://github.com/aboutcode-org/scancode.io/blob/main/scanpipe/pipes/scancode.py#L459), where we run this code (https://github.com/aboutcode-org/scancode-toolkit/blob/develop/src/packagedcode/npm.py#L82).

Since this codebase is an npm package, we consider pretty much all files in it to be part of the npm package. The assembly code we run was originally written for a scancode-toolkit (sctk) codebase. Walking a scancode.io (scio) codebase with sctk code is not performant because each call to a codebase traversal method is an individual query to the database. These methods are called multiple times, and .save() is called on each Resource when performing package assembly.
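
To illustrate the cost pattern described above, here is a minimal sketch that uses a toy in-process store instead of the real database. `CountingStore`, `assign_package`, and the example purl are hypothetical names made up for this illustration, not ScanCode.io or scancode-toolkit APIs. Each traversal call and each save is counted as one round trip.

```python
class CountingStore:
    """Toy stand-in for a database-backed resource table (hypothetical).

    Every children() or save() call is counted as one database round trip.
    """

    def __init__(self, paths):
        self.children_by_parent = {}
        for path in paths:
            parent = path.rpartition("/")[0]
            self.children_by_parent.setdefault(parent, []).append(path)
        self.package_by_path = {}
        self.queries = 0

    def children(self, path):
        self.queries += 1  # one SELECT per traversal call
        return list(self.children_by_parent.get(path, []))

    def save(self, path, package):
        self.queries += 1  # one UPDATE per resource
        self.package_by_path[path] = package


def assign_package(store, root, package):
    """Walk the tree under root and tag every resource with the package."""
    stack = [root]
    while stack:
        path = stack.pop()
        store.save(path, package)
        stack.extend(store.children(path))


paths = ["pkg", "pkg/package.json", "pkg/lib", "pkg/lib/a.js", "pkg/lib/b.js"]
store = CountingStore(paths)
assign_package(store, "pkg", "pkg:npm/example@1.0.0")

# Two round trips per resource: one to list children, one to save.
print(store.queries, "round trips for", len(paths), "resources")
```

With this access pattern the number of round trips grows linearly with the number of files, and a real assembly step that walks the tree more than once would multiply it further.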

JonoYang commented 1 month ago

An idea to speed things up would be to perform the package assembly step in memory by creating a commoncode.resource.Codebase object and using that instead of the Project.
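
A rough sketch of that idea, using the standard library `sqlite3` module as a stand-in for the project database rather than the real `Project` or `commoncode.resource.Codebase` classes. The table layout, column names, and purl are assumptions for illustration only. The point is the shape: one read to load the tree, assembly on plain Python objects, one batched write at the end.

```python
import sqlite3

# Stand-in for the project database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resource (path TEXT PRIMARY KEY, package TEXT)")
conn.executemany(
    "INSERT INTO resource (path) VALUES (?)",
    [("pkg",), ("pkg/package.json",), ("pkg/lib",), ("pkg/lib/a.js",)],
)

# 1. Load every resource path with a single query.
all_paths = [row[0] for row in conn.execute("SELECT path FROM resource")]

# 2. Do the assembly in memory: every file under the package root
#    is treated as part of the npm package.
root = "pkg"
purl = "pkg:npm/example@1.0.0"
assignments = [
    (purl, path)
    for path in all_paths
    if path == root or path.startswith(root + "/")
]

# 3. Write the results back in one batch inside a single transaction,
#    instead of saving each resource as it is visited.
with conn:
    conn.executemany(
        "UPDATE resource SET package = ? WHERE path = ?", assignments
    )

rows = conn.execute("SELECT path, package FROM resource").fetchall()
print(rows)
```

In a Django project the batched write would more likely go through something like `bulk_update`, but whether that fits the existing models here is an open question.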

pombredanne commented 1 month ago

@JonoYang Another idea would be to have an option to skip package assembly entirely when it is not needed, say when the goal is to populate the PurlDB afterwards.

JonoYang commented 1 month ago

The problem with not running the package assembly step is that DiscoveredPackages are not created (https://github.com/aboutcode-org/scancode.io/blob/main/scanpipe/pipes/scancode.py#L470). We are failing some tests because the top-level package is not created (https://github.com/aboutcode-org/scancode.io/blob/main/scanpipe/tests/test_pipelines.py#L798).

aboutcode-org / scancode.io

`inspect_packages` pipeline takes a long time to run #1398