Open JonoYang opened 1 month ago
An idea to speed things up would be to perform the package assembly step in memory by creating a commoncode.resource.Codebase object and using that instead of the Project.
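To make the in-memory idea concrete, here is a rough sketch of the access pattern only. It does not use the real commoncode.resource.Codebase or any scanpipe model: CountingStore, fetch_all_paths, bulk_assign, assemble_in_memory and the example purl are all hypothetical stand-ins, and the "queries" counter simply counts simulated database round trips. The assumption is that the whole resource tree can be loaded once, walked in memory, and written back in one bulk operation.

```python
from collections import defaultdict
import posixpath


class CountingStore:
    """Hypothetical stand-in for the database; counts simulated round trips."""

    def __init__(self, paths):
        self.rows = {path: [] for path in paths}
        self.queries = 0

    def fetch_all_paths(self):
        # One round trip that loads every resource path at once.
        self.queries += 1
        return list(self.rows)

    def bulk_assign(self, assignments):
        # One round trip that writes every package assignment at once.
        self.queries += 1
        for path, package_uid in assignments.items():
            self.rows[path].append(package_uid)


def assemble_in_memory(store, root, package_uid):
    """Assign every resource under root to package_uid using an in-memory tree."""
    paths = store.fetch_all_paths()

    # Build a parent -> children index in memory instead of querying per directory.
    children = defaultdict(list)
    for path in paths:
        children[posixpath.dirname(path)].append(path)

    # Walk the in-memory tree; no database access happens inside this loop.
    assignments = {}
    stack = [root]
    while stack:
        current = stack.pop()
        assignments[current] = package_uid
        stack.extend(children.get(current, []))

    store.bulk_assign(assignments)
    return assignments


# Made-up example data: a tiny npm-like layout and a hypothetical purl.
example_paths = [
    "package",
    "package/package.json",
    "package/lib",
    "package/lib/index.js",
    "package/lib/util.js",
]
store = CountingStore(example_paths)
result = assemble_in_memory(store, "package", "pkg:npm/example@1.0.0")
print(store.queries)  # 2 round trips, regardless of the number of resources
```

In this toy model the number of round trips stays at two however large the tree is. Whether the real assembly code can be driven this way depends on how much of it relies on Project-backed resources.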
@JonoYang Another idea would be to have an option to skip the package assembly entirely when it is not needed, for example when the goal is to populate the PurlDB afterwards.
The problem when we don't run the package assembly step is that DiscoveredPackages are not created (https://github.com/aboutcode-org/scancode.io/blob/main/scanpipe/pipes/scancode.py#L470). We are failing some tests because we are not creating the top-level package (https://github.com/aboutcode-org/scancode.io/blob/main/scanpipe/tests/test_pipelines.py#L798).
I am running the inspect_packages pipeline on a very large codebase that is an npm package. The pipeline takes a very long time at the scan_for_application_packages step, in the package assembly portion of package scanning (https://github.com/aboutcode-org/scancode.io/blob/main/scanpipe/pipes/scancode.py#L459), where we are running this code (https://github.com/aboutcode-org/scancode-toolkit/blob/develop/src/packagedcode/npm.py#L82).
Since this codebase is an npm package, we consider pretty much all files in it to be part of the npm package. We are running code that was originally intended for a scancode-toolkit (sctk) codebase. Walking a scancode.io (scio) codebase using sctk code is not performant because each call to a codebase traversal method is an individual query to the database. These methods are called multiple times, and .save() is called on each Resource when performing package assembly.
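The cost described above can be illustrated with a toy model. This is only a sketch of the access pattern, not the actual scanpipe or scancode-toolkit code: FakeDB, children, save, assign_per_resource and the example purl are hypothetical names, and the "queries" counter stands in for real database round trips.

```python
class FakeDB:
    """Hypothetical stand-in for the database; counts simulated round trips."""

    def __init__(self, paths):
        self.rows = {path: {"path": path, "for_packages": []} for path in paths}
        self.queries = 0

    def children(self, parent):
        # Each traversal call is modelled as one query.
        self.queries += 1
        prefix = parent.rstrip("/") + "/"
        return [
            path
            for path in self.rows
            if path.startswith(prefix) and "/" not in path[len(prefix):]
        ]

    def save(self, path, package_uid):
        # Each per-resource save is modelled as one query.
        self.queries += 1
        self.rows[path]["for_packages"].append(package_uid)


def assign_per_resource(db, root, package_uid):
    """Walk the tree level by level and save every resource individually."""
    stack = [root]
    while stack:
        current = stack.pop()
        db.save(current, package_uid)
        stack.extend(db.children(current))


# Made-up example data: a tiny npm-like layout and a hypothetical purl.
paths = [
    "package",
    "package/package.json",
    "package/lib",
    "package/lib/index.js",
    "package/lib/util.js",
]
db = FakeDB(paths)
assign_per_resource(db, "package", "pkg:npm/example@1.0.0")
print(db.queries)  # 10 round trips for 5 resources
```

Even in this simplified model the number of round trips grows linearly with the number of resources (one traversal query plus one save each), which is why the step becomes slow on a very large package.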