aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://github.com/aboutcode-org/scancode-toolkit/releases/

Scanning multiple directories scans too much #3452

Open rspier opened 1 year ago

rspier commented 1 year ago

How To Reproduce

We have a giant third_party/ directory. GIANT! Scanning one package works fine. But when scanning two at once, scancode scans things outside of those directories.

$ scancode -n 129 --copyright --license --package --json /tmp/out.json  --max-in-memory 0 third_party/curl third_party/zlib
Setup plugins...
Collect file inventory...

It appears to hang there, but strace shows that it's actually scanning things outside of the curl and zlib directories, which will take a long time.

pombredanne commented 1 year ago

Ah, that's a flaw alright. When passing multiple input paths, I think that the current behaviour is to find the shared common root ancestor directory and "ignore" all parts that are not in the provided paths. That's a bad and stupid behaviour indeed.
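
For illustration only, here is a minimal Python sketch of the behaviour described in this comment: walk the shared common ancestor of the inputs, then discard everything outside the requested paths. It is not ScanCode's actual code; the function names (`walk_common_root_then_filter`, `make_tree`, `demo`) and the sample tree are hypothetical. It shows why the inventory step touches every file under third_party/ even though only two subdirectories were requested.

```python
import os
import tempfile


def make_tree(base, layout):
    """Create an empty file for each relative path in `layout` under `base`."""
    for rel in layout:
        full = os.path.join(base, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()


def walk_common_root_then_filter(paths):
    """Walk the common ancestor of `paths`; keep only files under `paths`.

    Returns (kept_files, visited_count). Hypothetical illustration,
    not the ScanCode implementation.
    """
    paths = [os.path.abspath(p) for p in paths]
    root = os.path.commonpath(paths)  # e.g. .../third_party
    kept, visited = [], 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            visited += 1  # every file under the common root is visited
            full = os.path.join(dirpath, name)
            # keep the file only if it sits under one of the requested paths
            if any(os.path.commonpath([full, p]) == p for p in paths):
                kept.append(full)
    return sorted(kept), visited


def demo():
    """Build a small made-up tree and scan only curl and zlib."""
    with tempfile.TemporaryDirectory() as base:
        base = os.path.realpath(base)
        make_tree(base, [
            os.path.join("third_party", "curl", "lib", "url.c"),
            os.path.join("third_party", "zlib", "inflate.c"),
            os.path.join("third_party", "openssl", "ssl.c"),
            os.path.join("third_party", "boost", "any.hpp"),
        ])
        inputs = [os.path.join(base, "third_party", n) for n in ("curl", "zlib")]
        kept, visited = walk_common_root_then_filter(inputs)
        return [os.path.relpath(k, base) for k in kept], visited
```

In this toy tree, `demo()` keeps two files but visits all four. With a very large third_party/ directory, the visited count is what makes the "Collect file inventory..." step look like a hang.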

pombredanne commented 1 year ago

@JonoYang @AyanSinhaMahapatra what do you think could be the way to improve this?

AyanSinhaMahapatra commented 1 year ago

@pombredanne there's the new paths you added to the Codebase model in https://github.com/nexB/commoncode/pull/42. Instead of using the include plugin to handle multiple paths, can't we use this directly? Looking into this more.

rspier commented 12 months ago

I had some time to poke at this this afternoon, and it's not straightforward.

@pombredanne @AyanSinhaMahapatra Do you have any documentation on how paths is supposed to work? If I'm understanding properly, it's intended to be a set of subdirectories of the root (common_prefix) to filter to. On the surface, this seems more complicated than just iterating over multiple directories and concatenating the results. (So I'm trying to understand the rationale.)
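
As a rough sketch of the "iterate over multiple directories and concatenate the results" alternative mentioned above, assuming plain `os.walk` and nothing from ScanCode or commoncode, the idea might look like this. The names `walk_each_input`, `make_tree` and `demo_each` are hypothetical, and the sketch ignores the single-root assumption that the real Codebase model makes.

```python
import os
import tempfile


def make_tree(base, layout):
    """Create an empty file for each relative path in `layout` under `base`."""
    for rel in layout:
        full = os.path.join(base, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()


def walk_each_input(paths):
    """Walk each input directory on its own and concatenate the file lists.

    Returns (found_files, visited_count). Hypothetical illustration only.
    """
    found, visited = set(), 0
    for p in paths:
        for dirpath, _dirs, files in os.walk(os.path.abspath(p)):
            for name in files:
                visited += 1
                # a set avoids duplicates if one input is nested in another
                found.add(os.path.join(dirpath, name))
    return sorted(found), visited


def demo_each():
    """Build a small made-up tree and walk only curl and zlib."""
    with tempfile.TemporaryDirectory() as base:
        base = os.path.realpath(base)
        make_tree(base, [
            os.path.join("third_party", "curl", "lib", "url.c"),
            os.path.join("third_party", "zlib", "inflate.c"),
            os.path.join("third_party", "openssl", "ssl.c"),
            os.path.join("third_party", "boost", "any.hpp"),
        ])
        inputs = [os.path.join(base, "third_party", n) for n in ("curl", "zlib")]
        found, visited = walk_each_input(inputs)
        return [os.path.relpath(f, base) for f in found], visited
```

Here only the files under the two requested directories are ever visited. The open design question is how the results would be presented under a single root, since the per-input walks no longer share one tree.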

It also looks like this isn't fully wired up yet. I started with commit 822cc91d895f1f, and started working through failures. There seem to be some mismatched assumptions about absolute vs relative paths and representation.

I went looking for tests for _create_resources_from_paths (which I think is where the main issues are), but there aren't any that look quite like what I'm looking for. (Although there are some for Codebase).

Anyway, wanted to reach out before I went any deeper...

Thanks-

pombredanne commented 11 months ago

> On the surface, this seems more complicated than just iterating over multiple directories and concatenating the results. (So I'm trying to understand the rationale.)

That's an inherited technical wart and debt. The original design was to say that a scan would always have a single root directory.

AyanSinhaMahapatra commented 11 months ago

Related: https://github.com/nexB/commoncode/issues/35