Running scancode in a npm based repository takes long

aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and others generous sponsors!

https://aboutcode.org/scancode/

2.13k stars 551 forks source link

Running scancode in a npm based repository takes long #2478

Open hiwit opened 3 years ago

hiwit commented 3 years ago

Description

we want to include scancode in our pipeline but are running into timing issues
scanning a code base with node_modules for a react / npm project is very slow (>5h).

How To Reproduce

npx create-react-app test
install some more npm packages
scancode -cl -n 4 --html-app /project/result.html /project

Small workaround

we were able to reduce the time by including only text files (difficult to perform as the npm packages differ in structure / language and other types of files)

System configuration

running on macos (catalina/big sur)
scancode version 21.3.31
as an Application // scancode --help AND docker based

pombredanne commented 3 years ago

@hiwit Thank you for the report!

Can you provide a few more details?

please confirm you are using Python 3.6
What is your CPU model? how many CPU cores?
How much RAM do you have?
when you say "install some more npm packages" can you be more explicit? There could be 10 or 10,000 of these as npm can have some hidden "iceberg"-like tree of dependencies . Could you run this on your test app directory: find . -name "package.json" | wc -l to find out how many packages there are. And find . -type f | wc -l to find out how many files you have.
Have you tried to use --package scan first for a start?
Note that --html-app is deprecated, you should consider using JSON and the workbench instead
When you say "docker based" you mean you run Docker on macOS? that's several levels of indirection as this is running a VM that is running Docker that is running scancode. ScanCode runs natively on macOS FWIW.
can you paste the time and file number stats printed at the end of your run?

hiwit commented 3 years ago

Hi, thanks for your quick reply.

yes, we were using python 3.6
2,6 GHz 6-Core Intel Core i7
16 GB 2400 MHz DDR4
for our target application we have the following dependencies "dependencies": { "@types/clipboard": "^2.0.1", "@types/cookie": "^0.3.3", "@types/jest": "26.0.15", "@types/lodash.replace": "^4.1.6", "@types/node": "12.0.8", "@types/react-dom": "16.8.4", "@types/react-router": "^5.1.8", "@types/react-router-dom": "^4.3.4", "@types/uuid": "^8.3.0", "ajv": "^6.12.6", "clipboard": "^2.0.6", "core-js": "^3.7.0", "date-fns": "^2.16.1", "i18next": "^17.3.1", "jwt-decode": "^3.1.2", "lodash.debounce": "^4.0.8", "lodash.replace": "^4.1.4", "query-string": "^6.13.7", "react": "^16.14.0", "react-app-polyfill": "^1.0.6", "react-dom": "^16.14.0", "react-markdown": "^5.0.3", "react-router": "^5.2.0", "react-router-dom": "^5.2.0", "styled-components": "^5.2.1", "uuid": "^3.3.3", "xml-js": "^1.6.11" } npm install --only=prod (no dev dependencies) find . -name "package.json" | wc -l =1337 find . -type f | wc -l = 23027
actually, no. I was not sure what the package option is really for :/
will check that
we first tried it on docker (and it probably will in the pipeline), but the problem also occures on the "native" python app
without include/ignore options we were not able to complete the scan.

hiwit commented 3 years ago

small update: we were able to run scancode in 1362.15s when heavily using the include command. we filtered all text files before the scan and added the distinct filenames as a long list of includes... find node_modules -type f -exec grep -Iq . {} \; -print scancode --include "*.js" --include "*.cc" ... -cl node_modules but I'm not sure how valid the result is

pombredanne commented 3 years ago

@hiwit these includes should not be needed, but that said, 23K files in about 20 minutes and 4 processes does NOT sound too crazy.

You should try a plain --package for a start instead and without includes or excludes.

hiwit commented 3 years ago

20 minutes is ok, i guess... :) but that was after optimization / filtering :) but it seems that using the json output is also somewhere in that timeframe :) i'll run some more tests to validate

pombredanne commented 3 years ago

@hiwit ping, any update on your side?

hiwit commented 3 years ago

hi @pombredanne, sorry for the late answers.

we're now able to generate a report in about 25 minutes in a concourse job. that's fine. I'm not sure what did the trick in the end. Probably switching to json improved a lot (filtering files was also a good idea :) ).

DanielRuf commented 1 year ago

For me switching to json did not do the trick in ORT.

Unfortunately license texts can be in txt, rst, md, LICENSE, COPYRIGHT and more files. What do all your include parameters look like @hiwit?

pombredanne commented 1 year ago

@DanielRuf should I reopen this issue or may be it would be better if you create a new one instead, with your details and inputs that are likely different?

DanielRuf commented 1 year ago

@pombredanne I guess I should create a new issue since in my case I use scancode through ORT(with and without Docker).

Inside Docker it takes about 45 minutes, outside of it only 39 minutes with empty caches. Not sure if this is slow or fast for the following setup (package.json and yarn.lock):

package-files.zip

pombredanne commented 1 year ago

@DanielRuf your lock files has links to 227 npms.

Fetching them with wget: Total wall clock time: 20s Downloaded: 227 files, 8.6M in 8.9s (994 KB/s)
Extracting them recursively with extractcode: 2m10.239s
$ find . -type f | wc -l: 11350
Scanning the packages manifests on 6 cores on my laptop with scancode --package --processes 6 --json daniel.json: 6216 seconds with 227 packages and 2294 dependency instances
Scanning all the package files in 227 packages and 2294 dependency instances is likely to take several hours alright IMHO.

DanielRuf commented 1 year ago

@pombredanne thanks for checking this. I had a small typo, outside of Docker on my 8 core (16 with hyper threading) it took 39 (not 29) minutes, in Docker (ORT image) it was 45 minutes.

So this is a quite normal duration? Sorry for my naive question as I am new to ORT and scancode. I guess we can not speed this much up besides scaling our server vertically (more CPU power)?

Will scancode 32 improve the performance further?

wzrdtales commented 1 year ago

I have to wonder, why it seems "alright" to take hours for such small amount of data?

pombredanne commented 1 year ago

@DanielRuf re:

Will scancode 32 improve the performance further?

we have some improvements there but you have to try

@wzrdtales re:

I have to wonder, why it seems "alright" to take hours for such small amount of data?

What amount do you qualify as "such a small amount of data"? Can you elaborate on this and what would be your expectation for a specific example (with a download URL so we can reproduce) Thanks!

wzrdtales commented 1 year ago

Of course, for example:

https://github.com/scalatest/scalactic-website

This project we were evaluating on, even after 1 hour did not finish.

One of our own projects:

https://github.com/db-migrate/node-db-migrate

Took over 15 minutes, and it really doesn't have so much to scan for.

wzrdtales commented 1 year ago

Little bit of context, we are running ORT against these, as we're working on integrating ORT for a big open source platform in Germany. But we got more than surprised by the immense time everything takes and nailed this down to scancode.

pombredanne commented 1 year ago

@wzrdtales re:

https://github.com/scalatest/scalactic-website This project we were evaluating on, even after 1 hour did not finish.

https://github.com/scalatest/scalactic-website/archive/refs/heads/master.zip is over 88MB and 617MB once extracted. There are (likely mistakenly) a lot of generated HTML doc in this with a lot of JS to support them. It took ~ 40 minutes to scan using ScanCode.io and a scan_package pipeline, so this is not great but not too bad either for an almost GB size codebase.

You should consider using actual real projects rather than made-up test projects to organize your evaluation. These are always biased in some weird ways that does not matter in the "real world" IMHO like this weird inclusion of many copies of generated docs.

One of our own projects: https://github.com/db-migrate/node-db-migrate Took over 15 minutes, and it really doesn't have so much to scan for.

Its lockfile has 552 dependencies. That's like running 552 + 1 scans. This is the unfortunate state of JavaScript: too many deps! Ignoring the deps, it scans in about 30 seconds using ScanCode.io and a scan_package pipeline. Taking 15 minutes for 552 packages means that your scan took roughly 2 seconds per package. That's actually not bad at all.

Now you may not even need this at all times, and doing a lighter scans amy be all that's needed in many cases.

And matching with PurlDB in the future will mean that you do not need to scan at all, but just look up a pre-scanned package details.

wzrdtales commented 1 year ago

The thing is, we don't know what we will scan. This will be whatever projects users will submit to the platform. Seems like we can't really reliably use scancode for this scenario. Already for being an attack surface for a very effective DoS attack.