Open hiwit opened 3 years ago
@hiwit Thank you for the report!
Can you provide a few more details?
Please confirm you are using Python 3.6.
What is your CPU model? How many CPU cores?
How much RAM do you have?
When you say "install some more npm packages", can you be more explicit? There could be 10 or 10,000 of these, as npm can have a hidden, "iceberg"-like tree of dependencies. Could you run this in your test app directory: find . -name "package.json" | wc -l
to find out how many packages there are, and find . -type f | wc -l
to find out how many files you have.
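The two counts suggested above can be tried out on a toy tree first; the paths below are made up purely for illustration:

```shell
# Toy example (made-up paths): count package.json files and total files,
# mirroring the two `find` commands suggested above.
tmp=$(mktemp -d)
mkdir -p "$tmp/node_modules/a" "$tmp/node_modules/b"
touch "$tmp/package.json" "$tmp/node_modules/a/package.json" "$tmp/node_modules/b/package.json"
pkg_count=$(find "$tmp" -name "package.json" | wc -l | tr -d ' ')
file_count=$(find "$tmp" -type f | wc -l | tr -d ' ')
echo "packages: $pkg_count, files: $file_count"
rm -rf "$tmp"
```

On a real test app you would run the same two `find` commands from the project root instead of a temporary directory.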
Have you tried a plain --package scan first, for a start?
Note that --html-app is deprecated; you should consider using JSON output and the Workbench instead.
When you say "docker based", do you mean you run Docker on macOS? That's several levels of indirection, as this is running a VM that is running Docker that is running ScanCode. ScanCode runs natively on macOS, FWIW.
Can you paste the time and file-count stats printed at the end of your run?
Hi, thanks for your quick reply.
"dependencies": {
  "@types/clipboard": "^2.0.1",
  "@types/cookie": "^0.3.3",
  "@types/jest": "26.0.15",
  "@types/lodash.replace": "^4.1.6",
  "@types/node": "12.0.8",
  "@types/react-dom": "16.8.4",
  "@types/react-router": "^5.1.8",
  "@types/react-router-dom": "^4.3.4",
  "@types/uuid": "^8.3.0",
  "ajv": "^6.12.6",
  "clipboard": "^2.0.6",
  "core-js": "^3.7.0",
  "date-fns": "^2.16.1",
  "i18next": "^17.3.1",
  "jwt-decode": "^3.1.2",
  "lodash.debounce": "^4.0.8",
  "lodash.replace": "^4.1.4",
  "query-string": "^6.13.7",
  "react": "^16.14.0",
  "react-app-polyfill": "^1.0.6",
  "react-dom": "^16.14.0",
  "react-markdown": "^5.0.3",
  "react-router": "^5.2.0",
  "react-router-dom": "^5.2.0",
  "styled-components": "^5.2.1",
  "uuid": "^3.3.3",
  "xml-js": "^1.6.11"
}
npm install --only=prod (no dev dependencies)
find . -name "package.json" | wc -l = 1337
find . -type f | wc -l = 23027
Small update: we were able to run scancode in 1362.15 s when heavily using the include option.
we filtered all text files before the scan and added the distinct filenames as a long list of includes...
find node_modules -type f -exec grep -Iq . {} \; -print
scancode --include "*.js" --include "*.cc" ... -cl node_modules
but I'm not sure how valid the result is
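The text-file filter behind the workaround above relies on `grep -Iq .`, which succeeds only for non-empty files that grep does not treat as binary. A minimal self-contained sketch (file names are invented for illustration; the actual scancode invocation is not run here):

```shell
# Demonstrate the text/binary split used by the workaround:
# `grep -Iq .` exits 0 for text files and 1 for binary files.
tmp=$(mktemp -d)
printf 'console.log(1);\n' > "$tmp/app.js"   # plain text file
printf '\000\001\002\003' > "$tmp/blob.bin"  # binary file (contains NUL bytes)
text_files=$(find "$tmp" -type f -exec grep -Iq . {} \; -print)
echo "$text_files"   # lists only app.js, not blob.bin
rm -rf "$tmp"
```

The distinct extensions of the surviving files can then be turned into a list of --include globs, as described above.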
@hiwit these includes should not be needed, but that said, 23K files in about 20 minutes on 4 processes does NOT sound too crazy.
You should try a plain --package scan for a start instead, without includes or excludes.
20 minutes is OK, I guess... :) but that was after optimization/filtering :) It seems that using the JSON output is also somewhere in that timeframe :) I'll run some more tests to validate.
@hiwit ping, any update on your side?
Hi @pombredanne, sorry for the late answer.
We're now able to generate a report in about 25 minutes in a Concourse job; that's fine. I'm not sure what did the trick in the end. Probably switching to JSON improved things a lot (filtering files was also a good idea :) ).
For me, switching to JSON did not do the trick in ORT.
Unfortunately, license texts can live in .txt, .rst, .md, LICENSE, COPYRIGHT and more files. What do all your include parameters look like, @hiwit?
@DanielRuf should I reopen this issue, or maybe it would be better if you create a new one instead, with your details and inputs, which are likely different?
@pombredanne I guess I should create a new issue, since in my case I use scancode through ORT (with and without Docker).
Inside Docker it takes about 45 minutes, outside of it only 39 minutes with empty caches. Not sure if this is slow or fast for the following setup (package.json and yarn.lock):
@DanielRuf your lock file links to 227 npm packages.
Fetching them with wget: total wall clock time: 20 s; downloaded 227 files, 8.6M in 8.9 s (994 KB/s).
Extracting them recursively with extractcode: 2m10.239s
$ find . -type f | wc -l: 11350
Scanning the package manifests on 6 cores on my laptop with scancode --package --processes 6 --json daniel.json: 6216 seconds with 227 packages and 2294 dependency instances.
Scanning all the package files in those 227 packages and 2294 dependency instances is likely to take several hours, IMHO.
@pombredanne thanks for checking this. I had a small typo: outside of Docker, on my 8-core machine (16 with hyper-threading), it took 39 (not 29) minutes; in Docker (ORT image) it was 45 minutes.
So is this a fairly normal duration? Sorry for the naive question, as I am new to ORT and scancode. I guess we cannot speed this up much besides scaling our server vertically (more CPU power)?
Will scancode 32 improve the performance further?
I have to wonder why it seems "alright" to take hours for such a small amount of data.
@DanielRuf re:
Will scancode 32 improve the performance further?
We have some improvements there, but you will have to try it out.
@wzrdtales re:
I have to wonder why it seems "alright" to take hours for such a small amount of data.
What amount do you qualify as "such a small amount of data"? Can you elaborate on this, and what would be your expectation for a specific example (with a download URL so we can reproduce)? Thanks!
Of course, for example:
https://github.com/scalatest/scalactic-website
This project, which we were evaluating on, did not finish even after 1 hour.
One of our own projects:
https://github.com/db-migrate/node-db-migrate
It took over 15 minutes, and it really doesn't have that much to scan.
A little bit of context: we are running ORT against these, as we're working on integrating ORT into a big open source platform in Germany. But we were more than surprised by the immense time everything takes, and narrowed this down to scancode.
@wzrdtales re:
https://github.com/scalatest/scalactic-website This project we were evaluating on, even after 1 hour did not finish.
https://github.com/scalatest/scalactic-website/archive/refs/heads/master.zip is over 88MB, and 617MB once extracted. There is (likely mistakenly) a lot of generated HTML doc in this, with a lot of JS to support it. It took ~40 minutes to scan using ScanCode.io and a scan_package pipeline, so this is not great, but not too bad either for an almost GB-sized codebase.
You should consider using actual real projects rather than made-up test projects to organize your evaluation. These are always biased in some weird ways that do not matter in the "real world", IMHO, like this weird inclusion of many copies of generated docs.
One of our own projects: https://github.com/db-migrate/node-db-migrate Took over 15 minutes, and it really doesn't have so much to scan for.
Its lockfile has 552 dependencies. That's like running 552 + 1 scans. This is the unfortunate state of JavaScript: too many deps! Ignoring the deps, it scans in about 30 seconds using ScanCode.io and a scan_package pipeline. Taking 15 minutes for 552 packages means that your scan took roughly 2 seconds per package. That's actually not bad at all.
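As a quick sanity check of the per-package figure quoted above, assuming the 15 minutes are spread evenly over the 552 dependencies plus the root package:

```shell
# Back-of-the-envelope: 15 minutes over 552 + 1 package scans.
total_secs=$((15 * 60))   # 900 seconds
num_pkgs=$((552 + 1))     # deps plus the root package
per_pkg=$(awk -v s="$total_secs" -v p="$num_pkgs" 'BEGIN { printf "%.1f", s / p }')
echo "$per_pkg seconds per package"   # ~1.6 s, i.e. on the order of the ~2 s quoted above
```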
Now you may not even need this at all times, and doing a lighter scan may be all that's needed in many cases.
And matching with PurlDB in the future will mean that you do not need to scan at all, but can just look up pre-scanned package details.
The thing is, we don't know what we will scan; it will be whatever projects users submit to the platform. It seems we can't really use scancode reliably for this scenario, not least because it is an attack surface for a very effective DoS attack.