aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://aboutcode.org/scancode/

Review comparative scan reports #3528

Open pombredanne opened 1 year ago

pombredanne commented 1 year ago

See this article by @Mariuxdeangelo https://mariuxdeangelo.gitlab.io/website/#/post/20230924-SBOM-dependency-semantics-SPDX-and-CycloneDx

The scancode results are not great. We can do better!

@Mariuxdeangelo would you mind sharing the URLs of the image and archive you used? Also, which version of ScanCode did you use: the Toolkit or ScanCode.io? Thanks!

Mariuxdeangelo commented 1 year ago

Hey, thanks for reaching out. I'm happy to help. I'm currently working on some research related to SBOMs, so I've pinned the versions of the SBOM generators I use to get comparable results.

I've implemented a web tool to compare the results and see the generation details. (Currently it's very slow because that thing runs on a potato. I'm working on adding some caching to make it faster.) https://sbom.seclab.cs.hm.edu/#/

Here is a link to the Jenkins project view, where you can see how ScanCode performed compared to other projects and phases: https://sbom.seclab.cs.hm.edu/#/project/43/dependencies

The website is just a little side hustle of mine, but I hope it helps. If there are questions on how this site works, please feel free to reach out.

What made ScanCode very hard for me to use is that it consumes a lot of resources and takes a long time to scan a big project like Keycloak.

Here is the general form of the command I used for generation: "scancode -clpi -n 10 --cyclonedx /path/to/output.json /path/to/sources"
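For anyone wanting to repeat this across many projects, here is a minimal Python sketch that wraps the command above. The helper names `build_scancode_command` and `run_scan` are made up for illustration and are not part of ScanCode; the flags are the ones quoted in the command (`-c` copyright, `-l` license, `-p` package, `-i` info, `-n` processes, `--cyclonedx` output file), and the sketch assumes a `scancode` executable is on the PATH.

```python
import shlex
import subprocess


def build_scancode_command(sources, output, processes=10):
    """Return the argv list for the scan command quoted in this thread.

    Hypothetical helper for illustration; only the flags come from the thread.
    """
    return [
        "scancode",
        "-clpi",                     # copyright, license, package and file info scans
        "-n", str(processes),        # number of parallel scan processes
        "--cyclonedx", str(output),  # write CycloneDX output to this file
        str(sources),                # directory (or file) to scan
    ]


def run_scan(sources, output, processes=10):
    """Run the scan and raise CalledProcessError if scancode exits non-zero."""
    cmd = build_scancode_command(sources, output, processes)
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder paths, as in the original command.
    print(shlex.join(build_scancode_command("/path/to/sources", "/path/to/output.json")))
```

Lowering `processes` is one way to trade scan time for a smaller resource footprint on a constrained machine.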

Version information

ScanCode version: 32.0.6
ScanCode Output Format version: 3.0.0
SPDX License list version: 3.21

pombredanne commented 1 year ago

@Mariuxdeangelo thanks for replying and joining the discussion!

https://sbom.seclab.cs.hm.edu/#/project/43/dependencies is awesome!

Some background:

ScanCode toolkit (SCTK) is somewhat unique in that it is the only tool doing extensive license and copyright detection rather than only a package manifest scan ... and this is an expensive operation alright.

Now we also have ScanCode.io (SCIO), where we script complex scans (including full Docker images) as pipelines, and this is better suited for images alright! It embeds SCTK.

SCTK and SCIO do things differently: SCTK is a CLI-only tool that just trucks and grinds through a codebase in memory, while SCIO performs steps as needed, following a script, and stores everything in the backing DB.

In SCIO, you can scan docker://jenkins/jenkins:latest and get better results than anything in SCTK.

In addition, we have PurlDB (https://github.com/nexB/purldb/), where we can do matching against indexed FOSS packages; it is being rolled into the SCIO pipelines.

Here is an example of a SCIO screenshot BTW running a docker pipeline on your jenkins image:

[Screenshot, 2023-09-28: ScanCode.io results for jenkins/jenkins:latest]

Not perfect yet but getting there.

This does not include any matching against the PurlDB though

PS: you may not know that purls started in ScanCode ;)

Mariuxdeangelo commented 1 year ago

Thanks for the insights. I will look into that as soon as I can.

Scanning all files of a project is definitely a cool idea and, of course, uses some resources. You're not only working on SBOMs; there are other use cases where you use that data. It was only an issue for me because I was running ScanCode on over 100 fairly large projects with limited resources. I still have some ideas on my bucket list for what I want to do with ScanCode.