RFC: Define and develop scoring elements for SCA Clarity

aboutcode-org / scancode.io

ScanCode.io is a server to script and automate software composition analysis with ScanPipe pipelines. This project is sponsored by the NLnet project https://nlnet.nl/project/vulnerabilitydatabase/, Google Summer of Code, nexB and other generous sponsors!

https://scancodeio.readthedocs.io

Apache License 2.0


RFC: Define and develop scoring elements for SCA Clarity #1102

Open DennisClark opened 8 months ago

DennisClark commented 8 months ago

We need to define the scoring elements (criteria) and their weighting factors for evaluating the quality of scan results. The working name is "SCA Clarity", and the idea is roughly equivalent to our scoring elements for license clarity on a specific project. To get things started, I would suggest that some major elements would be:

element: number-of-exact-licenses-detected description: the number of licenses detected with an exact license key match.

element: number-of-unknown-licenses-detected description: the number of licenses detected with no exact license key match.

element: percentage-of-exact-licenses-detected description: a percentage of all the license detections that identify specific license keys, as opposed to unknown license references where the text is not matched precisely to a known license.
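A rough sketch of how these three elements could be computed from a scan. It assumes each license detection has already been reduced to a single license key string, and that any key starting with "unknown" counts as a non-exact detection. Both are simplifications for illustration, and the function name is hypothetical, not an existing ScanCode.io API.

```python
from typing import Iterable


def license_clarity_elements(license_keys: Iterable[str]) -> dict:
    """Count exact vs unknown license detections and the exact percentage."""
    keys = list(license_keys)
    # Assumption: a key starting with "unknown" marks a non-exact detection.
    unknown = sum(1 for key in keys if key.startswith("unknown"))
    exact = len(keys) - unknown
    # Avoid dividing by zero when no license was detected at all.
    percentage = round(100 * exact / len(keys), 1) if keys else 0.0
    return {
        "number-of-exact-licenses-detected": exact,
        "number-of-unknown-licenses-detected": unknown,
        "percentage-of-exact-licenses-detected": percentage,
    }
```

Whether an empty scan should report 0% or no value at all is an open design choice; the sketch reports 0.0.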

More ideas and comments are welcome.

DennisClark commented 8 months ago

other elements could be:

element: number_of_copyrights_detected description: the number of copyright statements detected in a scan

element: number_of_authors_detected description: the number of authors (contributors) detected in a scan

element: number_of_packages_detected description: the number of packages detected in a scan that can be identified by a valid PURL
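For illustration, these counts could look roughly like the sketch below. The shape of the `scan` dict (keys `copyrights`, `authors`, `packages`, each package carrying a `purl` field) is an assumption, and `looks_like_purl` is a deliberately simplified, hypothetical check; a real implementation would more likely rely on a proper PURL parser.

```python
def looks_like_purl(value: str) -> bool:
    """Simplified PURL check: 'pkg:' scheme, then a non-empty type and name."""
    if not value.startswith("pkg:"):
        return False
    # Drop qualifiers and subpath, keep "type/namespace/name@version".
    body = value[len("pkg:"):].split("?")[0].split("#")[0]
    purl_type, _, rest = body.partition("/")
    return bool(purl_type) and bool(rest.strip("/"))


def count_detected(scan: dict) -> dict:
    """Count copyrights, authors, and packages that carry a usable PURL."""
    packages = scan.get("packages", [])
    return {
        "number_of_copyrights_detected": len(scan.get("copyrights", [])),
        "number_of_authors_detected": len(scan.get("authors", [])),
        # Only packages with a PURL that passes the simplified check are counted.
        "number_of_packages_detected": sum(
            1 for package in packages if looks_like_purl(package.get("purl") or "")
        ),
    }
```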

DennisClark commented 7 months ago

we might also add:

element: number_of_dependencies_detected description: the number of dependencies identified by inspecting the files that specify other software (usually third-party) required by the project codebase being scanned

mjherzog commented 7 months ago

Some comments:

The scoring needs to be more sophisticated for the detection of packages or other units of software. This is Task 1 in scanning (and for any SBOM). Software units (programs or source) that are not packages may be somewhat analogous to unknown licenses, but not sure that "unknown packages" is a good name.
The scoring for dependencies needs more research to incorporate scope and origin of the dependency (manifest, lock file or other).
We might want to call this SCA Clarity.

DennisClark commented 7 months ago

I like "SCA Clarity". Let's use that term for this.

DennisClark commented 7 months ago

I think we have enough elements identified now to move ahead with some kind of SCA Clarity support in SCIO.

Should this be a standard feature that does not require setting a specific option when running a scan? I think yes, but other thoughts on that are welcome here.

mjherzog commented 7 months ago

We need to order this so that the clarity of the SBOM contents (software units) is scored separately from the clarity of origin and license information for those software units.
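One way to picture that separation is to group the proposed elements into two buckets before any scoring happens. Which element belongs in which bucket is an assumption in this sketch (packages and dependencies as software units, everything else as origin and license information), and the names `SOFTWARE_UNIT_ELEMENTS` and `split_elements` are hypothetical.

```python
# Assumption: these elements describe the SBOM contents (software units).
SOFTWARE_UNIT_ELEMENTS = {
    "number_of_packages_detected",
    "number_of_dependencies_detected",
}


def split_elements(elements: dict) -> dict:
    """Group element values so each group can be scored on its own."""
    grouped = {"software_units": {}, "origin_and_license": {}}
    for name, value in elements.items():
        if name in SOFTWARE_UNIT_ELEMENTS:
            grouped["software_units"][name] = value
        else:
            # Everything else is treated as origin or license information.
            grouped["origin_and_license"][name] = value
    return grouped
```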

DennisClark commented 7 months ago

A further refinement is probably needed. My original suggestion of

element: number_of_packages_detected description: the number of packages detected in a scan that can be identified by a valid PURL

should perhaps be broken down into two types to support container analysis:

element: number_of_system_packages_detected description: the number of packages detected in a scan that can be identified by a valid PURL that originate from a distro or distro repo

element: number_of_application_packages_detected description: the number of packages detected in a scan that can be identified by a valid PURL that do not originate from a distro or distro repo
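A minimal sketch of that split, assuming the PURL type alone is enough to tell distro packages apart. Treating `deb`, `rpm`, `apk` and `alpm` as the distro types is an assumption for illustration, and the helper names are hypothetical.

```python
# Assumption: these PURL types indicate a package from a distro or distro repo.
DISTRO_PURL_TYPES = {"deb", "rpm", "apk", "alpm"}


def purl_type(purl: str) -> str:
    """Return the type segment of a PURL, or '' if it is not a PURL."""
    if not purl.startswith("pkg:"):
        return ""
    return purl[len("pkg:"):].split("/", 1)[0].lower()


def count_packages_by_origin(purls) -> dict:
    """Split PURL-identified packages into system and application counts."""
    types = [purl_type(purl) for purl in purls]
    # Strings that are not PURLs have an empty type and are not counted.
    identified = [t for t in types if t]
    system = sum(1 for t in identified if t in DISTRO_PURL_TYPES)
    return {
        "number_of_system_packages_detected": system,
        "number_of_application_packages_detected": len(identified) - system,
    }
```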

DennisClark commented 7 months ago

It is probably best to do the counting of the data in a new pipeline, compute-sca-clarity.

DennisClark commented 5 months ago

we might also add a negative element:

element: number_of_misleading_matches_reported description: the number of matches (snippet or whole file) that are not quite accurate or do not add meaningful value.
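To show how a negative element could interact with the weighting factors mentioned at the top of this issue, here is a small sketch. The weights are hypothetical placeholders only (no factors have been agreed on), and a plain weighted sum is just one possible way to combine the elements.

```python
# Hypothetical weights, placeholders only: the real factors are still to be defined.
EXAMPLE_WEIGHTS = {
    "number_of_packages_detected": 2.0,
    "number_of_copyrights_detected": 1.0,
    # A negative weight makes misleading matches lower the total.
    "number_of_misleading_matches_reported": -3.0,
}


def weighted_total(elements: dict, weights: dict = EXAMPLE_WEIGHTS) -> float:
    """Sum each element value multiplied by its weight; unweighted elements add 0."""
    return sum(weights.get(name, 0.0) * value for name, value in elements.items())
```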