anchore / syft

CLI tool and library for generating a Software Bill of Materials from container images and filesystems
Apache License 2.0

feat: dpkg license improvement for non SPDX licenses #3090

Open spiffcs opened 1 month ago

spiffcs commented 1 month ago

What happened: Sometimes syft encounters a dpkg copyright file where the regular expressions used to match on its contents cannot correctly identify the license.

In the following example we should find things like:

NVIDIA Software License Agreement and CUDA Supplement to Software License Agreement

Reads contents of copyright

https://github.com/anchore/syft/blob/ca945d16e0949a41aa8786f55d21908242b224c8/syft/pkg/cataloger/debian/package.go#L252-L276

Sends contents for parsing

https://github.com/anchore/syft/blob/ca945d16e0949a41aa8786f55d21908242b224c8/syft/pkg/cataloger/debian/package.go#L101-L106

Searches for license clause

https://github.com/anchore/syft/blob/48f1e975f05183390d7c01718865f5f66e3f9012/syft/pkg/cataloger/debian/parse_copyright.go#L22-L41
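To illustrate why a free-form copyright file can fall through this flow, here is a minimal sketch of a pattern-based search. The two patterns and the `findLicenseClauses` name are simplified assumptions for illustration, not a copy of what `parse_copyright.go` contains at the linked commit.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// Simplified, assumed patterns: a machine-readable "License:" field and a
// reference to a file under /usr/share/common-licenses.
var (
	licenseField      = regexp.MustCompile(`^License:\s*(\S+)`)
	commonLicensePath = regexp.MustCompile(`/usr/share/common-licenses/([0-9A-Za-z_.+\-]+)`)
)

// findLicenseClauses returns the sorted, de-duplicated license names that the
// patterns above can extract from the contents of a copyright file.
func findLicenseClauses(copyright string) []string {
	seen := map[string]struct{}{}
	scanner := bufio.NewScanner(strings.NewReader(copyright))
	for scanner.Scan() {
		line := scanner.Text()
		if m := licenseField.FindStringSubmatch(line); m != nil {
			seen[m[1]] = struct{}{}
		}
		if m := commonLicensePath.FindStringSubmatch(line); m != nil {
			// Drop a trailing period picked up from the end of a sentence.
			seen[strings.TrimRight(m[1], ".")] = struct{}{}
		}
	}
	out := make([]string, 0, len(seen))
	for name := range seen {
		out = append(out, name)
	}
	sort.Strings(out)
	return out
}

func main() {
	structured := "Files: *\nLicense: GPL-2+\n"
	freeForm := "NVIDIA Software License Agreement and CUDA Supplement to\nSoftware License Agreement\n"

	fmt.Println(findLicenseClauses(structured)) // [GPL-2+]
	fmt.Println(findLicenseClauses(freeForm))   // []
}
```

The second input has no `License:` field and no common-licenses path, so a search like this returns nothing even though the text clearly names a license.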

What you expected to happen: When a copyright file is found, at least some license information should be created for the package. Reporting no licenses is a bug.
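One possible shape for that expectation is a fallback that still emits an entry when pattern matching finds nothing. This is only a sketch: `LicenseEntry`, `licensesOrFallback`, and the choice of the first non-empty line plus a SHA-256 of the contents are hypothetical and are not part of syft's API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// LicenseEntry is a hypothetical record; it is not syft's license type.
type LicenseEntry struct {
	Value       string // identifier found by pattern matching, if any
	Name        string // free-form name taken from the file when nothing matched
	ContentHash string // SHA-256 of the trimmed contents, for later lookup
}

// licensesOrFallback returns one entry per matched identifier. If nothing
// matched but the copyright file has content, it returns a single entry
// derived from the content so the package is never left without a license.
func licensesOrFallback(matched []string, copyright string) []LicenseEntry {
	if len(matched) > 0 {
		entries := make([]LicenseEntry, 0, len(matched))
		for _, id := range matched {
			entries = append(entries, LicenseEntry{Value: id})
		}
		return entries
	}
	trimmed := strings.TrimSpace(copyright)
	if trimmed == "" {
		return nil // an empty file carries no license information
	}
	sum := sha256.Sum256([]byte(trimmed))
	firstLine := strings.TrimSpace(strings.SplitN(trimmed, "\n", 2)[0])
	return []LicenseEntry{{
		Name:        firstLine,
		ContentHash: hex.EncodeToString(sum[:]),
	}}
}

func main() {
	text := "NVIDIA Software License Agreement and CUDA Supplement to\nSoftware License Agreement\n"
	for _, e := range licensesOrFallback(nil, text) {
		fmt.Printf("%+v\n", e)
	}
}
```

Keeping a hash of the contents alongside the name would let a later step map the entry to a known non-SPDX license without re-reading the file.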

Steps to reproduce the issue:

syft -o json nvidia/cuda:12.5.1-cudnn-runtime-ubuntu20.04 | grant list -o json | jq -r '.results[] | [.license.license_id, .license.name] | @csv' | sed 's/"//g'

spiffcs commented 1 month ago

I've tracked down a couple of data sources that syft could use to identify non-SPDX licenses. I'm currently looking at ways to incorporate them into license identification when generating the SBOM:

https://github.com/nexB/scancode-toolkit
https://github.com/nexB/scancode-licensedb
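As a rough idea of how a license database could be used here, the sketch below matches known license titles against the copyright text. The `matchTitles` helper and the `LicenseRef-example-*` identifiers are made up for illustration; the real keys and data layout would come from whichever source is adopted, and real detection would likely need more than substring matching.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// normalize lowercases the text and collapses all whitespace runs, so titles
// wrapped across lines in a copyright file still match.
func normalize(s string) string {
	return strings.Join(strings.Fields(strings.ToLower(s)), " ")
}

// matchTitles returns the sorted identifiers of every known title that
// appears in the copyright text. titles maps a license title to an identifier.
func matchTitles(copyright string, titles map[string]string) []string {
	text := normalize(copyright)
	var ids []string
	for title, id := range titles {
		if strings.Contains(text, normalize(title)) {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids
}

func main() {
	// Hypothetical index entries; the identifiers are placeholders.
	titles := map[string]string{
		"NVIDIA Software License Agreement":             "LicenseRef-example-nvidia-sla",
		"CUDA Supplement to Software License Agreement": "LicenseRef-example-cuda-supplement",
	}
	text := "NVIDIA Software License Agreement and CUDA Supplement to\nSoftware License Agreement\n"
	fmt.Println(matchTitles(text, titles))
}
```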