anchore / syft

CLI tool and library for generating a Software Bill of Materials from container images and filesystems
Apache License 2.0

python cataloger: adding a support additionally to classify licenses by `License-File` field in metadata file #2923

Open Annamikhlin opened 1 month ago

Annamikhlin commented 1 month ago

What would you like to be added: Today the Python metadata cataloger finds licenses only by reading the License field of local packaging manifests, in the following files: https://github.com/anchore/syft/blob/fe0b78b7fe73b92ad76deed288d3b9b091a14d27/syft/pkg/cataloger/python/cataloger.go#L39-L42 The Python cataloger can already look in additional sibling files that the metadata file references. Please add support for classifying licenses via the License-File field as well.

Why is this needed: In our case, the SBOM scan report (cyclonedx-json format) shows the license as "UNKNOWN":

{
      "bom-ref": "pkg:pypi/scylla-api-client@1.0?package-id=c14a69f4da463c44",
      "type": "library",
      "author": "ScyllaDB",
      "name": "scylla-api-client",
      "version": "1.0",
      "licenses": [
        {
          "license": {
            "name": "UNKNOWN"
          }
        }
      ],

According to `cat ./venv/lib/python3.11/site-packages/scylla_api_client-1.0.dist-info/METADATA | grep License`, the license declaration appears under the License-File field:

License: UNKNOWN
License-File: LICENSE.AGPL

wagoodman commented 1 month ago

I think this is a great candidate for something similar to what was proposed in #656. The difference is that #656 is about a standalone cataloger for licenses that would be attached to file objects in the SBOM. This issue is more about enriching license information on the existing package object, which I think is more generally useful, so it should be prioritized ahead of #656.

I think we could leverage https://github.com/google/licenseclassifier to do a lot of the heavy lifting here.
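
As a rough illustration of what "classifying" a referenced license file means, here is a minimal Python sketch that uses naive phrase matching. It is a stand-in for discussion only: it does not use licenseclassifier or its API, and the `PHRASES` table, the labels in it, and the `guess_license` name are all hypothetical.

```python
# Illustrative phrase table only (hypothetical labels); a real classifier
# compares the full license text against known license bodies.
PHRASES = [
    ("AGPL-3.0", "gnu affero general public license version 3"),
    ("Apache-2.0", "apache license version 2.0"),
    ("MIT", "permission is hereby granted, free of charge"),
]


def guess_license(text: str) -> str:
    """Return the first label whose phrase occurs in the text, else UNKNOWN."""
    # Lowercase and collapse whitespace so line breaks in the file do not matter.
    normalized = " ".join(text.lower().split())
    for label, phrase in PHRASES:
        if phrase in normalized:
            return label
    return "UNKNOWN"
```

A text-similarity classifier would replace the phrase lookup here; the point is only that the input is the content of the file named by License-File, and the output ends up on the existing package object.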

mmarseu commented 2 weeks ago

To add some context, the License-File field and the similar License-Expression field are both part of PEP 639, which is currently still in draft status.
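
For reference, a minimal Python sketch of how License-File entries could be read from a METADATA file and resolved to files on disk. This is not how syft implements it (syft is written in Go); the `license_files` helper is hypothetical, and the two lookup locations (the `.dist-info` directory itself and a `licenses/` subdirectory) are assumptions about where build tools place these files.

```python
from email.parser import Parser
from pathlib import Path


def license_files(dist_info: Path) -> list[Path]:
    """Return existing files named by License-File fields in METADATA."""
    text = (dist_info / "METADATA").read_text(encoding="utf-8")
    # METADATA uses RFC 822-style headers, so the stdlib email parser can read it.
    msg = Parser().parsestr(text, headersonly=True)
    found = []
    # License-File may appear several times; get_all returns None if absent.
    for name in msg.get_all("License-File") or []:
        # Assumption: the file sits either directly in the .dist-info
        # directory or in a licenses/ subdirectory of it.
        for candidate in (dist_info / name, dist_info / "licenses" / name):
            if candidate.is_file():
                found.append(candidate)
                break
    return found
```

For the package in this issue, calling `license_files` on the `scylla_api_client-1.0.dist-info` directory would be expected to return the path to `LICENSE.AGPL`, provided that file was installed alongside METADATA.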