anchore / syft

CLI tool and library for generating a Software Bill of Materials from container images and filesystems
Apache License 2.0

`License` field in Python package metadata could be name or full text #2969

Open mmarseu opened 1 week ago

mmarseu commented 1 week ago

What would you like to be added:

The python-installed-package-cataloger could employ a heuristic to determine whether the License field in package metadata contains a license descriptor (a name or identifier) or the full license text. For example, if the value exceeds both a certain number of newlines and a certain text length, it could be considered the full text.
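A minimal sketch of such a heuristic, written in Python for illustration only (syft itself is not written in Python). The function name and both thresholds are assumptions chosen for the example, not values taken from syft:

```python
def looks_like_full_license_text(value: str, max_lines: int = 3, max_chars: int = 200) -> bool:
    """Guess whether a License metadata value is a full license text.

    Both thresholds are hypothetical defaults; they would need tuning
    against real package metadata.
    """
    stripped = value.strip()
    line_count = len(stripped.splitlines())
    # Treat the value as full text only if BOTH limits are exceeded,
    # mirroring the "newlines and text length" wording above.
    return line_count > max_lines and len(stripped) > max_chars


if __name__ == "__main__":
    short_value = "BSD-3-Clause"
    long_value = "\n".join(
        ["Redistribution and use in source and binary forms are permitted."] * 12
    )
    print(looks_like_full_license_text(short_value))
    print(looks_like_full_license_text(long_value))
```

Requiring both limits keeps a long single-line license expression from being misread as full text, at the cost of missing very short license texts.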

When it's determined to be the full text, it should be added as such to the SBOM. In CycloneDX, that means creating a license object such as:

"license": {
  "name": "Found in <path>",
  "text": {
    "content": "<full text>"
  }
}
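As a rough illustration, the object above could be assembled like this. This is a Python sketch only; the helper name and the `source_path` parameter are hypothetical, and the structure simply mirrors the snippet above:

```python
import json


def cyclonedx_license_from_text(full_text: str, source_path: str) -> dict:
    """Build a license entry that carries the full text found in metadata.

    `source_path` is a hypothetical parameter: the location of the
    metadata file in which the text was found.
    """
    return {
        "license": {
            "name": f"Found in {source_path}",
            "text": {"content": full_text},
        }
    }


if __name__ == "__main__":
    entry = cyclonedx_license_from_text(
        "Example license text\nspanning several lines.",
        "example-1.0.dist-info/METADATA",  # made-up path for the example
    )
    print(json.dumps(entry, indent=2))
```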

Why is this needed:

The License field isn't clearly defined. In my experience, most packages just put down a license name or even an SPDX ID, but it is not uncommon to find the full text in there. For example, pandas uses it this way.

Additional context:

This would fit well with #656. If a full text is identified, it could immediately be classified.

The License field might be deprecated if PEP 639 gets approved. Even then, I believe this issue will stay relevant for years to come.

Joerki commented 1 week ago

I experienced this also with pandas and scipy.

Regarding the definition: in pyproject.toml (https://packaging.python.org/en/latest/guides/writing-pyproject-toml/), license information can be specified either as text (which should be an identifier) or as file (a license text file).

This is basically not a good definition, since there should be a clear distinction between IDs and full text.
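To make the ambiguity concrete, here is a small Python sketch that inspects an already-parsed `[project]` table. The function name and the returned labels are assumptions for illustration; note that the `text` form alone does not say whether the value is an identifier or a full license text:

```python
def license_source(project: dict) -> tuple:
    """Return (kind, value) for the license entry of a parsed [project] table.

    kind is one of "file", "text", or "unknown" (labels invented for
    this example).
    """
    entry = project.get("license")
    if isinstance(entry, dict):
        if "file" in entry:
            # license = {file = "LICENSE"}: a path to a license text file
            return ("file", entry["file"])
        if "text" in entry:
            # license = {text = "..."}: could be an ID or the full text
            return ("text", entry["text"])
    return ("unknown", "")


if __name__ == "__main__":
    print(license_source({"license": {"text": "MIT"}}))
    print(license_source({"license": {"file": "LICENSE"}}))
    print(license_source({}))
```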

It gets even worse when this file includes not just the project's main license but also the licenses of software that is bundled with the package. I assume there is no safe method to distinguish between those licenses (similar to non-machine-readable Debian copyright files) when the text contains several licenses with no clear separation.

I suspect that this multi-licensing is the reason that we get the full text here, and not just the ID.

Do you have another, working idea already?

mmarseu commented 5 days ago

@Joerki I believe that problem is beyond the scope of my issue but very much in scope of #656. It has been suggested there to use https://github.com/google/licenseclassifier, which attempts to deal with these kinds of aggregated license texts.

As for this issue, I'd already be happy if syft inserted the full text it finds as a single license text, even if it really contains multiple licenses.