anchore / syft

CLI tool and library for generating a Software Bill of Materials from container images and filesystems
Apache License 2.0

The ability to extract the contents of the license file (LICENSE.txt) itself #2958

Open MArfanM opened 2 weeks ago

MArfanM commented 2 weeks ago

What would you like to be added: On top of Syft's ability to find the license names, the ability to extract the contents of the license file (LICENSE.txt) itself.

Why is this needed: For some applications, it is necessary to extract the contents of the license files. It is not reliable to assume that two different packages with the same license type contain the same LICENSE.txt content, because some licenses require slight modifications to the original template before they can be used (e.g. MIT requires adding copyright info at the top).

If Docker images contain enough data for Syft to retrieve the license text, a feature like this would allow a quick and reliable way of retrieving license file contents without needing to look through each and every package's repository.

Additional context: For security reasons, I am writing a program that uses Syft to automate collecting license information for every dependency/package from a collection of Docker images. It is a legal requirement that the exact contents of the license files are extracted, and it seems that no tool allows this (tested Trivy, JFrog XRay, etc.).

There is a program that does something similar, though it is specific to pip-installed packages: pip-licenses (https://github.com/raimon49/pip-licenses). It returns the license file contents when run with these arguments: pip-licenses --with-license-file --with-notice-file --format json

tgerla commented 2 weeks ago

Hi @DatGameh, thanks for the request. This sounds like a reasonable feature (that would probably be disabled by default). Is it something you're interested in working on yourself? We would be happy to give you some pointers. In any case, we will put this in the backlog for the future.

MArfanM commented 2 weeks ago

Thank you for considering my request @tgerla ! As far as contributions go... I'm curious to know what pointers you have in mind!

I've never made contributions to OSS, and I'm currently discussing if the requirement is really necessary. But if I do get the chance to contribute, I'd like to know what ideas you have for this.

Edit: Reading the code, I found the functions responsible for getting the licenses. If I were to make code changes, would it be done here?

spiffcs commented 1 week ago

That section you linked is just for golang. Each ecosystem has different mechanisms (some undefined) for searching for licenses and associating them with discovered packages.

Enhancing the license struct

The first enhancement would likely be here: https://github.com/anchore/syft/blob/5061b905dc3f8e74dbdb9faf525e50dd0b14db27/syft/pkg/license.go#L17-L32

This is the core license model shared among syft packages. It currently does not have a field for LicenseText; Value and SPDXExpression are the current identifying fields.

Value is used when the identified license is found to NOT be a valid SPDX Expression.
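To make the idea concrete, here is a minimal sketch of what adding a text field could look like. The struct below is a simplified stand-in, not syft's actual definition: only Value and SPDXExpression are taken from the discussion above, while the Text field and the Identifier helper are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// License is a simplified stand-in for a license model.
// It is NOT syft's real struct; Text is a hypothetical addition.
type License struct {
	Value          string // raw license string, used when it is not a valid SPDX expression
	SPDXExpression string // set when the license is a valid SPDX expression
	Text           string // hypothetical: full contents of the license file, empty if not collected
}

// Identifier is a hypothetical helper that returns the SPDX expression
// when one is present and falls back to the raw value otherwise.
func (l License) Identifier() string {
	if l.SPDXExpression != "" {
		return l.SPDXExpression
	}
	return l.Value
}

func main() {
	mit := License{
		SPDXExpression: "MIT",
		Text:           "MIT License\n\nCopyright (c) 2024 Example Author\n",
	}
	custom := License{Value: "Custom vendor terms"}

	// Print the identifier and whether any license text was attached.
	fmt.Println(mit.Identifier(), mit.Text != "")
	fmt.Println(custom.Identifier(), custom.Text != "")
}
```

Keeping the text as a separate, optional field means the identifying fields stay unchanged and existing consumers can ignore it.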

Making it configurable

Including the full license text is not something we want as default behavior, so it should be turned off for all default runs of syft. Users should be given the option to toggle this feature on via configuration.
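As a rough sketch of the opt-in behavior described above, the toggle could be a boolean whose zero value leaves the feature off. Everything here is hypothetical: LicenseConfig, IncludeText, LicenseRecord, and NewLicenseRecord are illustrative names and do not correspond to existing syft configuration or APIs.

```go
package main

import "fmt"

// LicenseConfig is a hypothetical options struct.
// The zero value (IncludeText == false) keeps license text out of default runs.
type LicenseConfig struct {
	IncludeText bool
}

// LicenseRecord is a simplified, hypothetical license result.
type LicenseRecord struct {
	Value string // license name or identifier
	Text  string // full license file contents, only set when enabled
}

// NewLicenseRecord attaches the file contents only when the user opted in.
func NewLicenseRecord(cfg LicenseConfig, value, fileContents string) LicenseRecord {
	rec := LicenseRecord{Value: value}
	if cfg.IncludeText {
		rec.Text = fileContents
	}
	return rec
}

func main() {
	contents := "MIT License\n\nCopyright (c) 2024 Example Author\n"

	byDefault := NewLicenseRecord(LicenseConfig{}, "MIT", contents)
	optedIn := NewLicenseRecord(LicenseConfig{IncludeText: true}, "MIT", contents)

	fmt.Println(len(byDefault.Text), len(optedIn.Text))
}
```

Relying on the zero value for "off" means any code path that does not set the option explicitly keeps today's behavior.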

Which ecosystem

Given your issue said you need to extract licenses from software in a docker image, I imagine multiple ecosystems are required. This directory structure is a rough list of the different catalogers (ecosystems/specifications) that syft supports: https://github.com/anchore/syft/tree/main/syft/pkg/cataloger