Closed elrayle closed 1 week ago
In ScanCode output, there is a set of flags for each license match:
"is_license_text": true,
"is_license_notice": false,
"is_license_reference": false,
"is_license_tag": false,
"is_license_intro": false,
Some discussion of is_license_text, is_license_notice, is_license_tag, and is_license_reference can be found in this documentation. Discussion of is_license_intro can be found at nexB/scancode-toolkit#3082. These may be of help.
@lumaxis - copied from Issue #1023 (dup of this issue)
Today, ClearlyDefined's tooling only supports licenses that are part of the standardized SPDX License List. If a package's license is not on that list and, for example, ScanCode outputs a user-defined license ref in the format LicenseRef-*, ClearlyDefined's processing will result in NOASSERTION.
We could fairly trivially add support for this at the code level, I believe, but I think it warrants a larger discussion to see whether ClearlyDefined wants to support licenses that are not on the SPDX License List.
References:
Example of scancode license ref: LicenseRef-scancode-mit-old-style
See documentation: How To Add a New License for Detection
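To make the behavior described above concrete, here is a minimal Python sketch. It is not ClearlyDefined's actual code: the function name `normalize_license`, the `allow_license_refs` switch, and the three-entry `SPDX_IDS` set are assumptions made purely for illustration.

```python
import re

# Tiny stand-in for the SPDX License List (assumption: the real list is far larger).
SPDX_IDS = {"MIT", "Apache-2.0", "GPL-3.0-only"}

# SPDX allows letters, digits, "." and "-" after the "LicenseRef-" prefix.
LICENSE_REF = re.compile(r"LicenseRef-[A-Za-z0-9.-]+")


def normalize_license(license_id: str, allow_license_refs: bool) -> str:
    """Return the identifier to report, or NOASSERTION if it is not accepted."""
    if license_id in SPDX_IDS:
        return license_id
    if allow_license_refs and LICENSE_REF.fullmatch(license_id):
        return license_id
    return "NOASSERTION"


# Behavior described in this comment: the LicenseRef is dropped.
print(normalize_license("LicenseRef-scancode-mit-old-style", allow_license_refs=False))
# Possible behavior if LicenseRefs were supported: the identifier is kept.
print(normalize_license("LicenseRef-scancode-mit-old-style", allow_license_refs=True))
```

With the switch off, the ScanCode ref collapses to NOASSERTION; with it on, the same ref passes through unchanged while anything that is neither an SPDX id nor a well-formed LicenseRef still becomes NOASSERTION.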
@nickvidal - 2023-09-05 - copied from email - Beyond SPDX
In today's meeting, there was a request from GitHub for ClearlyDefined to have more licensing coverage, beyond SPDX. In other words, there's a need to cover not just open source licenses, but proprietary ones as well. GitHub is in the process of setting up a local harvest, and they plan on sharing this licensing metadata with others.
This is a common request that I have already heard from other community members. I would like to hear feedback from everyone about this. If this is something that you would also be interested in, please let us know.
Josh Berkus - 2023-09-05 - copied from email - Beyond SPDX
I can see the value, but it sounds like an impossible task. How many thousands of proprietary licenses are there across the industry?
@jeffwilcox - 2023-09-05 - copied from email - Beyond SPDX
The requirement is real but as Josh says, the task is challenging. IIRC there are two main challenges: automation and namespacing. Right now most of the scanners have some set of regular expressions, templates, or some such they use to identify given text as a particular license. For arbitrary text, that's hard to do. For some of the scenarios you can rely on just having a discoverable identifier (no need for matching), for others, not so much and automation is essential.
Namespace management is just hard. Some have proposed using internet domains (à la Java package naming) but many licenses are, or evolve to be, not "owned" by a particular organization.
In the past a few folks have discussed using the SPDX "LicenseRef-" syntax and some auto generated hashing of unrecognized license text. That combined with an alias registry gets you the ability to have automated detection and a human-readable, manageable namespace.
The idea is that unrecognized license text is hashed and then referenced as "LicenseRef-XYZABC123" (or some such). Off the bat, all such licensed packages are correlated and so can be "cleared" by legal teams together and collaboratively. Over time curators may come to see that hash as the "FooBar" license and register an alias for the hash. "LicenseRef-XYZABC123" and "LicenseRef-FooBar" are then interchangeable (with the latter being more user-friendly). Variants of FooBar with different hashes can also be aliased to FooBar. It is even possible that FooBar ends up being a recognized SPDX license. If it retains that name, all's good; if it gets a new name, register an alias for that too.
On the practical user side, sticking with SPDX valid syntax allows for the continued use of SPDX tooling and integrations.
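The hash-plus-alias idea above can be sketched in a few lines of Python. Everything here is hypothetical: the choice of SHA-256, the 12-character truncation, the whitespace normalization, and the `AliasRegistry` class are assumptions for illustration, not anything that was agreed or implemented.

```python
import hashlib


def license_ref_for(text: str) -> str:
    """Derive a LicenseRef from unrecognized license text."""
    # Assumption: collapse whitespace so trivial reformatting hashes the same.
    normalized = " ".join(text.split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]
    return f"LicenseRef-{digest}"


class AliasRegistry:
    """Maps hash-based refs to a curated, human-readable ref."""

    def __init__(self) -> None:
        self._alias_for: dict[str, str] = {}

    def register(self, hashed_ref: str, name: str) -> None:
        self._alias_for[hashed_ref] = f"LicenseRef-{name}"

    def resolve(self, ref: str) -> str:
        # Refs without an alias are returned unchanged.
        return self._alias_for.get(ref, ref)


registry = AliasRegistry()
ref_a = license_ref_for("Permission is granted  to use FooBar.\n")
ref_b = license_ref_for("FooBar may be used with permission.")  # a variant text
registry.register(ref_a, "FooBar")
registry.register(ref_b, "FooBar")
print(registry.resolve(ref_a), registry.resolve(ref_b))
```

Because the hex digest only uses characters that are valid in an SPDX LicenseRef, both the hashed form and the alias stay within SPDX syntax, which is the point made about keeping existing SPDX tooling usable.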
Eric Schultz - 2023-09-05 - copied from email - Beyond SPDX
Maybe other folks are interested in this but as a community member I'm personally a bit turned off by making it easier to use and create proprietary software. I use ClearlyDefined and participate to make open source software easier to make and maintain. From my point of view, if anything, I DON'T want to make it easier to make proprietary software.
Others' mileage may vary, of course, but I don't think I'm alone among current or potential community members.
@pombredanne - 2029-09-29 - copied from email - Beyond SPDX
FWIW, ScanCode, which is used in ClearlyDefined, has the largest open database of licenses that I know of this side of the galaxy quadrant ;) https://scancode-licensedb.aboutcode.org/ tracks both open source and proprietary licenses, about 2,000 of them. ScanCode can detect, exactly and approximately, a large range of known and unknown licenses thanks to its small language model based on ~30,000 license samples.
So this is already in ClearlyDefined and all these licenses come with a stable SPDX LicenseRef. It would be stupid not to leverage this or to create something else for this.
Maybe other folks are interested in this but as a community member I'm personally a bit turned off by making it easier to use and create proprietary software.
Perhaps I'm missing something, but my understanding is that we're trying to identify the actual licenses in content. Using a well-known stable source for SPDX compatible LicenseRef codes for that is a no-brainer for me.
Reporting discovered licenses is orthogonal to encouraging their use. I'd much rather know that there is some proprietary license in content than not.
I oppose adding non-open source licenses to ClearlyDefined. I've volunteered my time on Noticeme and used ClearlyDefined because I wanted to make the work of open source creators easier. Making machine readable proprietary license information encourages the creation of proprietary software. I don't see how this is consistent with the goals of the project or of it being an incubator project of OSI.
@wwahammy I don't see how clearly defining a component, be it open source or not, encourages the use of proprietary software. In fact, it might even help those who want to avoid them. The majority of people I talked with are in favor of this functionality and would find it useful.
@wwahammy is it your intention to suggest that the registry should not include content for which non-standard SPDX/proprietary licenses are discovered, or that the registry should just ignore and not list non-standard SPDX/proprietary licenses?
I believe ClearlyDefined should provide information on the license of software, provided the license is open source and fits into SPDX. Anything else is out of scope and other tools should be used for that.
To be clear, if the license isn't open source, I think ClearlyDefined should simply indicate that it's proprietary. If a user feels they want to use that software still, then it's on them to research it or find a different tool.
Throwing away license information makes no sense to me.
@wwahammy
To be clear, if the license isn't open source, I think ClearlyDefined should simply indicate that it's proprietary. If a user feels they want to use that software still, then it's on them to research it or find a different tool.
I commend these noble goals ... But IMHO this approach is neither practical nor useful. Detecting and reporting licenses is neither a political nor an ethical statement, just an observation of facts.
FWIW, SPDX has mostly given up tracking only OSI licenses.
More power to anyone who wants to discard, ignore, skip anything that does not fit their taste, policy, choices, view of the world, ethics or else. But one needs the facts to do this.
As one of many examples, common public domain dedications are not tracked nor supported by SPDX and are not approved as OSI licenses. Not a single lawyer I know is treating these as proprietary licenses. They are carefully cataloged and properly detected by ScanCode (at least 850+ variants of these at last count plus an infinity of variations detected approximately). Throwing this information away as it is done today in CD is like shooting yourself in the foot. It hurts.
Collecting data is not endorsing nor promoting anything in particular be it proprietary, open source, free software, source available or else. But rather, just accepting that the world of actual licenses is what it is in all its glorious messy diversity and capturing what these licenses are, without discarding valuable information detected by ScanCode. Discarding and losing data has been the problem until now and has been making ClearlyDefined data mostly harmless and useless at scale as you get better and more information out of a straight ScanCode scan.
You are welcome to use anything you like, but I think it would be better to adopt the de-facto industry standard of ScanCode license data, rather than to reinvent the wheel, especially since ClearlyDefined is kinda using ScanCode rather heavily.
We use LicenseRef-scancode as a prefix in https://scancode-licensedb.aboutcode.org/ and guarantee the stability of these identifiers, with the track record to prove it.
We could extract this data in its own top level repo if this can help BTW.
@wwahammy as a practical example close to you
At https://github.com/houdiniproject/houdini/blob/main/LICENSE , the license reported as AGPL-3.0-or-later WITH WTO-AP-3.0-or-later is an invalid SPDX expression using an unknown SPDX id (not even a licenseref) and is further not OSI approved last I checked.
This license is tracked and detected in ScanCode using https://github.com/nexB/scancode-toolkit/blob/f70bbb7d9d9bab40a9d504e664bc945b6a1630e8/src/licensedcode/data/licenses/houdini-project.LICENSE#L38
Should we treat this instead as a proprietary license and ensure that any code using it is banned from here and pointed to as something terribly bad to ignore and skip?
NB: I am being facetious here as houdiniproject is @wwahammy 's own project ;)
@pombredanne I'm all for adding a way to handle additional permissions on well-known FOSS licenses. Houdini's license is AGPL3+ with an additional permission. OSI and SPDX don't seem to really handle them/want to handle them. So I'm fine with that.
One of the motivators on this issue, if I understand history correctly, is to better document proprietary licenses in repositories. I think that specifically is out of scope for ClearlyDefined.
One of the motivators on this issue, if I understand history correctly, is to better document proprietary licenses in repositories. I think that specifically is out of scope for ClearlyDefined.
I've only been involved with ClearlyDefined for a fairly short time but this seems counter to what I'm reading on https://docs.clearlydefined.io/charter . At least as I interpret the words there, ClearlyDefined's charter fairly clearly supports adding any additional data that could be helpful to users of software:
ClearlyDefined will pursue any data that makes FOSS projects easier to consume and thus more successful.
I guess I'm not sure how knowing the specifics of proprietary licenses would "make FOSS projects easier to consume".
@wwahammy As a consumer of a dependency, if the license is NOASSERTION or OTHER, I can't be sure if the problem is that...
If a LicenseRef is specified, I know...
From this, I can gather more information on the license allowing determination of whether it is a corporate license, a proprietary license, or an open source license not part of SPDX. From that information, institutions can decide whether the license is acceptable. Without it, there is no way to make that determination.
This empowers each site to establish the level of strictness that is acceptable. For some, it may be a rejection of all non-SPDX licenses, in which case, any dependency with a LicenseRef can be excluded from the allow policy.
This approach addresses the needs of more members of the community increasing the power and acceptance of ClearlyDefined.
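As a rough illustration of the site-level policy described above, here is a minimal Python sketch. The function name `is_allowed`, its parameters, and the sample allow list are hypothetical, and it only handles a single license identifier rather than a full SPDX expression.

```python
def is_allowed(declared: str, allowed_spdx: set[str], allow_license_refs: bool = False) -> bool:
    """Decide whether a dependency's declared license passes a site's policy."""
    # Nothing is known about the license, so a strict policy cannot accept it.
    if declared in ("NOASSERTION", "OTHER"):
        return False
    # A LicenseRef names a specific non-SPDX license; each site chooses
    # whether to reject these outright or review them.
    if declared.startswith("LicenseRef-"):
        return allow_license_refs
    return declared in allowed_spdx


strict_site = {"MIT", "Apache-2.0"}
print(is_allowed("MIT", strict_site))
print(is_allowed("LicenseRef-scancode-mit-old-style", strict_site))
print(is_allowed("LicenseRef-scancode-mit-old-style", strict_site, allow_license_refs=True))
```

A strict site keeps the default and excludes every LicenseRef, while a more permissive site can opt in; in practice the opt-in would more likely be a reviewed list of specific refs than a single boolean.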
This functionality was released in service/2.0 and crawler/2.0. A blog post will be posted soon. In the meantime, look at the release notes.
Description
Currently, it is my understanding that a license needs to be an SPDX-recognized license to be reported by ClearlyDefined. Other licenses or missing licenses are reported as NOASSERTION or OTHER.
Many of those licenses are known and were identified by nexB/scancode-toolkit as LicenseRefs. This issue aims to identify an approach for supporting LicenseRefs in the output reported by ClearlyDefined.