clearlydefined / service

The service side of clearlydefined.io
MIT License

Special handling of "BSD" license declaration #459

Open dabutvin opened 5 years ago

dabutvin commented 5 years ago

right now SPDX.normalize("BSD") === "NOASSERTION"

Looking through a lot of the causes for NOASSERTION, BSD has popped up a lot. Instead of declaring this as junk, we can consider treating "BSD" as no information and letting our other indicators take precedence.

For example, if you pack a BSD-3-Clause license file, but declare your package as "BSD" in your manifest, we could fall back to the license detection on the license file

dabutvin commented 5 years ago

we should abstain for "BSD" in package.json

dabutvin commented 5 years ago

example of BSD package.json that should fall back to license file: https://clearlydefined.io/definitions/npm/npmjs/-/source-map/0.2.0

example of BSD in package.json that should not fall back to license file: https://clearlydefined.io/definitions/npm/npmjs/-/unique-stream/1.0.0

We need to keep the value BSD from the clearlydefined summarizer instead of turning it into NOASSERTION, then use that string to decide later.

the only declared license we can make from a package.json that says "BSD" is "BSD-3-Clause" or "BSD-2-Clause".

We can make this mapping if another summarizer declared this license in the util.mergeDefinitions
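A minimal sketch of that mapping, to make the idea concrete. The function name `resolveAmbiguousBsd` and its arguments are hypothetical and are not the actual signature of util.mergeDefinitions; the sketch assumes the raw manifest value and the SPDX ids declared by other summarizers are both available at merge time.

```javascript
// Hypothetical helper, not part of the service's real API.
// The only licenses a bare "BSD" in package.json may be mapped to.
const AMBIGUOUS_BSD_CANDIDATES = ['BSD-2-Clause', 'BSD-3-Clause']

/**
 * @param {string} manifestLicense raw value kept from the manifest, e.g. "BSD"
 * @param {string[]} detectedLicenses SPDX ids declared by other summarizers
 * @returns {string} a specific SPDX id, or "NOASSERTION" if nothing disambiguates
 */
function resolveAmbiguousBsd(manifestLicense, detectedLicenses) {
  // Anything other than a bare "BSD" is passed through untouched.
  if (manifestLicense.trim().toUpperCase() !== 'BSD') return manifestLicense

  const matches = AMBIGUOUS_BSD_CANDIDATES.filter(id => detectedLicenses.includes(id))

  // Map only when exactly one candidate is corroborated elsewhere;
  // with zero or two candidates the declaration stays unresolved.
  return matches.length === 1 ? matches[0] : 'NOASSERTION'
}

console.log(resolveAmbiguousBsd('BSD', ['BSD-3-Clause']))
console.log(resolveAmbiguousBsd('BSD', ['MIT']))
```

With this shape, a package whose license file is detected as BSD-3-Clause gets that id, while a package with no corroborating BSD variant stays at NOASSERTION.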

jeffmcaffer commented 5 years ago

Really interesting. Not sure what's best here. If we just bail and ignore it, then we are violating the rule that NOASSERTION means we found something but did not understand it. If we put in NOASSERTION and "merge it out", then we are ignoring the fact that we didn't understand it. IMO the right thing to do is keep the NOASSERTION and leave it to a refinement algorithm (e.g., suggestions) to see if the root cause of the NOASSERTION can be disambiguated in a larger context (e.g., if scancode detected BSD-3-Clause in the license file, we can ignore BSD from the ClearlyDefined tool).

That refinement might be in a subsequent aggregation/summarization step or in an auto-curation step

ariel11 commented 5 years ago

Is this why on this one - https://clearlydefined.io/definitions/pypi/pypi/-/sphinxcontrib-htmlhelp/1.0.2/1.0.2 - the "LICENSE" file is blank instead of saying "NOASSERTION"? I thought "NOASSERTION" was an indicator that there is license information on a particular file. If our scanners cannot tell the "LICENSE" file is BSD-2-Clause, it should have at least put "NOASSERTION" for the "LICENSE" file, in my opinion.

kpfleming commented 5 years ago

Jumping in here after @fossygirl sent me a link to this issue. I came across an NPM package with a similar situation ("BSD" in package.json, no other indication of a top-level license, ClearlyDefined website says "Declared: NOASSERTION"), and was confused.

In this case, there is an assertion, it just doesn't match any licenses from the SPDX license list. Claiming "NOASSERTION" is a bit harmful, in that a curator/reader may assume that the package maintainer didn't make any attempt to declare a license, but the maintainer did.

Rather than special-casing "BSD", how about using the normal SPDX mechanism for this: a LicenseRef gets created with ExtractedText containing the unrecognized/unmatched license text. In a normal SPDX document the scope of uniqueness for a LicenseRef is the document itself, but that wouldn't be practical for ClearlyDefined, since it would mean that sorting/filtering on "LicenseRef-3" would be useless, as there would be many definitions of LicenseRef-3. Instead ClearlyDefined might need to have a single LicenseRef list for the entire site, and the harvester would choose the proper LicenseRef by matching the ExtractedText to an existing LicenseRef.

In this model, NPM's "BSD" would be some random LicenseRef-, but all of them would point to the same LicenseRef-. We'd have to think about whether this affects the license score or not; I'd guess probably not, but at least the curator(s) would have a better starting point than "NOASSERTION".
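A rough sketch of what a site-wide LicenseRef list could look like. The class `LicenseRefRegistry`, its in-memory Map, and the whitespace/case normalization are all assumptions made for illustration; nothing like this is confirmed to exist in ClearlyDefined, and a real version would need persistent storage.

```javascript
// Hypothetical site-wide registry: one LicenseRef per distinct extracted text.
class LicenseRefRegistry {
  constructor() {
    // normalized extracted text -> LicenseRef id
    this.byText = new Map()
  }

  // Assumed normalization: trim, collapse whitespace, lowercase.
  static normalize(text) {
    return text.trim().replace(/\s+/g, ' ').toLowerCase()
  }

  // Return the existing LicenseRef for this text, or allocate the next one.
  refFor(extractedText) {
    const key = LicenseRefRegistry.normalize(extractedText)
    if (!this.byText.has(key)) {
      this.byText.set(key, `LicenseRef-${this.byText.size + 1}`)
    }
    return this.byText.get(key)
  }
}

const registry = new LicenseRefRegistry()
const first = registry.refFor('BSD')
const second = registry.refFor('  bsd ')
console.log(first === second) // both manifests resolve to the same LicenseRef
```

Sequential numbering keeps the ids short, at the cost of needing a single authority to hand them out.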

jeffmcaffer commented 5 years ago

There has been some discussion around the generalization of this. SPDX has a proposal for namespacing licenses. That does not really help here. What @pombredanne, @tsteenbe, and I were considering was that ClearlyDefined (or SPDX for that matter) have a dynamic mechanism that generates hashes for normalized license text and uses that as the LicenseRef value. That way it would be globally unique, instantly available and avoid collisions.
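A minimal sketch of the hashing idea, using Node's built-in crypto module. The choice of SHA-256, the normalization rules, and the function names are assumptions for illustration only; the thread does not specify any of them.

```javascript
const crypto = require('crypto')

// Assumed normalization: trim, collapse whitespace, lowercase.
// A real scheme would need an agreed-upon normalization.
function normalizeLicenseText(text) {
  return text.trim().replace(/\s+/g, ' ').toLowerCase()
}

// Derive a LicenseRef directly from the text, with no central counter.
function hashedLicenseRef(text) {
  const digest = crypto
    .createHash('sha256')
    .update(normalizeLicenseText(text), 'utf8')
    .digest('hex')
  return `LicenseRef-${digest}`
}

// Two harvesters seeing the same text independently produce the same id.
console.log(hashedLicenseRef('BSD') === hashedLicenseRef('  bsd\n'))
```

Because the id is a pure function of the text, no coordination between harvesters is needed, which is what makes it instantly available.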

In the eventual future there could be an Aliasing facility such that if/when a formal SPDX id is defined for the license formerly known as LicenseRef-42, we can associate the new, good, id with the old, cryptic hash.
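The aliasing facility could be as simple as a lookup applied when ids are read. This is only a sketch: `LicenseAliases` and the id `Example-License-1.0` are made up for illustration, and no such facility is known to exist.

```javascript
// Hypothetical alias table: old LicenseRef -> formal id assigned later.
class LicenseAliases {
  constructor() {
    this.aliases = new Map()
  }

  // Record that a formerly anonymous LicenseRef now has a formal id.
  register(licenseRef, formalId) {
    this.aliases.set(licenseRef, formalId)
  }

  // Return the formal id if one has been registered, else the input unchanged.
  resolve(id) {
    return this.aliases.has(id) ? this.aliases.get(id) : id
  }
}

const aliases = new LicenseAliases()
aliases.register('LicenseRef-42', 'Example-License-1.0') // made-up formal id
console.log(aliases.resolve('LicenseRef-42'))
console.log(aliases.resolve('MIT'))
```

Existing definitions would keep their stored LicenseRef and pick up the formal id on read, so nothing needs to be rewritten when an id is assigned.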

Would love to have more support for that direction. We can simply "do it" in ClearlyDefined but better would be broader support.