cashapp / licensee

Gradle plugin which validates the licenses of your dependency graph match what you expect
https://cashapp.github.io/licensee/docs/1.x/
Apache License 2.0
611 stars 28 forks source link

Dependencies with license name provided as `MIT` can't get identified #137

Open mervyn-mccreight opened 1 year ago

mervyn-mccreight commented 1 year ago

Here is the output for using e.g. version 3.1.0, I assume it will be the same for 3.2.0 as well:

io.github.g0dkar:qrcode-kotlin-jvm:3.1.0
 - ERROR: Unknown license URL 'https://github.com/g0dkar/qrcode-kotlin/blob/main/LICENSE' is NOT allowed

According to https://mvnrepository.com/artifact/io.github.g0dkar/qrcode-kotlin-jvm/3.1.0 the license name is published as MIT, so I assume it should get identfied as MIT when I'm looking at the fallback logic in licenses.kt.

I used the version 1.6.0 of this plugin to get the described results.

mervyn-mccreight commented 1 year ago

The same goes for https://mvnrepository.com/artifact/org.amshove.kluent/kluent/1.68

org.amshove.kluent:kluent-common:1.68
 - ERROR: Unknown license URL 'https://github.com/MarkusAmshove/Kluent/blob/master/LICENSE' is NOT allowed
JakeWharton commented 1 year ago

Name-based lookup is only applied if the URL is not present. In both of these cases the URL supersedes that and is (obviously) not a canonical URL for the MIT license.

When we built the library we decided to favor URL since the overwhelming majority linked to an external canonical license copy. I would attempt to reach out to those libraries about switching to the SPDX link or the opensource.org link for the license. In my experience library authors usually understand the argument for pointing at these rather than their own copy.

mervyn-mccreight commented 1 year ago

Ah, I see.

IMO you should consider taking a look at the name even if it can't map the URL to a license, because referring to the maven documentation about the licenses part of the pom, nothing is really specified for the url-part, but for the name part it is at least recommended to use a SPDX-identifier.

What do you think?

JakeWharton commented 1 year ago

That recommendation is very new and when sampling a few thousand artifacts almost none matched. I suspect the fact that MIT matches is more luck than intent. The overwhelming majority were "Apache 2.0" instead of the intended "Apache-2.0", for example.

mervyn-mccreight commented 1 year ago

I understand.

I did not want to criticize the way you came to the conclusion that looking for the license-url with priority makes sense. I don't mean to replace that - what I think of is more like:

Its more like an additional fallback to try to increase the detection rate.

JakeWharton commented 1 year ago

I wouldn't say I'm 100% opposed to it, but I did give this a fair bit of thought when figuring out how the project was going to initially work.

My concern is always that the URL and the name may disagree on the license, and that the Maven POM specification offers no guidance on precedence. I wanted to ensure we never surprised someone so in most cases the code prefers to fail and have a human decide. This is why you can allow a specific URL if you trust it to be stable. Or you can allow a specific coordinate if it uses a custom URL that you do not trust to be stable.

Another thing that was considered early on is attempting to download unknown URLs and match their contents. It was ruled out because the number of custom URLs was very low and things like GitHub/GitLab links involve a ton of surrounding HTML.

hfhbd commented 1 year ago

There is a problem with the url though: the official urls are often just a template. For a valid license, you need to adopt the owner and year. At least this was the feedback I got: https://github.com/pgjdbc/pgjdbc/pull/2514#issuecomment-1135073402

TheMrMilchmann commented 6 months ago

There is a problem with the url though: the official urls are often just a template. For a valid license, you need to adopt the owner and year.

I would be interested in a version of this plugin that does match by identifier instead for this reason. If this requires more manual overrides, so be it. Without promising anything yet, would you accept a PR that makes the matching strategy configurable or do you consider this to be out-of-scope for this plugin, @JakeWharton?

JakeWharton commented 5 months ago

I think it still has a lot of behavioral questions to work out. It's not like we can just flip a conditional here. Do we fall back to the URL if the name does not match? What happens if the name and URL match different licenses? I simply have never considered matching on the name and the entire design of the existing logic is around the URL, so I'm not keen to rush into something here.

TheMrMilchmann commented 2 months ago

First of all, sorry for going quiet for a while, but work and vacation kept me busy 😅. I fully agree that this is not something that should be rushed, but I did actually have an idea in mind already:

I propose to expose an API that the build author may use to tune the strategy. Specifically, we can mimic the API for configuring metadata sources for repositories. This could look as follows (exact names and syntax are tbd):

licensee {
    licenseSources {       // 1
        licenseName()      // 2
        licenseUrl()       // 3
    }
}

We introduce a LicenseSources interface that may be used to configure the sources from which to infer license information. In the DSL, this is exposed under the licensee extension (1). The interface defines several methods that may be used to enable certain sources of information. Initially, those would use be license name (2) and license URL (3), but additional sources could be added in the future without breaking configuration or introducing needless complexity.

licenseSources would be a regular function under the licensee extension and probably have the signature fun licenseSources(action: Action<LicenseSources>). If the function is not called, we resort to the current behavior as default.

As an added benefit, we could allow the build author to call licenseSources {} (without adding any sources) to indicate that all license mappings must be specified explicitly. However, I suggest deferring this decision and treating it as a configuration mistake for now.