Open ericphanson opened 3 years ago
In terms of a timeline, I'd actually prefer to let the current version simmer for a bit. We haven't seen any false positives yet, and with the relaxed rules (just needs to find any OSI licenses in any top-level file, no requirement to have only OSI-licenses or to recognize a significant portion of the file) I don't expect many false positives.
Making these changes will certainly create false positives (e.g. Torch.jl), which is why it would come with an override system. But that still requires the users to stop and learn the new system, add some fields to their Project.toml etc. So I think it might be better to see how this more relaxed version shakes out before tightening the requirements (even if they do come with an override). Of course, we could also introduce the override system before tightening the requriments; I think that would be fine/helpful too.
It seems like and explicit license_path
field is unnecessary if we just use the already universal convention that the files are named LICENSE
or LICENSE.md
.
If doing this then should match Cargo.io
So it should be licence-file.
Not licence-path
.
https://doc.rust-lang.org/cargo/reference/manifest.html#the-license-and-license-file-fields
(Cargo also says having either should be acceptable. Not both required.)
possibly as a separate health-check
I've created https://github.com/JuliaRegistries/RegistryCIHealthChecks.jl as a possible place for us to put all the optional health checks (suggestions) that don't block AutoMerge.
Of course, we could also introduce the override system before tightening the requriments; I think that would be fine/helpful too.
I think this is a good approach. Introduce the override system now, but wait a little while before tightening the requirements.
+1 to pretty much copy Cargo unless there are strong reasons not to.
It seems like and explicit license_path field is unnecessary if we just use the already universal convention that the files are named LICENSE or LICENSE.md.
I wouldn't call it universal. GNU actually says to call it COPYING when using the GPL licenses in their FAQ (https://www.gnu.org/licenses/gpl-howto.html).
If doing this then should match Cargo.io So it should be
licence-file.
Notlicence-path
.
While naming doesn't matter that much to me, that also seems fine.
doc.rust-lang.org/cargo/reference/manifest.html#the-license-and-license-file-fields
(Cargo also says having either should be acceptable. Not both required.)
I think license-{file,path}
should be required. The reason we suggested having both is that we never want the Project.toml
to be the primary source of truth; the only reason we have a license
field at all is to override faulty autodetection; but that is not a substitute for having an actual license shipping with source. We should not treat it as sufficient for a project to ship something that has license = MIT
in it, we need the actual license itself.
If we start treating this field as the primary source of truth, we can get into situations where we, for instance, relicense a package from MIT
to GPL
and automated tools still say "this entire dependency tree is MIT-compatible!". To detect this, the automated tools can look at the override (license
) and the actual license file (license-{file,path}
) and can say "hey, you say this is MIT, but I am 89% certain this license file is actually GPL".
Also, we want to avoid the case where someone tries to publish something without a license file entirely and simply tries to short-circuit it by putting something in the license
field; we don't want to accidentally run afoul of distributing source without the proper license file as required by many of these software licenses. While I am sure some would argue that having license = MIT
and the prevalence of the MIT license on the internet fulfills the spirit of the law, I think this is not a large ask and is best practice to enforce that the file always exists and is the source of truth for all licensing.
To clarify: you're arguing for requiring that the file exist not that the field specifying where the file is exist? If one of the standard license files exists would that be sufficient without having the explicit path in the project file? There are a couple of reasons I'd prefer the field to be optional:
Convention over explicit mandatory configuration: I'd rather not require people to spell out what it already the standard way to do something.
If the field is optional then 99% of packages already satisfy this design and don't need to change anything. If we require it then every package needs to add this field to be compliant, which is just a lot of unnecessary pull requests and we basically guarantee never getting to 100% compliance without extraordinary efforts that I don't think we want to go to.
I don't care about allowing the file to be names COPYING
since even though the GPL tells you to do that, in practice hardly any Julia packages do that. If we're worried about ambiguous situations where more than one of LICENSE
, LICENSE.md
or COPYING
exist, it would be fine to only allow the file name to be implicit if only one of those files exist and require an explicit path otherwise.
Just to clarify, the discussion in this issue only applies to packages that fail the license auto-detection.
At last count, approximately 92-93% of packages pass the license auto-detection. These packages will not need to add any fields to the Project.toml
.
The only packages that will need to add these fields to the Project.toml
file are the 7-8% of packages that fail the license auto-detection.
If one of the standard license files exists would that be sufficient without having the explicit path in the project file?
Yes, that's correct. If you have (for example) a LICENSE.md
file in your package, and at least 70% of the text in that file matches one or more OSI-approved licenses, then your package will pass the license auto-detection, and you will not need to make any modifications to your Project.toml
file.
Right; the two modes of operation are: no fields necessary (because license autodetection works) or both fields necessary (to assert that (a) a license file exists in upstream at all, and (b) that the license file is of type FOO
).
Ok, thanks for the additional explanation. I would propose a modification then:
LICENSE
, LICENSE.md
, COPYING
or COPYING.md
exists, that is considered to be the license file.license-file
field MUST be present in the project file; that path is the license file and must exist.license
field in the project file, whose value must be the name of an OSI-approved license.This decouples determining the license file from classifying that file's license type.
If we change (1) in Stefan's post to our current "license file autodetection" which checks either all top-level files including those three when they exist (if there are fewer than 100 top-level files), or those three names plus ~500 more, then I think we're back at the proposal of the first post, https://github.com/JuliaRegistries/RegistryCI.jl/issues/352#issue-826565273.
So it seems like there are two main proposals: the above three steps (up to some discussion over what the license autodetection criteria should be) vs cargo's approach, where you must have either one of the two in the Project.toml and it's fine to have license
without license-file
.
Cargo's approach seems very disruptive for us because it means ~4k packages need to add a field to their Project.toml to be registered. But we could drop that bit of their system and say you can have zero if autodetection succeeds, or one of the two, and then we get back to the source of truth issue.
Cargo's approach seems very disruptive for us because it means ~4k packages need to add a field to their Project.toml to be registered.
In my opinion, this seems like a nonstarter, since we would need to make modifications to thousands of packages.
But we could drop that bit of their system and say you can have zero if autodetection succeeds, or one of the two, and then we get back to the source of truth issue.
Personally, I do not like this either. I think we should have a single source of truth (the license file).
If we change (1) in Stefan's post to our current "license file autodetection" which checks either all top-level files including those three when they exist (if there are fewer than 100 top-level files), or those three names plus ~500 more, then I think we're back at the proposal of the first post, #352 (comment).
I think this is the best approach.
Looks like some packages are licensed under two different licenses, e.g. https://github.com/jean-pierreBoth/RandomProjectionTree.jl and https://github.com/jean-pierreBoth/HnswAnn.jl, there could be others.
How should those be represented?
Also some packages include separate licenses for components that they redistribute, e.g. https://github.com/felipenoris/BLPData.jl and quite a few others. And jll packages (artifacts) may need to specify multiple licenses of components that they redistribute. Should there be a way to specify multiple licenses for such cases?
Those are great points, that the override might need to handle multiple licenses. I just want to point out though that the existing license autodetection seems to work pretty well on all of those:
julia> using LicenseCheck, DataFrames, PackageAnalyzer # not yet registered, <https://github.com/JuliaEcosystem/PackageAnalyzer.jl>
julia> license_df(pkg) = transform(DataFrame(analyze(pkg).license_files), :licenses_found => ByRow(lics -> all(LicenseCheck.is_osi_approved, lics)) => :is_osi_approved)
license_df (generic function with 1 method)
julia> license_df("https://github.com/jean-pierreBoth/RandomProjectionTree.jl")
3×4 DataFrame
Row │ license_filename licenses_found license_file_percent_covered is_osi_approved
│ String Vector{String} Float64 Bool
─────┼─────────────────────────────────────────────────────────────────────────────────
1 │ LICENSE-APACHE ["Apache-2.0"] 100.0 true
2 │ LICENSE-MIT ["MIT"] 100.0 true
3 │ README.md ["Apache-2.0"] 6.34921 true
julia> license_df("https://github.com/jean-pierreBoth/HnswAnn.jl")
3×4 DataFrame
Row │ license_filename licenses_found license_file_percent_covered is_osi_approved
│ String Vector{String} Float64 Bool
─────┼─────────────────────────────────────────────────────────────────────────────────
1 │ LICENSE-APACHE ["Apache-2.0"] 100.0 true
2 │ LICENSE-MIT ["MIT"] 100.0 true
3 │ README.md ["Apache-2.0"] 2.71186 true
julia> license_df("https://github.com/felipenoris/BLPData.jl")
2×4 DataFrame
Row │ license_filename licenses_found license_file_percent_covered is_osi_approved
│ String Vector{String} Float64 Bool
─────┼─────────────────────────────────────────────────────────────────────────────────
1 │ LICENSE.openssl ["OpenSSL"] 93.2494 false
2 │ LICENSE ["MIT"] 44.5333 true
In other words, it finds every file with what could be a top-level license, analyzes it, reports back the licenses found in each file and the percent of the file covered by the licenses it successfully detected. The Apache licenses found in the READMEs seems to be due to https://github.com/google/licensecheck/issues/40 and it's still not clear to me if that's a bug or not.
All of these packages would pass the current license check rules (there exists an OSI license, with any amount of the file covered by that license). However, BLPData.jl would not pass the more stringent check. The license file that is most covered is the OpenSSL license which is not OSI approved, and the file with MIT license also contains an unrecognized license (which looks not OSI-compatible, https://github.com/felipenoris/BLPData.jl/blob/0bbcfb8f41fbba496abb78a6675c4f7e41abb76e/LICENSE#L34-L35, although maybe that's OK since it's not a license for the package code itself). So that package would need an override, and one could definitely engineer situations in which the others would too.
That makes me wonder though: from the point of view of RegistryCI/General, we just want to check is there an OSI-compatible license for the code or not, i.e. should we go ahead and register it. We don't actually care what the license is. (Of course, as a user of packages, I care about the terms under which I can use them, but that currently isn't the registries job to manage). So if autodetection fails and we need an override, e.g. BLPData.jl should not need a manual merge every version, then I think it's enough if BLPData.jl claims that it is OSI compatible and there is some way to check that by hand if needed.
So what if there was just a pair of fields osi_licensed_code::Bool
and license_location::String
, and the claim that the package author is making is that there is an OSI-compatible license at that location (but not necessarily which one, e.g. it doesn't matter if they are dual licensed under MIT and apache, or MIT and some proprietary license). And if they attest to that, we will register it. And if we manually check and see that license_location
does not in fact contain an OSI license for the code, then we can block registration and even yank the package if needed, as that is a violation of General's policy, same as before the RegistryCI licence check existed.
Why is this better than providing the name/identifier of the license in the override field?
So what if there was just a pair of fields
osi_licensed_code::Bool
andlicense_location::String
, and the claim that the package author is making is that there is an OSI-compatible license at that location (but not necessarily which one, e.g. it doesn't matter if they are dual licensed under MIT and apache, or MIT and some proprietary license). And if they attest to that, we will register it. And if we manually check and see thatlicense_location
does not in fact contain an OSI license for the code, then we can block registration and even yank the package if needed, as that is a violation of General's policy, same as before the RegistryCI licence check existed.Why is this better than providing the name/identifier of the license in the override field?
* it is simpler if the user doesn't need to learn what an SPDX identifier is just to register their package * it doesn't matter if there is more than one license * it ensures the Project.toml is not the source of truth for licensing information (because it doesn't even say which license it is!)
Sounds good to me!
@ericphanson I think I misread your last comment. I liked the idea of the Project.toml
file not needing to contain the specific name of the license, but instead just containing osi_licensed_code::Bool
and license_location::String
. But I still think that AutoMerge would need to parse the file at license_location
and verify that it was indeed an OSI license.
That is, there shouldn't be any way for a package author to bypass AutoMerge checking the license; we're just providing ways for package authors to point AutoMerge in the right direction.
Well, I don't think we've had any issues (or if so I've missed them) where AutoMerge did not find the license file, as long as it was in the package directory. So I don't think we need anything to help find where the license is.
I think where we may need manual help is verifying that the license is indeed OSI.
We haven't been having much problems there either, but this discussion was in particular regarding the idea of tightening the rules, so that if we find both an OSI license and a non-OSI license, we would not pass the check, since we don't know if it is specified that one can obey by either (dual licensed) or if there are parts licensed under each and both terms need to be upheld.
Therefore, the discussion was about: if we do make this more stringent rule, and start necessarily incurring false-positives, is there a way we can make it less burdensome on both registry & package maintainers. So I don't think using the same tech to check that file helps anything; this is to help the case where we have hit a false positive there.
But, I'm not totally sure we need the more stringent rule. Kinda just sounds like more headaches. I think having the lighter rule is probably OK, but if we do find non-OSI-packages which have slipped past, we should do something about it, e.g. at least not allow them to register more non-compliant versions (which I guess needs some more work to have some list of non-compliant packages that don't get to have new versions until they get off the list).
Ah, I see. In that case, I think I'd prefer that we go with the original plan as described in https://github.com/JuliaRegistries/RegistryCI.jl/issues/352#issue-826565273 and https://github.com/JuliaRegistries/RegistryCI.jl/issues/352#issuecomment-795865372.
Even if we don't adopt the more stringent rule, I think it would still be good to have this functionality available.
(which I guess needs some more work to have some list of non-compliant packages that don't get to have new versions until they get off the list)
It could be good to have a disallowlist of some kind that AutoMerge checks, and if the package name is on the disallowlist, then AutoMerge won't merge any new PRs for that package.
Yeah, I think that is a thing we will need regardless of the specific license-checking detection methodology, in order to have any hope of enforcing General's policies in nuanced situations. E.g. already General says you must have an OSI-approved license, but if you don't, and you manage to pass automerge anyway, there's nothing we realistically can do. BTW, semi-relatedly, we should announce the policy from https://github.com/JuliaRegistries/General/pull/63230 on Discourse, specifically
The more restrictive license of the wrapped code:
- MUST be mentioned in either the third party notice file or the license file (preferably the third party notice file).
since that might not be known by some folks, and if we want to start enforcing that it's better if there's awareness & notice first.
On the #pkg-dev call today, we (@DilumAluthge, @giordano, @staticfloat, @IanButterworth, and I) had a discussion about #344 and license-check overrides. I think the decision was basically to:
license
field to the Project.toml that holds the SPDX identifier, e.g.license = "MIT"
license_path
field that contains the relative path to the license file (which should be in the package directory, not e.g. a toplevel directory when the package is asubdir
package)license
contains an OSI-approved SPDX identifier, andlicense_path
points to a file that exists, we use this as an override and pass the license checkLicenseCheck.find_licenses
but with the stronger restrictions orignially from #344, i.e. we need to detect 70% of the file and only find OSI-approved licenses.Why add two fields?
The idea is to keep the license file itself as the source of truth, and simply provide a way to override both the detecting-the-file step and the detecting-the-file-contents step. I.e.
license="MIT"
is not the source of truth; it as an assertion that the file located atlicense_path
contains the MIT license.I think from this point of view, if the
license_path
field exists but not thelicense
one, then we should just look at the file located at that path and try to detect the licenses within. If justlicense
exists, I think we should ignore it because we don't know which file it's talking about.What if the file at
license_path
does not seem to contain the license claimed?This should be a warning, not a fatal error, since the point is to override possibly-faulty detection. This warning could be issued by RegistryCI (which currently does not emit non-fatal errors to the comment, but could be made to do so), or possibly as a separate health-check, like https://github.com/giordano/AnalyzeRegistry.jl/pull/30.