Open jakirkham opened 2 years ago
One challenge I see is with applications like git (GPL) that might need to be installed alongside a package whose users don't want GPL code in their own library, but who still want to call git from a shell.
It would be good to be able to specify that one application tree can be "gpl" while another application tree must be non-GPL within the same environment.
However, I totally understand how complicated that seems, and it might just be simpler to ask those users to create a separate environment for git.
Conda does allow environment stacking. Maybe this is one way to go about this?
The general approach seems surprisingly straightforward (didn't know about disallowed_packages).
Still, I'd advocate for minimum viable complexity here.
When we go through and hot-fix the repodata to translate license to license_family entries, [...]
Where do the license_families come in in your proposal? A distinction like _license_GPL-2.0-or-later would already be more granular than the license family. On the other hand, there are many custom licenses that fall under the BSD family. Perhaps having only license families would be a decent trade-off? I.e. no GPL-version selector, but also a much more manageable list of _license_* packages (not just for maintainers, but also users).
One thing to figure out here would be how we handle custom licenses.
SPDX covers a very broad range of licenses already; in almost no case did I fail to find what I was looking for. I think leaving "OTHER" would be fine in the beginning. If someone doesn't like it, they can improve the feedstock metadata.
Also this could potentially benefit from other Conda solves that use more acceptable licenses.
What do you have in mind here? Talking about solving - usable solver errors should be a goal.
Sorry, license_family is a bit of a distraction. Was just trying to point out we already do hot-fixing with license as input to that process (the output being setting the license_family field). This would just be a different output in that hot-fixing process. Anyways agree just using license should be sufficient
Yeah we do have a few cases of custom licenses like this in conda-forge. Would search for LicenseRef and they will come up.
Right the idea would be if a user excludes all GPLs and tries to install a package that comes in GPL and non-GPL flavors, the non-GPL package gets installed.
Anyways agree just using license should be sufficient
And I was contemplating whether license_family would be a better target. Surely someone will interject that they want the GPL v2-vs-v3 split reflected, but my point was that it's a bit of a numbers game. If we had ~20 license packages based on the families, or ~200 based on the individual licenses, I know which one I'd prefer (better numbers needed, I made those up).
Yeah we do have a few cases of custom licenses like this in conda-forge. Would search for LicenseRef and they will come up.
Pillow would actually already be a good demonstration here. Its dependencies already have 5 custom licenses (one of which is classed as BSD).
Right the idea would be if a user excludes all GPLs and tries to install a package that comes in GPL and non-GPL flavors, the non-GPL package gets installed.
I get the mechanics. I'm wondering if the solver errors would be decipherable by an end-user (because the solver error doesn't always point to the right source of conflict).
So I tried a little experiment. It seems disallowed_packages is not quite as built-out as we might hope. I should add that both Conda and Mamba behave the same.
Anyways, I made the following change to ~/.condarc:
disallowed_packages:
- readline
Then tried to install ipython (note readline is a dependency of python):
In short, the message is clear, but disallowed_packages is not considered when solving an environment (only when executing the install). So there is some work needed to get that functionality. Issue (https://github.com/conda/conda/issues/7526) appears to be relevant.
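The behaviour described above amounts to a check applied to an already-solved plan rather than a constraint given to the solver. A minimal Python sketch of such a post-solve check follows. The dependency graph is made up for illustration (only the readline-under-python relationship comes from the experiment), and none of this is conda's actual code:

```python
# Toy dependency graph for illustration only; real data would come from repodata.
DEPENDS = {
    "ipython": ["python", "traitlets"],
    "python": ["readline", "openssl"],
    "traitlets": ["python"],
    "readline": [],
    "openssl": [],
}


def resolve(requested, depends):
    """Return the requested packages plus all of their transitive dependencies."""
    seen = set()
    stack = list(requested)
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(depends.get(name, []))
    return seen


def disallowed_in_plan(requested, depends, disallowed):
    """Check a finished plan against the disallowed list (post-solve check)."""
    resolved = resolve(requested, depends)
    return sorted(resolved & set(disallowed))


# readline is pulled in through python, so the plan is rejected only after solving.
print(disallowed_in_plan(["ipython"], DEPENDS, ["readline"]))
```

For the license use case, the disallowed set would instead need to be visible to the solver, so that it can pick an alternative variant rather than fail after the fact.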
Another important package for the ecosystem is ffmpeg; many Python programs call it from the command line. Do you suggest they use stacked environments too?
I'm just trying to get a handle on a full workflow.
Maybe? Really just thinking through options here.
If the usage model is that a user has a collection of CLI-only tools, maybe putting them in a separate environment that the working environment gets stacked on top of is reasonable.
One advantage of this is that one can easily separate out the CLI-only bits when building new stuff, to avoid accidentally linking libraries from there.
Two cents from someone in a corporate environment and maintainer.
Big +1 from my side, as this approach doesn't seem to require any involvement from maintainers or users who don't use it, which means that it can be rolled out silently and with full backward compatibility.
Regarding the solver error etc: Since using this will require active participation from the corporations using this (just using corporations as a blanket term here) I think it's fair to ask them to go to the docs etc. to figure out what exactly the error means, how this can be stacked etc. Of course, it would be nice to have a more intuitive approach but this seems to already do the trick at least most of the time so I wouldn't throw out the baby with the bathwater and not implement anything.
Re what to use: Since SPDX has everything in a GitHub repo (https://github.com/spdx/license-list-data), it should be simple enough to have a script that runs every so often and converts every single license into a package. Since we already have a license to license_family mapping, it also seems possible to have the script generate the family packages automatically. Then everyone can decide for themselves which key they prefer. For the custom licenses, it might be interesting to see if we can generate a list of them using CF tools somehow; otherwise, I would go for other for now or manually create a list of the most common ones.
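A rough sketch of what such a script could compute, in Python. It assumes the input has the shape of the licenses.json file in the SPDX repository (a "licenses" list whose entries carry "licenseId" and "isDeprecatedLicenseId"). The family mapping and the _license_family_ prefix are placeholders, not an agreed naming scheme:

```python
def license_package_names(spdx_data, family_of):
    """Derive metapackage names from an SPDX license list.

    spdx_data: dict shaped like SPDX's licenses.json (assumed layout).
    family_of: hypothetical mapping from SPDX id to license family.
    """
    names = set()
    for entry in spdx_data["licenses"]:
        # Skip identifiers SPDX has deprecated (e.g. "GPL-2.0+").
        if entry.get("isDeprecatedLicenseId"):
            continue
        license_id = entry["licenseId"]
        names.add(f"_license_{license_id}")
        # Licenses without a known family fall back to a catch-all.
        family = family_of.get(license_id, "OTHER")
        names.add(f"_license_family_{family}")
    return sorted(names)


# Tiny in-memory sample instead of downloading the real list.
sample = {
    "licenses": [
        {"licenseId": "MIT", "isDeprecatedLicenseId": False},
        {"licenseId": "GPL-2.0-or-later", "isDeprecatedLicenseId": False},
        {"licenseId": "GPL-2.0+", "isDeprecatedLicenseId": True},
    ]
}
families = {"MIT": "MIT", "GPL-2.0-or-later": "GPL"}

for name in license_package_names(sample, families):
    print(name)
```

Generating both sets of names from one pass keeps the per-license and per-family packages in sync, which is the main reason to script this rather than maintain the lists by hand.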
Thanks Bastian! It's helpful to have that perspective from companies looking for this feature.
Glad to hear this is still useful even with some unfinished edges. Admittedly those could be finished with a bit of effort if it is deemed important.
Good point on providing license and license_family as options. Maybe the latter can be named _license_family_*?
Yeah, there is probably a way to extract info about custom licenses. If a catchall (like other) is good enough, maybe we can start with that.
cc @conda-forge/core (in case others have thoughts on this)
Good point on providing license and license_family as options. Maybe the latter can be named _license_family_*?
+1 Would probably make the most sense.
Also linking this gitter discussion from today, as it seems like there is a real use case for a package that can provide licenses: https://gitter.im/conda-forge/conda-forge.github.io?at=620e828a6e4c1e1c846ccd54. It should also be easy to implement; the only thing to check is the license of the license shorthands themselves, but that seems to be CC0-1.0.
I want to weigh in here on a core underlying concern I have.
Our license metadata is at best incomplete and at worst downright misleading. We likely have packages that are mislabeled and others that are marked as say BSD-3-Clause but have additional non-commercial clauses or the like attached to them.
As an organization, we cannot provide any assurances whatsoever that any of these license metadata fields, tags, metapackages, etc. produce environments that actually have only the associated licenses.
Given that the context here is corporate users with license restrictions, I very much doubt any of the proposed things here would actually be suitable at the required standard.
I don't think of this so much as an assurance, but more of giving people a tool to follow their interpretation of the different copyright stuff.
Things labeled as BSD but have non commercial clauses should not be labelled BSD....
TBH It gives people with full time jobs more justification to maintain packages beyond those strictly necessary on conda forge.
but more of giving people a tool to follow their interpretation of the different copyright stuff.
Sure. This will help out folks who need to inspect and select licenses. I do believe this is useful. My only concern is to add a proper message/disclaimer stating that we are not responsible for the metadata and not liable for the information there.
Things labeled as BSD but have non commercial clauses should not be labelled BSD....
Of course they shouldn't be. That's not the issue. The issue is that we have 15k feedstocks with even more outputs and we cannot promise that things are correct.
I agree with @ocefpaf that there is utility in having the functionality, even if it's not perfect, but it is important to have the disclaimer that we make no guarantees as to its accuracy for legal purposes.
Another fun corner case would be ones with dual licensing of both the AND and OR variety.
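To make the corner case concrete, here is a small Python sketch that evaluates an SPDX-style expression against a set of unwanted licenses. It rests on one possible reading, which is an assumption and not something settled in this thread: under AND every listed license applies, so all must be acceptable, while under OR one acceptable alternative is enough. A "WITH exception" suffix is simply ignored here:

```python
import re

# Tokens are parentheses or runs of non-space, non-parenthesis characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def is_acceptable(expression, disallowed):
    """Return True if the expression can be satisfied without a disallowed license."""
    tokens = TOKEN.findall(expression)
    pos = 0

    def parse_or():
        nonlocal pos
        ok = parse_and()
        while pos < len(tokens) and tokens[pos] == "OR":
            pos += 1
            rhs = parse_and()  # always parse, so tokens are consumed
            ok = ok or rhs
        return ok

    def parse_and():
        nonlocal pos
        ok = parse_atom()
        while pos < len(tokens) and tokens[pos] == "AND":
            pos += 1
            rhs = parse_atom()
            ok = ok and rhs
        return ok

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            ok = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return ok
        if token in (")", "AND", "OR", "WITH"):
            raise ValueError(f"unexpected token {token!r}")
        # Simplification: skip "WITH <exception>" and judge the base license only.
        if pos < len(tokens) and tokens[pos] == "WITH":
            pos += 2
        return token not in disallowed

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("trailing tokens in expression")
    return result


no_gpl = {"GPL-2.0-or-later", "GPL-3.0-or-later"}
print(is_acceptable("MIT OR GPL-2.0-or-later", no_gpl))
print(is_acceptable("MIT AND GPL-2.0-or-later", no_gpl))
```

With a single _license_* run dependency per package, an AND expression would need one dependency per operand, while an OR expression has no obvious encoding, which is part of what makes this case awkward.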
TBH It gives people with full time jobs more justification to maintain packages beyond those strictly necessary on conda forge.
I want to make sure this point doesn't get lost in the (justifiable) skepticism of us pulling this off. The goal here is to make it easier for other participants to get involved in the ecosystem. This ability to constrain licensing seems to be one of the obstacles and it has come up in discussions with others before (for example https://github.com/conda-forge/conda-forge.github.io/issues/209).
Agree adding a disclaimer is a good idea.
One other thing from my side about the validity.
I don't think (or hope) that anyone would take the license information supplied by CF as legal advice so I don't see a risk there. But a prominent disclaimer would probably be beneficial just in case someone decides to blame CF (IANAL though).
What this would make easier though is the "preselection" of packages.
For example: If I want to install a couple of hundred packages legal will need to check all of them and then I will need to find alternatives for the incorrectly licensed ones which could be up to 100%.
If I can specify a license and download only the appropriate ones, I would assume that maybe at least 60%-70% are packages that already have the correct licenses and I only need to find substitutes for 30% of the packages (Where legal found out that the CF supplied license is incorrect). Of course, there is still work involved in double checking the licenses and finding substitutes for incorrectly labelled licenses but the number of substitutes I need to find decreases (hopefully) significantly.
What this would make easier though is the "preselection" of packages.
Indeed. It is a first filter that will help users.
I don't think (or hope) that anyone would take the license information supplied by CF as legal advice
That is the main issue IMO. We cannot hope; we need to be sure that won't happen and protect ourselves. A disclaimer is probably enough, though I'm also not a lawyer.
Our license metadata is at best incomplete and at worst downright misleading. We likely have packages that are mislabeled and others that are marked as say BSD-3-Clause but have additional non-commercial clauses or the like attached to them.
The packages that I've spot-checked seem pretty accurate in their reported licenses (not many, admittedly, and at that focus primarily on popular, common projects). I suspect that the majority of downloads correspond to accurate metadata wrt licenses (if only because the majority of downloads target a correspondingly small number of packages that are well-known, widely used, and therefore more carefully vetted).
But more importantly, I think that even the heavily-legalesed corporate world puts a lot of emphasis on "reasonable good-faith efforts" to maintain compliance. It is almost always intentional license violations that persist after the violations are known that are penalized and very rarely an inadvertent violation that is fixed quickly after discovery. (Note that while IANAL, the company I worked for employed several and they emphasized that it was clear ethical lapses rather than honest mistakes that actually exposed the company to legal risk.)
tl;dr - There is still significant value for this effort to companies leveraging conda-forge packages
We have discussed in a few places over the years how to handle users' preferences on licenses in their installed environment, though we often lacked the tooling and sophistication as an ecosystem to handle this. However, I think this has changed recently, and I would like to present a proposal for discussion about how we might do this.
Now that we make heavy usage of SPDX, one way we might solve this is to create some _license_* packages (like _license_MIT or _license_GPL-2.0-or-later). These could live as split packages of a single _license feedstock.
When we go through and hot-fix the repodata to translate license to license_family entries, we could also add a run dependency to each package with the appropriate _license_* dependency. One thing to figure out here would be how we handle custom licenses. Maybe initially we just skip them, but it would be good to have a plan for them (do they need to be added to _license_*? could we have _license_other?)
Packages that have particular features that would result in license changes could split these out as variants and simply set the license metadata based on how this would affect the end package license. It would be up to package maintainers to make sensible choices of features for users and/or PRs from users within this framework.
Users then could add _license_*s they don't want to Conda's disallowed_packages. This way users won't get packages with those licenses they don't want. Also this could potentially benefit from other Conda solves that use more acceptable licenses.
Would be curious to hear others' thoughts on this proposal.