conda-forge / conda-forge.github.io

The conda-forge website.
https://conda-forge.org
BSD 3-Clause "New" or "Revised" License

Providing user's license choice #1608

Open jakirkham opened 2 years ago

jakirkham commented 2 years ago

We have discussed in a few places over the years how to handle users' preferences on licenses in their installed environments, though we often lacked the tooling and sophistication as an ecosystem to handle this. However, I think this has changed recently, and I would like to present a proposal for discussion about how we might do this.

Now that we make heavy use of SPDX, one way we might solve this is to create some _license_* packages (like _license_MIT or _license_GPL-2.0-or-later). These could live as split packages of a single _license feedstock.

When we go through and hot-fix the repodata to translate license to license_family entries, we could also add a run dependency to each package with the appropriate _license_* dependency. One thing to figure out here would be how we handle custom licenses. Maybe initially we just skip them, but it would be good to have a plan for them (do they need to be added to _license_*? could we have _license_other?)
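As a rough illustration of the hot-fix step described above, here is a minimal sketch, not the actual conda-forge repodata patching code. It assumes a record is a plain dict with `license` and `depends` keys; the function name `add_license_dependency` and the `other` fallback are hypothetical.

```python
import copy


def add_license_dependency(record, prefix="_license_", fallback="other"):
    """Return a copy of a repodata-style record with a license marker dependency.

    Hypothetical helper: the record layout (dict with "license" and "depends")
    is an assumption made for this sketch.
    """
    patched = copy.deepcopy(record)
    license_id = (record.get("license") or "").strip()
    # Anything that is not a single plain SPDX identifier falls back to "other":
    # missing values, custom LicenseRef-* entries, and compound expressions.
    is_simple = bool(
        license_id
        and " " not in license_id
        and not license_id.startswith("LicenseRef")
    )
    marker = prefix + (license_id if is_simple else fallback)
    depends = patched.setdefault("depends", [])
    if marker not in depends:
        depends.append(marker)
    return patched


record = {"name": "foo", "license": "MIT", "depends": ["python"]}
print(add_license_dependency(record)["depends"])
```

The input record is left untouched, so the patch can be applied (or skipped) per package without side effects.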

Packages that have particular features that would result in license changes, could split these out as variants and simply set the license metadata based on how this would affect the end package license. It would be up to package maintainers to make sensible choices of features for users and/or PRs from users within this framework.

Users then could add _license_*s they don't want to Conda's disallowed_packages. This way users won't get packages with those licenses they don't want. Also this could potentially benefit from other Conda solves that use more acceptable licenses.

Would be curious to hear others' thoughts on this proposal 🙂

hmaarrfk commented 2 years ago

One challenge I see is with applications like git (GPL) that might need to be installed alongside a package whose users don't want GPL code in their own library but do want to call git from a shell.

It would be good to be able to specify that one application tree can be GPL while another application tree must be non-GPL within the same environment.

However, I totally understand how complicated that seems, and it might just be simpler to ask those users to create a separate environment for git.

jakirkham commented 2 years ago

Conda does allow environment stacking. Maybe this is one way to go about this?

h-vetinari commented 2 years ago

The general approach seems surprisingly straightforward (didn't know about disallowed_packages) 👍

Still, I'd advocate for minimum viable complexity here.

> When we go through and hot-fix the repodata to translate license to license_family entries, [...]

Where do the license_families come in in your proposal? A distinction like _license_GPL-2.0-or-later would already be more granular than the license family. On the other hand, there are many custom licenses that fall under the BSD-family. Perhaps having only license families would be a decent trade-off? I.e. no GPL-version selector, but also a much more manageable list of _license_* packages (not just for maintainers, but also users).

> One thing to figure out here would be how we handle custom licenses.

SPDX covers a very broad range of licenses already; in almost no case did I fail to find what I was looking for. I think leaving "OTHER" would be fine in the beginning. If someone doesn't like it, they can improve the feedstock metadata.

> Also this could potentially benefit from other Conda solves that use more acceptable licenses.

What do you have in mind here? Talking about solving - usable solver errors should be a goal.

jakirkham commented 2 years ago

Sorry, license_family is a bit of a distraction. I was just trying to point out that we already do hot-fixing with license as input to that process (the output being the license_family field). This would just be a different output of that hot-fixing process. Anyways agree just using license should be sufficient.

Yeah we do have a few cases of custom licenses like this in conda-forge. Would search for LicenseRef and they will come up.

Right the idea would be if a user excludes all GPLs and tries to install a package that comes in GPL and non-GPL flavors, the non-GPL package gets installed.
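To make the intended outcome concrete, here is a toy sketch of that selection, assuming each variant is a dict with a `depends` list that already carries a `_license_*` marker. It is not how Conda's solver works internally; `pick_allowed_variant` and the build strings are hypothetical, and only direct dependencies are inspected.

```python
def pick_allowed_variant(variants, disallowed):
    """Return the first variant with no direct dependency in `disallowed`.

    Toy model only: a real solver would also consider transitive dependencies.
    """
    for variant in variants:
        # Compare bare package names; a spec such as "python >=3.8" names "python".
        names = {spec.split()[0] for spec in variant["depends"] if spec.strip()}
        if names.isdisjoint(disallowed):
            return variant
    return None


variants = [
    {"build": "gpl_0", "depends": ["libfoo", "_license_GPL-2.0-or-later"]},
    {"build": "lgpl_0", "depends": ["libfoo", "_license_LGPL-2.1-or-later"]},
]
disallowed = {"_license_GPL-2.0-or-later", "_license_GPL-3.0-or-later"}
chosen = pick_allowed_variant(variants, disallowed)
print(chosen["build"] if chosen else "no acceptable variant")
```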

h-vetinari commented 2 years ago

> Anyways agree just using license should be sufficient

And I was contemplating if license_family would be a better target. Surely someone will interject that they want the GPL v2-vs-v3 split reflected, but my point was that it's a bit of a numbers game. If we had ~20 license packages based on the families, or ~200 based on the individual licenses, I know which one I'd prefer (better numbers needed, I made those up).

> Yeah we do have a few cases of custom licenses like this in conda-forge. Would search for LicenseRef and they will come up.

Pillow would actually already be a good demonstration here. Its dependencies already have 5 custom licenses (one of which is classed as BSD).

> Right the idea would be if a user excludes all GPLs and tries to install a package that comes in GPL and non-GPL flavors, the non-GPL package gets installed.

I get the mechanics. I'm wondering if the solver errors would be decipherable by an end-user (because the solver error doesn't always point to the right source of conflict).

jakirkham commented 2 years ago

So I tried a little experiment. It seems disallowed_packages is not quite as built-out as we might hope. I should add that both Conda and Mamba behave the same.

Anyways, I made the following change to ~/.condarc:

disallowed_packages:
  - readline

Then tried to install ipython (note readline is a dependency of python):

```
$ conda create -n tst ipython -y
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/jkirkham/miniconda/envs/tst

  added / updated specs:
    - ipython


The following NEW packages will be INSTALLED:

  appnope            conda-forge/osx-64::appnope-0.1.2-py310h2ec42d9_2
  asttokens          conda-forge/noarch::asttokens-2.0.5-pyhd8ed1ab_0
  backcall           conda-forge/noarch::backcall-0.2.0-pyh9f0ad1d_0
  backports          conda-forge/noarch::backports-1.0-py_2
  backports.functoo~ conda-forge/noarch::backports.functools_lru_cache-1.6.4-pyhd8ed1ab_0
  black              conda-forge/noarch::black-22.1.0-pyhd8ed1ab_0
  bzip2              conda-forge/osx-64::bzip2-1.0.8-h0d85af4_4
  ca-certificates    conda-forge/osx-64::ca-certificates-2021.10.8-h033912b_0
  click              conda-forge/osx-64::click-8.0.3-py310h2ec42d9_1
  dataclasses        conda-forge/noarch::dataclasses-0.8-pyhc8e2a94_3
  decorator          conda-forge/noarch::decorator-5.1.1-pyhd8ed1ab_0
  executing          conda-forge/noarch::executing-0.8.2-pyhd8ed1ab_0
  ipython            conda-forge/osx-64::ipython-8.0.1-py310h2ec42d9_0
  jedi               conda-forge/osx-64::jedi-0.18.1-py310h2ec42d9_0
  libffi             conda-forge/osx-64::libffi-3.4.2-h0d85af4_5
  libzlib            conda-forge/osx-64::libzlib-1.2.11-h9173be1_1013
  matplotlib-inline  conda-forge/noarch::matplotlib-inline-0.1.3-pyhd8ed1ab_0
  mypy_extensions    conda-forge/osx-64::mypy_extensions-0.4.3-py310h2ec42d9_4
  ncurses            conda-forge/osx-64::ncurses-6.3-he49afe7_0
  openssl            conda-forge/osx-64::openssl-3.0.0-h0d85af4_2
  parso              conda-forge/noarch::parso-0.8.3-pyhd8ed1ab_0
  pathspec           conda-forge/noarch::pathspec-0.9.0-pyhd8ed1ab_0
  pexpect            conda-forge/noarch::pexpect-4.8.0-pyh9f0ad1d_2
  pickleshare        conda-forge/noarch::pickleshare-0.7.5-py_1003
  pip                conda-forge/noarch::pip-22.0.3-pyhd8ed1ab_0
  platformdirs       conda-forge/noarch::platformdirs-2.5.0-pyhd8ed1ab_0
  prompt-toolkit     conda-forge/noarch::prompt-toolkit-3.0.27-pyha770c72_0
  ptyprocess         conda-forge/noarch::ptyprocess-0.7.0-pyhd3deb0d_0
  pure_eval          conda-forge/noarch::pure_eval-0.2.2-pyhd8ed1ab_0
  pygments           conda-forge/noarch::pygments-2.11.2-pyhd8ed1ab_0
  python             conda-forge/osx-64::python-3.10.2-hea1dfa3_3_cpython
  python_abi         conda-forge/osx-64::python_abi-3.10-2_cp310
  readline           conda-forge/osx-64::readline-8.1-h05e3726_0
  setuptools         conda-forge/osx-64::setuptools-60.8.2-py310h2ec42d9_0
  six                conda-forge/noarch::six-1.16.0-pyh6c4a22f_0
  sqlite             conda-forge/osx-64::sqlite-3.37.0-h23a322b_0
  stack_data         conda-forge/noarch::stack_data-0.1.4-pyhd8ed1ab_0
  tk                 conda-forge/osx-64::tk-8.6.11-h5dbffcc_1
  tomli              conda-forge/noarch::tomli-2.0.1-pyhd8ed1ab_0
  traitlets          conda-forge/noarch::traitlets-5.1.1-pyhd8ed1ab_0
  typed-ast          conda-forge/osx-64::typed-ast-1.5.2-py310he24745e_0
  typing_extensions  conda-forge/noarch::typing_extensions-4.0.1-pyha770c72_0
  tzdata             conda-forge/noarch::tzdata-2021e-he74cb21_0
  wcwidth            conda-forge/noarch::wcwidth-0.2.5-pyh9f0ad1d_2
  wheel              conda-forge/noarch::wheel-0.37.1-pyhd8ed1ab_0
  xz                 conda-forge/osx-64::xz-5.2.5-haf1e3a3_1
  zlib               conda-forge/osx-64::zlib-1.2.11-h9173be1_1013


Preparing transaction: done
Verifying transaction: failed

DisallowedPackageError: The package 'conda-forge/osx-64::readline-8.1-h05e3726_0' is disallowed by configuration.
See 'conda config --show disallowed_packages'.
```

In short, the message is clear, but disallowed_packages is not considered when solving an environment (only when executing the install). So there is some work needed to get that functionality. The issue https://github.com/conda/conda/issues/7526 appears to be relevant.

hmaarrfk commented 2 years ago

Another important package for the ecosystem is ffmpeg; many Python programs call it from the command line. Do you suggest they use stacked environments too?

hmaarrfk commented 2 years ago

I'm just trying to get a handle on a full workflow.

jakirkham commented 2 years ago

Maybe? Really just thinking through options here.

If the usage model is that a user has a collection of things that are CLI-only, maybe putting them in another environment that gets stacked on top is reasonable.

One advantage of this is that one can easily separate out the CLI-only bits when building new stuff, to avoid accidentally linking libraries from there.

BastianZim commented 2 years ago

Two cents from someone who works in a corporate environment and is also a maintainer.

Big +1 from my side, as this approach doesn't seem to require any involvement from maintainers or users who don't use it, which means it can be rolled out silently and with full backward compatibility.

Regarding the solver error etc: Since using this will require active participation from the corporations using this (just using corporations as a blanket term here) I think it's fair to ask them to go to the docs etc. to figure out what exactly the error means, how this can be stacked etc. Of course, it would be nice to have a more intuitive approach but this seems to already do the trick at least most of the time so I wouldn't throw out the baby with the bathwater and not implement anything.

Re what to use: Since SPDX has everything in a GitHub repo (https://github.com/spdx/license-list-data), it should be simple enough to have a script that runs every so often and converts every single license into a package. Since we already have a license-to-license_family mapping, it also seems possible to have the script generate those automatically. Then everyone can decide for themselves which key they prefer. For the custom licenses, it might be interesting to see if we can generate a list of them using CF tools somehow; otherwise, I would go for other right now or manually create a list for the most common ones.
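A minimal sketch of such a conversion script could look like the following. It assumes the license list has already been loaded as a dict shaped like `json/licenses.json` in that repo, with a `licenses` array whose entries carry `licenseId` and `isDeprecatedLicenseId`; treat those field names, the function name, and the inline sample data as assumptions for illustration.

```python
def license_package_names(license_list, prefix="_license_", include_deprecated=False):
    """Map SPDX license identifiers to hypothetical marker package names."""
    names = []
    for entry in license_list["licenses"]:
        # Skip identifiers SPDX has deprecated unless explicitly requested.
        if entry.get("isDeprecatedLicenseId", False) and not include_deprecated:
            continue
        names.append(prefix + entry["licenseId"])
    return sorted(names)


# Tiny inline sample standing in for the downloaded license list.
sample = {
    "licenses": [
        {"licenseId": "MIT", "isDeprecatedLicenseId": False},
        {"licenseId": "GPL-2.0-or-later", "isDeprecatedLicenseId": False},
        {"licenseId": "GPL-2.0+", "isDeprecatedLicenseId": True},
    ]
}
print(license_package_names(sample))
```

Custom `LicenseRef-*` licenses do not appear in the SPDX list, so they would still need the catch-all discussed above.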

jakirkham commented 2 years ago

Thanks Bastian! 😄 It's helpful to have that perspective from companies looking for this feature.

Glad to hear this is still useful even with some unfinished edges. Admittedly those could be finished with a bit of effort if it is deemed important.

Good point on providing license and license_family as options. Maybe the latter can be named _license_family_*?

Yeah, there is probably a way to extract info about custom licenses. If a catch-all (like other) is good enough, maybe we can start with that.

jakirkham commented 2 years ago

cc @conda-forge/core (in case others have thoughts on this 🙂)

BastianZim commented 2 years ago

> Good point on providing license and license_family as options. Maybe the latter can be named _license_family_*?

+1 Would probably make the most sense.

Also linking this gitter discussion from today as it seems like there is a real use case for a package that can provide licenses: https://gitter.im/conda-forge/conda-forge.github.io?at=620e828a6e4c1e1c846ccd54 Should also be easy to implement, the only thing to check is the license of the license-shorthands but that seems to be CC0-1.0

beckermr commented 2 years ago

I want to weigh in here on a core underlying concern I have.

Our license metadata is at best incomplete and at worst downright misleading. We likely have packages that are mislabeled and others that are marked as say BSD-3-Clause but have additional non-commercial clauses or the like attached to them.

As an organization, we cannot provide any assurances whatsoever that any of these license metadata fields, tags, metapackages, etc. produce environments that actually have only the associated licenses.

Given that the context here is corporate users with license restrictions, I very much doubt any of the proposed things here would actually be suitable at the required standard.

hmaarrfk commented 2 years ago

I don't think of this so much as an assurance, but more of giving people a tool to follow their interpretation of the different copyright stuff.

Things labeled as BSD but have non commercial clauses should not be labelled BSD....

TBH It gives people with full time jobs more justification to maintain packages beyond those strictly necessary on conda forge.

ocefpaf commented 2 years ago

> but more of giving people a tool to follow their interpretation of the different copyright stuff.

Sure. This will help out folks who need to inspect and select licenses. I do believe this is useful. My only concern is to add a proper message/disclaimer stating that we are not responsible for the metadata and not liable for the information there.

beckermr commented 2 years ago

> Things labeled as BSD but have non commercial clauses should not be labelled BSD....

Of course they shouldn't be. That's not the issue. The issue is that we have 15k feedstocks with even more outputs and we cannot promise that things are correct.

dopplershift commented 2 years ago

I agree with @ocefpaf that there is utility in having the functionality, even if it's not perfect, but it's important to have the disclaimer that we make no guarantees as to its accuracy for legal purposes.

Another fun corner case would be ones with dual licensing of both the AND and OR variety.
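To illustrate why compound expressions are a corner case, here is a small sketch that checks an SPDX-style expression against a set of disallowed licenses. It assumes `OR` means the recipient may pick either license, `AND` means all listed licenses apply, and `WITH` exceptions can be ignored for filtering purposes; those semantics and the `is_acceptable` helper are assumptions for this sketch, not legal guidance or an existing conda feature.

```python
import re

# Tokens are parentheses or runs of non-space, non-parenthesis characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def is_acceptable(expression, disallowed):
    """True if the expression can be satisfied without any disallowed license."""
    tokens = TOKEN.findall(expression)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():
        ok = parse_and()
        while peek() == "OR":
            take()
            # Parse the right side first so its tokens are always consumed.
            right = parse_and()
            ok = ok or right
        return ok

    def parse_and():
        ok = parse_atom()
        while peek() == "AND":
            take()
            right = parse_atom()
            ok = ok and right
        return ok

    def parse_atom():
        token = take()
        if token == "(":
            ok = parse_or()
            take()  # closing ")"
            return ok
        if peek() == "WITH":
            take()  # "WITH"
            take()  # exception identifier, ignored in this sketch
        return token not in disallowed

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]!r}")
    return result


gpl = {"GPL-2.0-or-later", "GPL-3.0-only"}
print(is_acceptable("MIT OR GPL-2.0-or-later", gpl))
print(is_acceptable("MIT AND GPL-2.0-or-later", gpl))
```

Under these assumptions a single `_license_*` marker per package is not enough for compound expressions: an `OR` package is acceptable if any branch is, while an `AND` package is blocked if any branch is.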

jakirkham commented 2 years ago

> TBH It gives people with full time jobs more justification to maintain packages beyond those strictly necessary on conda forge.

I want to make sure this point doesn't get lost in the (justifiable) skepticism about us pulling this off. The goal here is to make it easier for other participants to get involved in the ecosystem. The inability to constrain licensing seems to be one of the obstacles, and it has come up in discussions with others before (for example, https://github.com/conda-forge/conda-forge.github.io/issues/209).

Agree adding a disclaimer is a good idea.

BastianZim commented 2 years ago

One other thing from my side about the validity.

I don't think (or hope) that anyone would take the license information supplied by CF as legal advice so I don't see a risk there. But a prominent disclaimer would probably be beneficial just in case someone decides to blame CF (IANAL though).

What this would make easier though is the "preselection" of packages.

For example: if I want to install a couple of hundred packages, legal will need to check all of them, and then I will need to find alternatives for the incorrectly licensed ones, which could be up to 100%.

If I can specify a license and download only the appropriate ones, I would assume that maybe at least 60%-70% are packages that already have the correct licenses, and I only need to find substitutes for the remaining 30% or so (where legal found out that the CF-supplied license is incorrect). Of course, there is still work involved in double-checking the licenses and finding substitutes for incorrectly labelled packages, but the number of substitutes I need to find decreases (hopefully) significantly.

ocefpaf commented 2 years ago

> What this would make easier though is the "preselection" of packages.

Indeed. It is a first filter that will help users.

> I don't think (or hope) that anyone would take the license information supplied by CF as legal advice

That is the main issue IMO. We cannot hope; we need to be sure that won't happen and protect ourselves. A disclaimer is probably enough though, but I'm also not a lawyer.

swails commented 2 years ago

> Our license metadata is at best incomplete and at worst downright misleading. We likely have packages that are mislabeled and others that are marked as say BSD-3-Clause but have additional non-commercial clauses or the like attached to them.

The packages that I've spot-checked seem pretty accurate in their reported licenses (not many, admittedly, and those focused primarily on popular, common projects). I suspect that the majority of downloads correspond to accurate metadata wrt licenses (if only because the majority of downloads target a correspondingly small number of packages that are well-known, widely used, and therefore more carefully vetted).

But more importantly, I think that even the heavily-legalesed corporate world puts a lot of emphasis on "reasonable good-faith efforts" to maintain compliance. It is almost always intentional license violations that persist after the violations are known that are penalized and very rarely an inadvertent violation that is fixed quickly after discovery. (Note that while IANAL, the company I worked for employed several and they emphasized that it was clear ethical lapses rather than honest mistakes that actually exposed the company to legal risk.)


tl;dr - There is still significant value for this effort to companies leveraging conda-forge packages