GLP-2: License is not an SPDX identifier

nick-youngblut commented 3 years ago

For packages with a GPL-2 license, conda_r_skeleton_helper generates a meta.yaml with license: GPL-2, but this fails the conda-forge linter with the error:

License is not an SPDX identifier (or a custom LicenseRef) nor an SPDX license expression.

A simple change to GPL-2.0-or-later instead of GPL-2 should be all that is needed.

bgruening commented 3 years ago

@nick-youngblut nice catch! Are you going to do a PR?

nick-youngblut commented 3 years ago

Sure. I can look into the code and make the change.

nick-youngblut commented 3 years ago

@bgruening should I add a check that the license: is one of the valid licenses:

Apache-2.0
Apache-2.0 WITH LLVM-exception
BSD-3-Clause
BSD-3-Clause OR MIT
GPL-2.0-or-later
LGPL-2.0-only OR GPL-2.0-only
LicenseRef-HDF5
MIT
MIT AND BSD-2-Clause
PSF-2.0

... or just include a regex to change license: GPL-2 to license: GPL-2.0-or-later?
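
A minimal sketch of the two options, not the code from the helper itself. The function names `fix_gpl2` and `license_is_valid` are made up for illustration, and `VALID_LICENSES` only holds the examples listed above rather than the full SPDX license list:

```python
import re

# SPDX identifiers/expressions listed in the comment above.
# Assumption: this is only a sample, not the complete SPDX list.
VALID_LICENSES = {
    "Apache-2.0",
    "Apache-2.0 WITH LLVM-exception",
    "BSD-3-Clause",
    "BSD-3-Clause OR MIT",
    "GPL-2.0-or-later",
    "LGPL-2.0-only OR GPL-2.0-only",
    "LicenseRef-HDF5",
    "MIT",
    "MIT AND BSD-2-Clause",
    "PSF-2.0",
}

# Matches a `license:` line whose value is exactly GPL-2.
GPL2_LINE = re.compile(r"^([ \t]*license:[ \t]*)GPL-2[ \t]*$", re.MULTILINE)

# Captures the value of the first `license:` line.
LICENSE_LINE = re.compile(r"^[ \t]*license:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def fix_gpl2(meta_text):
    """Rewrite `license: GPL-2` as `license: GPL-2.0-or-later`."""
    return GPL2_LINE.sub(r"\1GPL-2.0-or-later", meta_text)


def license_is_valid(meta_text):
    """Return True if the first `license:` value is in VALID_LICENSES."""
    match = LICENSE_LINE.search(meta_text)
    return match is not None and match.group(1) in VALID_LICENSES
```

The regex handles the one known bad value, while the check would flag any other license string that the linter is likely to reject.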

bgruening commented 3 years ago

I think a check is a nice idea! Thanks!

nick-youngblut commented 3 years ago

OK. I've added the GPL-2 replacement regex and an SPDX license check for both run.py and run.R: https://github.com/bgruening/conda_r_skeleton_helper/pull/49

bgruening commented 3 years ago

Thanks!

jdblischak commented 3 years ago

@nick-youngblut Thanks for identifying and fixing this issue!

I think this is a good reminder though to bring up a subject we've discussed in the past: this helper script was supposed to be temporary. Ideally most of this should be upstreamed, especially the license stuff. Using regexes to edit unstructured text is so fragile. conda-forge has proved itself to be an important part of the scientific computing ecosystem. Couldn't we get a student, e.g. Google Summer of Code, to implement some flags for the skeleton script like --no-comments, --use-spdx, etc.?

nick-youngblut commented 3 years ago

Given the rigidity of CRAN for R packages, I'm surprised that one cannot create a direct conversion of all CRAN packages to conda recipes. Maybe the underlying non-R dependencies make this unrealistic, except maybe with a pre-trained deep learning language model trained on all existing CRAN => conda_recipe data.

jdblischak commented 3 years ago

Maybe the underlying non-R dependencies make this unrealistic, except maybe with a pre-trained deep learning language model trained on all existing CRAN => conda_recipe data.

It can get tricky, especially when dealing with compiled code. There are resources like remotes::system_requirements() that provide more structured data on the non-R dependencies, so a mapping of these to available conda recipes would be useful.

Though just to be clear, I'm not advocating that we attempt to get all of CRAN onto conda-forge. I'm happy with the current demand-driven model where users only add the packages that they need.

nick-youngblut commented 3 years ago

Though just to be clear, I'm not advocating that we attempt to get all of CRAN onto conda-forge. I'm happy with the current demand-driven model where users only add the packages that they need.

I'm guessing this is why many people stick with CRAN for R package installation instead of using conda. If all pypi and CRAN packages were available via conda, then conda could become a universal software manager for data scientists that utilize R, python, and various command line software. The current unnecessary divide in python and R package development and management (e.g., CRAN and pypi/conda, or Jupyter vs RStudio) generates unnecessary hurdles for data scientists that just want to use the best tool for the job, regardless of what language it was written in.

jdblischak commented 3 years ago

If all pypi and CRAN packages were available via conda, then conda could become a universal software manager for data scientists that utilize R, python, and various command line software

@nick-youngblut You make a very compelling argument!

One issue is that we know we can't get to 100%, especially on Windows. We already have many examples where we can't get the build to work on Windows because of missing system dependencies. To make it universal, we'd need to use a trick like the bspm package does, where it overrides the default install.packages() function. This allows it to install Debian/Ubuntu binaries when available, and otherwise install directly from CRAN.

nick-youngblut commented 3 years ago

One issue is that we know we can't get to 100%, especially on Windows.

Given that most data science is done on a linux/unix OS, and users that only have access to a Windows machine can use linux via a VM, dual boot, or free/cheap cloud-based services, why must windows be supported?

dbast commented 3 years ago

The right place to fix the license mapping is a PR against the license code block starting here https://github.com/conda/conda-build/blob/3b99b2222a067e113a2282926871cd1e5406ee2b/conda_build/skeletons/cran.py#L1521

jdblischak commented 3 years ago

why must windows be supported?

@nick-youngblut I don't recommend Windows for scientific computing, but from a purely practical standpoint, many people start their programming journey on Windows. I certainly did. If the goal is to be viewed as "universal", I think we should try to support Windows as much as is reasonable. Though we are kind of getting away from my point: we're never going to get 100% of CRAN packages converted to conda (at least not on a volunteer basis). While Windows is the most problematic, there are also missing dependencies on macOS, so having a convenient way to fall back to CRAN would be ideal.

The right place to fix the license mapping is a PR against the license code block starting here

@dbast Thanks for the pointer! I recognize that code 😄 Though I think implementing SPDX identifiers is going to be more of a social issue than a technical one. The potential use of SPDX for licenses has been discussed at least as far back as 2017, e.g. https://github.com/conda/conda/issues/5280, and it's never been implemented. The new grayskull replacement supports SPDX, but it's unclear when R support is going to be added. Maybe we could add a flag, e.g. --use-spdx, to the existing cran skeleton to allow the optional use of SPDX identifiers?

nick-youngblut commented 3 years ago

I don't recommend Windows for scientific computing, but from a purely practical standpoint, many people start their programming journey on Windows

I get your point, but in this new age of free/cheap cloud computing, does anyone have to start their programming journey on their own machine? One could argue that various cloud-based services make it much easier for new programmers to get started.

Though we are kind of getting away from my point: we're never going to get 100% of CRAN packages converted to conda (at least not on a volunteer basis)

I agree, without a very sophisticated automated method, which is likely too complex to attempt right now, given the current state of AI (e.g., massive amounts of training required, which still doesn't result in logical reasoning).

dbast commented 3 years ago

@jdblischak Why do you think changing the mapping is a social topic? conda-forge is anyway happy with spdx and I don't think anybody else would have objections ...

@nick-youngblut one of the reasons why conda is popular is the fact that it can create consistent environments in userspace (without admin rights) across operating systems... you can't imagine how many users in enterprise / bank companies are stuck with windows...

@all there is a conda-build PR that does mapping for system requirements... https://github.com/conda/conda-build/pull/3826 that enables large build outs of cran packages with very little intervention. Maybe somebody can find the time to finish / rebase it.

jdblischak commented 3 years ago

Why do you think changing the mapping is a social topic?

@dbast Because the majority of the discussion in that Issue I linked to was about consensus, backwards compatibility, and possibly creating a separate license field for the SPDX identifier. And the fact that it's been 4 years and nothing has been implemented.

conda-forge is anyway happy with spdx and I don't think anybody else would have objections ...

I agree conda-forge has standardized on it, but we're not the only users of the conda-build skeletons (nor do we have write access to the conda-build repo AFAIK). That's why I suggested a flag like --use-spdx. That would allow conda-forge users to use the SPDX identifiers without breaking backwards compatibility.
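
A rough sketch of how such an opt-in flag could behave, assuming a made-up command-line entry point rather than the actual conda-build skeleton code. `CRAN_TO_SPDX`, `build_parser`, and `render_license` are hypothetical names, and the single mapping entry is just the GPL-2 case from earlier in this thread:

```python
import argparse

# Hypothetical mapping from license strings to SPDX identifiers.
# Only the GPL-2 entry comes from this thread; a real table would need more.
CRAN_TO_SPDX = {
    "GPL-2": "GPL-2.0-or-later",
}


def build_parser():
    """Build a toy CLI; this is not the conda skeleton's real interface."""
    parser = argparse.ArgumentParser(prog="skeleton-sketch")
    parser.add_argument("license", help="license string found for the package")
    parser.add_argument(
        "--use-spdx",
        action="store_true",
        help="emit an SPDX identifier when a mapping is known",
    )
    return parser


def render_license(raw, use_spdx):
    """Return the value to write after `license:` in meta.yaml."""
    if use_spdx:
        # Fall back to the raw string when no mapping is known.
        return CRAN_TO_SPDX.get(raw, raw)
    return raw


# Example invocation with the flag switched on.
args = build_parser().parse_args(["GPL-2", "--use-spdx"])
print("license: " + render_license(args.license, args.use_spdx))
```

Without the flag the output is unchanged, which is what keeps it backwards compatible for the other users of the skeletons.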

you can't imagine how many users in enterprise / bank companies are stuck with windows...

I've also found myself in the situation of a locked-down Windows machine (fortunately only temporarily until I was given admin rights), and I was very grateful to be able to quickly bootstrap a working data science environment with conda.

there is a conda-build PR that does mapping for system requirements

Very cool! Thanks for bringing this to our attention. From skimming the code, it seems like it parses the SystemRequirements field, and then looks it up in a dependency mapping file. Is the idea to translate something like the existing sysreqsdb to conda packages?
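
To make sure I understand the approach, here is a minimal sketch of that parse-then-look-up idea. It is not taken from the PR: the mapping entries, `parse_system_requirements`, and `map_to_conda` are all hypothetical, and a real SystemRequirements field is free text that will often be messier than this comma-separated case:

```python
import re

# Hypothetical mapping from SystemRequirements tokens to conda package names.
# Illustrative only; not taken from the conda-build PR or from sysreqsdb.
SYSREQ_TO_CONDA = {
    "libxml2": "libxml2",
    "openssl": "openssl",
    "gnu make": "make",
}


def parse_system_requirements(field):
    """Split a SystemRequirements value into normalized requirement names."""
    names = []
    for item in field.split(","):
        # Drop version qualifiers such as "(>= 2.6.3)" and normalize case.
        name = re.sub(r"\(.*?\)", "", item).strip().lower()
        if name:
            names.append(name)
    return names


def map_to_conda(field):
    """Return (conda packages found, requirement names with no mapping)."""
    found, unknown = [], []
    for name in parse_system_requirements(field):
        if name in SYSREQ_TO_CONDA:
            found.append(SYSREQ_TO_CONDA[name])
        else:
            unknown.append(name)
    return found, unknown
```

Returning the unmapped names separately would let the skeleton flag recipes that still need manual attention.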

dbast commented 3 years ago

@jdblischak Times have changed ... if you look at feedstocks at https://github.com/AnacondaRecipes you can see that lots/most of them have spdx license strings specified for license: .... As both sides, the community and the defaults recipes, now use spdx, it makes sense that the skeletons receive some updates.

Yes, something like sysreqsdb ... (it would be interesting to extend sysreqsdb to conda and use it inside the skeleton) ... a large build out of cran would then also mean aggregating multiple recipes in one repo, as done by https://github.com/AnacondaRecipes/aggregateR or bioconda... otherwise conda-forge ends up with >10k new feedstock repos.

This can be all done step by step.. the ideas, concepts and unfinished code already exist to be picked up.

jdblischak commented 3 years ago

Times have changed

@dbast Agreed! And not only the licenses. I see you are now at Anaconda. Congrats on the new job!

otherwise conda-forge ends up with >10k new feedstock repos.

I would love it if we could have fewer repos. My inbox is inundated with conda-forge notifications, and I find it overwhelming. (I haven't had much luck adjusting my GitHub email notification settings. If anyone knows of a way to receive notifications for direct mentions and not team mentions, I would love to know how to do this.)

This can be all done step by step.. the ideas, concepts and unfinished code already exist to be picked up.

I'm inspired! Though I think these efforts would need some central coordination, especially if we want to move all the R packages into a single repo instead of individual feedstocks. Do you have the bandwidth to coordinate this? Maybe we start a discussion at https://github.com/orgs/conda-forge/teams/r to gauge interest and availability?

I had written off the existing cran skeleton since it was my understanding that grayskull was the future, e.g. https://github.com/conda-incubator/grayskull/issues/7. But grayskull is still pre-1.0, and I see that CRAN support was added as a milestone for version 2.0: https://github.com/conda-incubator/grayskull/milestone/2 Thus it seems like it still makes sense to continue investing in improvements to the existing skeleton.

dbast commented 3 years ago

@jdblischak Thanks! I am happy to help here with coordination. Let's continue the discussion at https://github.com/orgs/conda-forge/teams/r

bgruening / conda_r_skeleton_helper

GLP-2: License is not an SPDX identifier #48