conda-forge / conda-forge.github.io

The conda-forge website.
https://conda-forge.org

Wrapping Previously Installed Libraries by a Conda Package #1706

Open phreed opened 2 years ago

phreed commented 2 years ago

Discussion Topic: Package type between CDT and Conventional

How do I develop a package which wraps a library that is not installed by conda? Examples of such libraries are those whose distribution is limited or restricted, e.g. the Oracle JDK discussed below.

Context: We write open-source plugins for various proprietary software, mostly modeling tools (UML, SysML, etc.). While developing and testing these plugins we need to cycle through all the versions of those tools and their libraries. The binaries for these libraries are not freely distributed; access to them is not a simple download from a URL. I can make conda-forge recipes for these libraries, but the source: url: entry is problematic.

Case Study: An example is the oracle-jdk; the PR for an oracle-jdk package illustrates the point: the contributor eventually abandoned it. There is another oracle-jdk package which seems to work on Linux.

Goal: Develop a consistent way of handling packages which wrap libraries which are not installed using conda.

Ref:

phreed commented 2 years ago

The current approach is to use some combination of post-link, pre-unlink, activate, and deactivate scripts. (Generally, pre-link scripts are discouraged and are not supported in mamba.) There are some issues with this approach.

The main issue is that any artifacts (files, hardlinks, symlinks, etc.) produced by these scripts are not recorded in the package's meta-data file, e.g. miniconda3/env/conda-meta/foo-0.1.0-1.json. If the artifacts are known at packaging time they can be coded in the pre-unlink script, but if they are discovered at install/link time then they need to be saved somewhere.

Another issue is the setting of environment variables. This is partially handled via requirements: run: []. The problem is that those environment variables are defined at packaging time. Again, any previously defined variables need to be saved in the activate script and restored in deactivate.
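A minimal sketch of that save/restore pairing on the Unix side (the JAVA_HOME value, file names, and saved-variable name are illustrative, not from any existing recipe):

# etc/conda/activate.d/foo-activate.sh (sketch)
# Remember any pre-existing JAVA_HOME so deactivate can put it back.
if [ -n "${JAVA_HOME:-}" ]; then
    export _CONDA_FOO_SAVED_JAVA_HOME="${JAVA_HOME}"
fi
export JAVA_HOME="/usr/java/jdk1.8.0_301"    # value discovered or configured elsewhere

# etc/conda/deactivate.d/foo-deactivate.sh (sketch)
if [ -n "${_CONDA_FOO_SAVED_JAVA_HOME:-}" ]; then
    export JAVA_HOME="${_CONDA_FOO_SAVED_JAVA_HOME}"
    unset _CONDA_FOO_SAVED_JAVA_HOME
else
    unset JAVA_HOME
fi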

Ref:

phreed commented 2 years ago

Let me give a bit more information about the current approach.

The `post-link`/`pre-unlink` and `activate`/`deactivate` pairs need to share information. Consider a package named `foo`.

The post-link.bat and activate.bat scripts may write simple scripts which can be called by pre-unlink and deactivate to revert what they did. Similarly, the post-link script needs the path to the installed software being wrapped; this can be obtained via discovery or download. For each of these, script files need to be created. These files need to be placed in something like the "conda-meta" folder, but kept separate from it. To that end I created a "conda-meso" folder.

set "CONDA_MESO=%CONDA_PREFIX%\conda-meso\%PKG_NAME%-%PKG_VERSION%_%PKG_BUILDNUM%"
set "DISCOVERY_SCRIPT=%CONDA_MESO%\discovery.bat"
set "UNLINK_SCRIPT=%CONDA_MESO%\pre-unlink-aux.bat"
set "DEACTIVATE_SCRIPT=%CONDA_MESO%\deactivate-aux.bat`

n.b. similar scripts would be written by post-link.sh and activate.sh.
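A rough sketch of what the shell side might look like, assuming the same conda-meso layout (the discovery step is stubbed out; in post-link scripts conda exposes PREFIX, PKG_NAME, PKG_VERSION, and PKG_BUILDNUM):

# post-link.sh (sketch): record what was created so pre-unlink can undo it.
CONDA_MESO="${PREFIX}/conda-meso/${PKG_NAME}-${PKG_VERSION}_${PKG_BUILDNUM}"
mkdir -p "${CONDA_MESO}"

# Stubbed discovery of the externally installed software; real logic would
# search the documented install locations for the wrapped library.
JDK_HOME="$(ls -d /usr/java/jdk1.8.0_* 2>/dev/null | sort -V | tail -n 1)"
echo "JDK_HOME=${JDK_HOME}" > "${CONDA_MESO}/discovery.sh"

# Write the auxiliary script that pre-unlink.sh would call to revert everything.
cat > "${CONDA_MESO}/pre-unlink-aux.sh" <<EOF
rm -rf "${CONDA_MESO}"
EOF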

isuruf commented 2 years ago

What's the reason for having this package? How would someone consume this package? Also what's the point of this when there's OpenJDK 8 already in conda-forge?

hmaarrfk commented 2 years ago

I think it is about freedom of choice.

We let people use Nvidia stuff.

phreed commented 2 years ago

@isuruf Let me take your questions in turn.

What is the reason for having such a package? I answered this already but I will repeat it. When developing and using software it is very common to be required to integrate with other software, and often that software is encumbered with restrictive licenses.

How would you consume such a package? Before installing the package you would install the encumbered software in the legally prescribed way. Then you would install the conda package in the conventional way.

What is the point of this package when there is an alternative? First, the alternative is not identical; OpenJDK does not include JavaFX. Second, even if they were bug-for-bug compatible, the contract under which work is done frequently specifies particular variants and versions.

I hope that wasn’t overly condescending, but those are the realities of contract development. As an illustrative example consider the Linux program ‘(update-)alternatives’, which does not install any software; rather, it activates specific variants and versions. ‘conda’ could do the same in a better (cross-platform, packaged, named-env) way. The value of ‘alternatives’ is pretty clear.

phreed commented 2 years ago

Could someone change this from a question into a discussion?

Part of the discussion is about what packages should be included/excluded in conda-forge. A related part is the value of conda packages generally. It may be that conda-forge should only publish liberally licensed software.

isuruf commented 2 years ago

How would you consume such a package? Before installing the package you would install the encumbered software in the legally prescribed way. Then you would install the conda package in the conventional way.

I don't understand why you need the conda package at all.

phreed commented 2 years ago

I don't understand why you need the conda package at all

As an illustrative example consider the Linux program ‘(update-)alternatives’, which does not install any software; rather, it activates specific variants and versions. ‘conda’ could do the same in a better (cross-platform, packaged, named-env) way. The value of ‘alternatives’ is pretty clear.

isuruf commented 2 years ago

That's not an illustrative example at all. Can you please describe in detail why you need an empty Oracle JDK conda package? What would it accomplish that not having the conda package and expecting the user to install the software externally would not accomplish?

phreed commented 2 years ago

The conda package is not empty, it contains post-link, pre-unlink, activate, and deactivate scripts.

Suppose I have several different variants of Java installed on my machine (say 8, to make it concrete). I have several projects I am working on (say 12), each of which needs to be developed and tested against each of the Java variants. I have several different operating systems (say CentOS 7, CentOS 8, Ubuntu 20.04, and Windows 10, with the Linux variants running under WSL2). That three-way product can get complicated. How would you propose managing your environments to perform all that testing?

Here are some approaches.

I have been using the first two approaches. That last approach seems like it might be the right way to go.

phreed commented 2 years ago

https://linux.die.net/man/8/update-alternatives

It is possible for several programs fulfilling the same or similar functions to be installed on a single system at the same time. For example, many systems have several text editors installed at once. This gives choice to the users of a system, allowing each to use a different editor, if desired, but makes it difficult for a program to make a good choice of editor to invoke if the user has not specified a particular preference.

The point illustrated by alternatives is that package installation is not a necessary part of establishing an environment. The key capability provided by alternatives is the activation of selected variants which form that environment.

phreed commented 2 years ago

Is the purpose of conda primarily to install packages or establish named environments? (or something else?)

phreed commented 2 years ago

I just ran across conda virtual packages. I claim that there should be a class of packages that fits between typical conda packages (which both package and activate software) and virtual packages (which expose OS capabilities). They are similar to virtual packages in that they primarily provide capabilities outside of conda's control while making those capabilities known to conda.
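For context, conda already surfaces the virtual packages it detects; a quick way to see them on a given machine (output varies by system) is:

# The "virtual packages" section of the output lists entries such as __glibc, __unix, or __cuda.
conda info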

phreed commented 2 years ago

It seems the main reason to not support this capability is https://conda-forge.org/docs/maintainer/adding_pkgs.html#avoid-external-dependencies

As a general rule: all dependencies have to be packaged by conda-forge as well. This is necessary to assure ABI compatibility for all our packages.

There are only a few exceptions to this rule:

Some dependencies have to be satisfied with CDT packages (see Core Dependency Tree Packages (CDTs)).

Some packages require root access (e.g. device drivers) that cannot be distributed by conda-forge. These dependencies should be avoided whenever possible.

Am I talking about packages that would qualify as "Core Dependency Tree" packages? Or is there a third exception?

leofang commented 2 years ago

Java SDK specific discussions aside (which I have zero interest in), based on the issue title

Wrapping Previously Installed Libraries by a Conda Package

and the issue description

Develop a consistent way of handling packages which wrap libraries which are not installed using conda.

Didn't we already provide some way to meet this need? For example, there are empty packages for mpich and openmpi: https://conda-forge.org/docs/user/tipsandtricks.html#using-external-message-passing-interface-mpi-libraries
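For reference, the pattern in those docs is to request the external build of the MPI package; something along these lines (the version string here is a placeholder):

# Install the dummy/external build of mpich so downstream packages use the
# MPI implementation already present on the system.
conda install "mpich=3.4.*=external_*"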

phreed commented 2 years ago

Thanks @leofang. Those MPI packages look like a great example. I will copy the link to the top.

I have taken a look at the feedstocks.

These packages produce multiple outputs. The https://github.com/conda-forge/openmpi-feedstock/blob/main/recipe/conda_build_config.yaml specifies two types of packages, conda and external. The external type packages are empty (dummy) packages.

beckermr commented 2 years ago

These are empty on purpose and meant to allow packages to link to the system MPI libraries. These were made to address the very specific needs of HPC users. I don't think this is something we want to support generically for libraries.

phreed commented 2 years ago

@beckermr I am not sure I understand your argument. You concede that there are specific needs for a certain group of users and that those needs were addressed by this technique; then you discount the technique.

It sounds like the technique is viable but its application needs to be carefully defined. It is that definition I am trying to pin down. Oh, and a variant of the technique is applied to CDT as well.

beckermr commented 2 years ago

It needs to be extremely specific and carefully defined, yes.

beckermr commented 2 years ago

I'd also add that it is generally a last resort as opposed to standard practice.

The motivation behind CDTs and the mpi external shims is a combination of

  1. The packages are hard or impossible to build, licensing notwithstanding.
  2. There is generally no other way to support this.
  3. There are no other tools within conda-forge that can do the job.
  4. There is some semblance of ABI stability so this might actually work in most cases.

This is not a strict list of requirements, but it does cover the kinds of things we think about when going down this path.

phreed commented 2 years ago

Those all sound like good reasons. What about licensing issues? I have two types of situations where I would like to use some version of this "wrapping" technique: (1) I have a package with a restrictive license for which there exists a similar package with an open license; (2) I have a commercial system for which no alternative exists, and my system must integrate with it. The family of techniques needs a name (or names). From the GoF patterns, wrapper, decorator, and proxy are candidates.

In the first case I would really like to mark the package as deprecated and refer to the similar package. If it were possible I would prefer that these packages be separated into a distinct repository.

In the second case, things would be tricky, but it is such a common problem I would think it would be worth coming up with something.

phreed commented 2 years ago

@hmaarrfk

I think it is about freedom of choice. We let people use Nvidia stuff.

Here are links to the corresponding Nvidia feedstock.

The difference is that there is a python wrapper in between. https://py3nvml.readthedocs.io/en/latest/

beckermr commented 2 years ago

I have a package with a restrictive license for which there exists a similar package with an open license.
I have a commercial system for which no alternative exists, and my system must integrate with it.

Right, so the issue with these likely comes down to the ABI and builds.

In both the Nvidia case and the MPI case, we've worked out the ABI constraints and how to build our software on a public CI service that can then integrate with the closed source/commercial system.

For example, let's say you have a special compiler, myfancytool. If there is no way to legally download it, use it for builds, and possibly redistribute runtimes and/or other bits from myfancytool, then we really cannot do much on the conda-forge side. A good example of this is some HPC systems with vendor-specific compilers. We currently don't support those systems or their compilers due to issues like these.

phreed commented 2 years ago

Suppose there are commercial or restricted systems M, F, and H.

Suppose I am developing a plugin for a system M, and M's plugins are Java 8 targets which make use of M's JVM-based libraries. Making a wrapper package for M's libraries should present no ABI issues, but the package should be clearly marked with the JVM-v8 marker.

Similarly, if I am developing a plugin for a system N having a plugin API with libraries wrapped in Python, there should be no problems.

But if I am developing a plugin for system H, which has a C-based API, that may be a big problem, especially if the C plugin is compiled by a compiler which produces object files that do not conform to a standard calling convention.

Would it be helpful to call out some commercial packages? I have been refraining from mentioning specific systems, with the exception of the Oracle JDK, because it is very familiar in the licensing space and it is a system for which I actually need a package.

beckermr commented 2 years ago

At this point, specifics will be the most useful thing. I am not knowledgeable about java and so you'll have to point those specific questions to someone else. I only responded here to help lay out general principles about how we think about these classes of packaging problems.

hmaarrfk commented 2 years ago

An important consideration is the reach of the libraries "F", "M", and "H". Can users realistically use them, or are they niche applications?

Niche applications are probably best left for your own channel where you can make things match to your own machine based on the specifics of your installation.

The NVIDIA recipe I'm thinking of is: https://github.com/conda-forge/cudnn-feedstock

We use it to build things like pytorch and tensorflow with GPU support.

phreed commented 2 years ago

I am beginning to think the recommendation should be to place packages like these in alternate channels. Of course that means I will need to learn how to stand up my own channel :-) If you agree, I suppose the best way to wrap up this discussion is with a recommendation about working with non-conda-forge channels (that remain compatible with conda-forge). Then these packages could be candidates for inclusion into conda-forge.

phreed commented 2 years ago

If you want to share the package there are choices besides conda-forge.

The Nexus repo manager only supports conda-proxy, not hosted.
GitLab is still in the planning phase, possibly planning on incorporating quetz.
Azure looks like it might work.
An Anaconda organization looks like the default approach.
Are there any hosted quetz servers?

hmaarrfk commented 2 years ago

Just use Anaconda's hosted solution. For small packages, it is enough. If you don't have name clashes with conda-forge, you stay compatible.

conda config --add channels YOUR_CHANNEL_NAME

then install.
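For example, with a hypothetical package name:

# Either rely on the channel added above, or name the channel explicitly.
conda install -c YOUR_CHANNEL_NAME your-wrapped-package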

You can even make your customers pay for it.

I upload many packages that I need for myself to https://anaconda.org/mark.harfouche/repo. See my non-standard libc package.

phreed commented 2 years ago

Next point: should meta.yaml be modified to perform external discovery in some standard way? E.g. for the Oracle JDK, though others would be similar.

...
discovery:
  home: 'C:\Program Files\Java\jdk1.8.0_(\d*)-.*'   # [win]
  home: '/usr/java/jdk1.8.0_(\d*)-.*'               # [not win]
  max: $1
  bin: $home/bin
  lib: $home/lib
  include: $home/include
  env:
    JAVA_HOME: $home
  test:
    - if exist $home\bin\java    # [win]
    - test -f $home/bin/java     # [not win]
  instructions: >
    If the test fails, show these instructions about how to install the
    discoverable package.
...

This information would be used during activation to verify whether the package is installed. If installed it would be added to the environment.

Would it be possible to include in the test whether ABI requirements are met?

beckermr commented 2 years ago

IMHO, no, we should not support that. It is a lot of code to maintain for edge cases.

phreed commented 2 years ago

The issue, then, is whether it is an edge case or just rare because it is difficult to get right. Quoting @hmaarrfk:

However, it is tricky to get these packages merged into conda-forge. For example cuda 11.6 still isn't in, however, it exists in nvidia's channels and has for a while. Mostly due to the difficulty in testing downstream effects which are "hidden" from conda-forge.

hmaarrfk commented 2 years ago

As I've said before, I don't know why you need to do "discovery" prematurely. You can do it all in an activate script. It doesn't seem to me that you need new features.

hmaarrfk commented 2 years ago

There are virtual packages in conda: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-virtual.html

But I'm not sure you will be able to convince people that you need a virtual package for each piece of proprietary software.

Nor do I think it solves your use case of wanting to switch between versions.

phreed commented 2 years ago

I am presuming that either conda activate or conda install would make use of the new information to perform discovery and checks on what amount to virtual packages. In other words, it would not be premature. The issue with writing your own activate script is that you need a corresponding deactivate script. The documentation recommends against writing activate scripts, I believe, for this very reason.

phreed commented 2 years ago

What I am proposing is to first develop a way of constructing activate and deactivate scripts that do not have the problems outlined in this discussion. This could be done by generating them from a shared discovery.yaml. Then (and only then) consider incorporating that discovery.yaml into meta.yaml and updating conda-build and conda.
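As a sketch of what such a generated activate script might look like for the Oracle JDK example above (the search path, variable names, and the generation step itself are all hypothetical):

# activate.d script generated from a hypothetical discovery.yaml (sketch).
# Locate a matching JDK install and verify it before activating.
_jdk_home="$(ls -d /usr/java/jdk1.8.0_* 2>/dev/null | sort -V | tail -n 1)"
if [ -z "${_jdk_home}" ] || [ ! -x "${_jdk_home}/bin/java" ]; then
    echo "oracle-jdk wrapper: no JDK found; see the package instructions." >&2
else
    export _CONDA_ORACLEJDK_SAVED_JAVA_HOME="${JAVA_HOME:-}"
    export JAVA_HOME="${_jdk_home}"
    export PATH="${JAVA_HOME}/bin:${PATH}"
fi
unset _jdk_home

The matching deactivate script would restore the saved JAVA_HOME and strip the added PATH entry.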

hmaarrfk commented 2 years ago

My understanding is that the advice against activate and deactivate scripts is meant to discourage the use of external dependencies.

You are however knowingly going against this advice because you have a particular need to do it.

I wouldn't try to get them perfect on day 1 for your own use case. Conda took a while to get nested environments right. I'm not sure they are completely correct today.

The conda-smithy tooling is available to help you create your own "feedstocks" and use azure to deploy them to your anaconda channel. You will have 95% of the benefits of integrating it in conda-forge.

phreed commented 2 years ago

The conda-smithy tooling is available to help you create your own "feedstocks" and use azure to deploy them to your anaconda channel. You will have 95% of the benefits of integrating it in conda-forge.

I do not see the documentation for doing what you are describing. Can you supply a link?

phreed commented 2 years ago

This? https://medium.com/@lgouarin/conda-smithy-an-easy-way-to-build-and-deploy-your-conda-packages-12216058648f

hmaarrfk commented 2 years ago

See final bullets of https://github.com/conda-forge/conda-smithy#making-a-new-feedstock
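Roughly, the workflow described there looks like the following (command names are from the conda-smithy README; exact flags can vary by version, and uploading to a personal Anaconda channel additionally requires an API token configured in the CI settings):

# Turn a recipe directory into a self-contained feedstock repository.
conda smithy init ./my-package-recipe

# Create the GitHub repo and register CI under your own account.
cd my-package-feedstock
conda smithy register-github --user YOUR_GITHUB_USER .
conda smithy register-ci --user YOUR_GITHUB_USER --feedstock_directory .

# Regenerate the CI configuration whenever the recipe changes.
conda smithy rerender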

phreed commented 2 years ago

I am having a similar issue with an older version of Qt. https://forum.qt.io/topic/136288/distribution-of-qt-via-conda-forge