Help mamba decide on CUDA variant

wolfv commented 2 years ago

Mamba currently can't properly decide which tensorflow cuda package to prefer since we don't directly depend on cudatoolkit in the tensorflow package.

Only in the tensorflow-base package we have a cudatoolkit dependency.

However, tensorflow-base is connected to tensorflow via an exact dependency.

If we'd also add the cudatoolkit to the tensorflow package, mamba would be able to select the best version.

OR we could remove the tensorflow variant completely and just have a single metapackage that depends on tensorflow-base {{ VERSION }} (but not exact). Then mamba could decide the tensorflow-base version more freely.

wolfv commented 2 years ago

What happens currently is that mamba "randomly" picks cuda 11.1 (even though cuda 11.2 would probably be preffered).

183amir commented 2 years ago

I don't think the issue is here. Mamba should be able to handle this instead.

h-vetinari commented 2 years ago

I don't think the issue is here. Mamba should be able to handle this instead.

As long as we're not compromising on the quality of the packaging (i.e. everything works as it should), I don't see why we shouldn't accommodate mamba here. Yes, it would be nicer if it were fixed in mamba, but it's used more and more (not least in CF-CI), so keeping things knowingly suboptimal has to have a commensurate benefit for not doing it (especially as the workaround is really not painful), IMO.

wolfv commented 2 years ago

@183amir the problem in this case is that we have a variant package (tensorflow) that restricts the cuda version of downstream packages. But we have no way of choosing which tensorflow package is the best without doing a global optimization.

We could change this in the following ways:

have only tensorflow pyXX meta-package (without CUDA, or even without Python!), then the tensorflow packages depends on tensorflow-base 2.6 (pyXX) and variant selection would work fine
Add a direct dependency to the same cudatoolkit from the build string, then variant selection would also work fine :)

Global optimization, with the amount of variant packages that exist in conda-forge, takes a looong time (because many branches have to be explored). I think it's good to make it as simple as possible for the solver to find the best solution.

wolfv commented 2 years ago

I tried to write down how the sorting and solving works here: https://mamba.readthedocs.io/en/latest/advanced_usage/package_resolution.html

183amir commented 2 years ago

As long as we're not compromising on the quality of the packaging (i.e. everything works as it should), I don't see why we shouldn't accommodate mamba here.

I agree.

have only tensorflow pyXX meta-package (without CUDA, or even without Python!), then the tensorflow packages depends on tensorflow-base 2.6 (pyXX) and variant selection would work fine

I don't think this is a good solution because the exact dependency on tensorflow-base is there to make sure the split packages are all installed from one build. That is, tensorflow, tensorflow-estimator, etc. are installed from one build.

Add a direct dependency to the same cudatoolkit from the build string, then variant selection would also work fine :)

I am not sure what this means. Could you please explain more? Maybe you could show how the recipe would look like.

izahn commented 2 years ago

This might be a separate topic, but do we actually need both tensorflow and tensorflow-base packages? Currently the only difference between mamba create -d -n test "tensorflow=*=cuda112*" and mamba create -d -n test "tensorflow-base=*=cuda112*" is that the first one installs tensorflow and tensorflow-estimator. The same is true for "tensorflow-base=*=cuda112*" vs. "tensorflow-base=*=cuda112*". Unless there is a common need to install tensorflow without tensorflow-estimator that doesn't make much sense to me.

wolfv commented 2 years ago

So right now we have the following situation:

tensorflow-py39-cuda112
depends on
- python 3.9
- tensorflow-base py39-cuda112 (heavily constrained, can only use a single tensorflow-base!)
tensorflow-py38-cuda112
- depends on python 3.8
- tensorflow-base py38-cuda112
tensorflow-py39-cuda111
- depends on python 3.8
- tensorflow-base py39-cuda111

So here, mamba can correctly decide for tensorflow-py39... but cannot select the cuda variant because it cannot judge by inspecting the first order dependencies.

If we add cudatoolkit at this level, it woudl work fine. Otherwise, we could modify it like this:

tensorflow-py39
- depends on python 3.8
- tensorflow-base py39*
   can be resolved to 
   - tensorflow-base py39-cuda102
   - tensorflow-base py39-cuda110
   - tensorflow-base py39-cuda111
   - tensorflow-base py39-cuda112

Then again we could resolve properly, because we can decide again for tensorflow-base and select the one that uses the highest cudatoolkit.

I don't know if this makes sense? It's kinda hard to explain.

Basically with the current setup we would need to look at all tensorflow packages, and then at all tensorflow-base packages to figure out which one is best.

What I am proposing is to

either make it possible to decide in tensorflow (by adding the cudatoolkit as dependency)
or to leave the decision when deciding for tensorflow-base (by removing the exact pin)

Eliminating tensorflow-base would also fix this :)

hmaarrfk commented 2 years ago

I kinda feel like we should add cuda-toolkit to tensorflow. I'm not too sure why tensorflow-base still exists today. I think it was historically there to help with circular dependencies with estimator

xhochy commented 2 years ago

I think it was historically there to help with circular dependencies with estimator

Yes!

hmaarrfk commented 2 years ago

I'm sorry, do we want to add :

cudatoolkit or
cudnn
nccl

to the host section

wolfv commented 2 years ago

cudatoolkit is the one whose version is in the variant, right? so that should be enough to help mamba decide

hmaarrfk commented 2 years ago

I guess i don't see cudatoolkit anywhere in this recipe or in that of the pytorch recipe. I feel like we should make this recipe a model to follow for other recipes.

wolfv commented 2 years ago

I think it's exported from compiler('cuda')?

hmaarrfk commented 2 years ago

great. makes sense. let's add that.

it seems like we should be adding it to libtensorflow_cc as well right?

hmaarrfk commented 2 years ago

Ok, I've folded it in the recent migraiton.

h-vetinari commented 2 years ago

IIUC, we could get rid of tensorflow-base now that the estimator is folded into this feedstock?

How about we do this for 2.7.0?

hmaarrfk commented 2 years ago

we could do that.... if downstream packages are not depending on it.

There seems to be quite a few things on github that are now depending on it already. So we kinda have to carry it forward.

h-vetinari commented 2 years ago

If it's still needed, we could still keep it as a compatibility output that just 1:1 installs tensorflow, ideally together with an activation script that warns to change the dependence away from tensorflow-base.

hmaarrfk commented 2 years ago

We had tried to keep tensorflow as simply installing tensorflow-base exactly pinned, but we needed to add more constraints to it.

I'm not sure there is an easy answer to this.

h-vetinari commented 2 years ago

If we don't need tensorflow-base anymore to break circular dependencies with tf-estimator, then we can just fully absorb it into tensorflow in the recipe, and then add a new output tensorflow-base as a thin wrapper around tensorflow. I can't see how that wouldn't work...

hmaarrfk commented 2 years ago

In your sentence, swap tensorflow-base and tensorflow and that is what we had 2 weeks ago.

xhochy commented 2 years ago

I introduced tensorflow-base only to get tensorflow-estimator building inside the feedstock here. Will still need that or do we have a workaround for the cyclic dependency?

h-vetinari commented 2 years ago

In your sentence, swap tensorflow-base and tensorflow and that is what we had 2 weeks ago.

You said above:

We had tried to keep tensorflow as simply installing tensorflow-base exactly pinned, but we needed to add more constraints to it.

Why not still add those constraints and then call it tensorflow / tensorflow-base (whichever way the wrapper goes)?

hmaarrfk commented 2 years ago

Maybe i'm just confused. It might be just easier if you show me in a PR.

Can we agree on an order of priorities? I suggest:

Get the migration moving foward #144
Build windows (cpu only)
Cleanup the recipe (with what you are suggesting).

h-vetinari commented 2 years ago

Can we agree on an order of priorities?

I agree that #144 is priority number 1, and as I said I'm fine to clean up the tf-base situation together with 2.7.0. I don't think we should make these improvements dependent on windows builds, which has a large number of known & unknown unknowns.

IOW, from your list, I suggest 1. -> 3. -> 2.

hmaarrfk commented 2 years ago

has this been resolved?

conda-forge / tensorflow-feedstock

Help mamba decide on CUDA variant #162