Remove cuda from default virtual packages? #426

minrk commented 1 year ago


What happened?

I noticed while installing pytorch-cpu that it pulled in the cuda variant of libhwloc and thereby cudatoolkit, doubling the size of my image. I tracked it down to the default virtual package spec assuming all machines are likely to have cuda by default.

I was able to solve it with a custom virtual packages spec (capturing the virtual packages from conda info in the base image), but it seems to make more sense to me for cuda to be opt-in instead of opt-out. It's unavailable more often than not, and the cost of wrongly assuming its presence is high (massive size increase, non-working packages), while the cost of missing it is low (reduced performance, or an informative, immediate error if cuda is actually required to install). Or is there a consideration I'm missing?
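For anyone wanting to reproduce the workaround, here is a minimal sketch of turning captured virtual packages into a spec without cuda. The function names are hypothetical, and the `subdirs` / `packages` layout is an assumption modelled on conda-lock's virtual package spec, so check it against the conda-lock docs before relying on it. Only `__archspec=1=x86_64` comes from the conda info output in this issue; the other entries are illustrative. Dropping the build string is a simplification.

```python
import json


def parse_virtual_packages(lines):
    """Turn 'name=version=build' strings into a {name: version} mapping."""
    packages = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("__"):
            continue  # virtual package names start with a double underscore
        name, _, rest = line.partition("=")
        packages[name] = rest.split("=", 1)[0]  # keep the version, drop the build
    return packages


def build_spec(subdir, lines, drop=("__cuda",)):
    """Build a spec dict for one platform, leaving out the packages in `drop`."""
    packages = {
        name: version
        for name, version in parse_virtual_packages(lines).items()
        if name not in drop
    }
    # Assumed layout: subdirs -> <platform> -> packages -> {name: version}
    return {"subdirs": {subdir: {"packages": packages}}}


if __name__ == "__main__":
    # Illustrative input; only the __archspec entry appears in this issue.
    captured = ["__archspec=1=x86_64", "__glibc=2.35=0", "__cuda=11.4=0"]
    # JSON is valid YAML, so this output can be saved as a .yaml file.
    print(json.dumps(build_spec("linux-64", captured), indent=2))
```

The resulting file would then be passed to conda-lock as the virtual package spec, so the solver no longer assumes `__cuda` is present.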

Conda Info

active environment : base
    active env location : /opt/conda
            shell level : 1
       user config file : /home/user/.condarc
 populated config files : /opt/conda/.condarc
          conda version : 23.3.1
    conda-build version : not installed
         python version :
       virtual packages : __archspec=1=x86_64
       base environment : /opt/conda  (writable)
      conda av data dir : /opt/conda/etc/conda
  conda av metadata url : None
           channel URLs :
          package cache : /opt/conda/pkgs
       envs directories : /opt/conda/envs
               platform : linux-64
             user-agent : conda/23.3.1 requests/2.31.0 CPython/3.10.11 Linux/5.15.0-1031-gcp ubuntu/22.04.2 glibc/2.35
                UID:GID : 1000:100
             netrc file : None
           offline mode : False

Conda Config

==> /opt/conda/.condarc <==
auto_update_conda: False
channels:
  - conda-forge
show_channel_urls: True

Conda list

Additional Context

A simple env that produces the issue is:

channels:
  - conda-forge
dependencies:
  - tbb

This pulls in the cuda variant of libhwloc, and with it cudatoolkit, by default on Linux and Windows.

maresb commented 1 year ago

I don't really use cuda much, so hopefully @mariusvniekerk can chime in. It seems reasonable to me, although a fairly significant breaking change.

In any case, it looks like we're using a CUDA 11.4 virtual package in our fake repodata, even though the latest is 12.1. Perhaps there are multiple todos here on the virtual package front?

minrk commented 1 year ago

I think probably the biggest counterargument is that most cuda things explicitly depend on cuda, so some of those environments will no longer be solvable without a virtual package spec, and it's uncommon for things that don't depend on cuda to have a cuda variant. I get the impression that this is becoming less true, though. I just happen to be using one of those packages (torch-cpu -> mkl -> tbb -> hwloc -> cudatoolkit).

maresb commented 1 year ago

Would it make sense to introduce --with-cuda and --without-cuda flags? Then, if something CUDA-related is installed and neither flag is specified, we emit a warning asking the user to be explicit?

minrk commented 1 year ago

That would mean keeping cuda in the default virtual packages (otherwise it would fail to solve and we wouldn't get to a warning), then checking for e.g. cudatoolkit in the result and warning if cuda was left unspecified (no virtual packages, no --with[out]-cuda)? That seems like reasonable behavior. A bit more complex to implement, but not too bad.
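A rough sketch of that check, assuming the solve result is available as a list of package names. Everything here is hypothetical and not existing conda-lock code: the function name, the `cuda_flag` tri-state (None when neither flag was given), and the `has_virtual_package_spec` parameter are only illustrations of the idea.

```python
import warnings


def warn_if_cuda_implicit(
    locked_names,
    cuda_flag=None,
    has_virtual_package_spec=False,
    markers=("cudatoolkit",),
):
    """Warn when a CUDA package was locked without an explicit user choice.

    cuda_flag: True for --with-cuda, False for --without-cuda, None if unset.
    Returns True if a warning was emitted, False otherwise.
    """
    explicit = cuda_flag is not None or has_virtual_package_spec
    found = sorted(set(locked_names) & set(markers))
    if found and not explicit:
        warnings.warn(
            "CUDA packages were locked (" + ", ".join(found) + ") but cuda was "
            "left unspecified; pass --with-cuda or --without-cuda, or provide "
            "a virtual package spec.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


if __name__ == "__main__":
    # The tbb example from this issue: the solve result includes cudatoolkit.
    warn_if_cuda_implicit(["tbb", "libhwloc", "cudatoolkit"])
```

The solve itself still runs with cuda in the default virtual packages, so existing environments keep working; the warning only nudges users to state their intent.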

maresb commented 1 year ago

This seems to be quite doable to me, and it sounds like this would have prevented the need for you to debug the image size.

I'd be happy to accept a PR. I'm a bit time-constrained, but I might be able to get to this several months from now.

minrk commented 1 year ago

I'll have a look if I get a chance.