NVIDIA / numbast

Numbast is a tool to build an automated pipeline that converts CUDA APIs into Numba bindings.
Apache License 2.0

Reduce conda package dependencies #62

Open isVoid opened 3 months ago

isVoid commented 3 months ago

closes #53

isVoid commented 3 months ago

Now the aarch64 instance passes, but the x86 instance fails. x86 still pulls:

cuda-nvcc                 12.4.131                      0    nvidia

isVoid commented 3 months ago

Note to self: the current conda test instance tests only one conda package among the 3.9, 3.10, and 3.11 builds.

msarahan commented 2 months ago

The runtime use of compilers was always kind of messy in conda. Sorry about that. I don't think you should put cuda-nvcc_{{target_platform}} in your run dependencies. cuda-nvcc should suffice, but the version constraint does seem important. In the nvidia channel repodata, you can see that the v12.4 cuda-nvcc has no dependencies at all:

    "cuda-nvcc-12.4.99-0.tar.bz2": {
      "build": "0",
      "build_number": 0,
      "depends": [],
      "md5": "4ae1a7f76505916c3781ce5a9e35cab6",
      "name": "cuda-nvcc",
      "sha256": "409b5a6edb16e2d75129eaea525dcd7db6a9a42942f3d238ed5d69e7cdf0829a",
      "size": 65643693,
      "subdir": "linux-64",
      "timestamp": 1709083273542,
      "version": "12.4.99"
    },
    "cuda-nvcc-12.5.40-0.tar.bz2": {
      "build": "0",
      "build_number": 0,
      "depends": [
        "cuda-nvcc_linux-64 12.5.40.*",
        "gcc_linux-64",
        "gxx_linux-64"
      ],
      "license": "LicenseRef-NVIDIA-End-User-License-Agreement",
      "md5": "24b9715663c23f0bb8e2565ba192d69f",
      "name": "cuda-nvcc",
      "sha256": "fc93ac4e9dc5091508d468064dd361e07ae3e62f1c26e46c0bc6f96935a992d0",
      "size": 16632,
      "subdir": "linux-64",
      "timestamp": 1713410246105,
      "version": "12.5.40"
    },

This means that it won't conflict with any other cuda packages, and it's free to fill in a constraint opportunistically. I don't know the new solver as well as the old one, but the old one explicitly tried to minimize the number of packages that need to be changed. I suspect that kind of criterion is at play here.
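As a rough illustration of the point above, the following Python sketch scans repodata records for entries whose `depends` list is empty. It uses only the two records quoted above, trimmed to the relevant fields; the helper name `unconstrained_records` is hypothetical and not part of conda. In a full repodata.json these records sit under the top-level `packages` key.

```python
import json

# Trimmed copies of the two linux-64 records quoted above (nvidia channel).
REPODATA_SNIPPET = """
{
  "cuda-nvcc-12.4.99-0.tar.bz2": {
    "name": "cuda-nvcc", "version": "12.4.99", "depends": []
  },
  "cuda-nvcc-12.5.40-0.tar.bz2": {
    "name": "cuda-nvcc", "version": "12.5.40",
    "depends": ["cuda-nvcc_linux-64 12.5.40.*", "gcc_linux-64", "gxx_linux-64"]
  }
}
"""


def unconstrained_records(packages, name):
    """Return the versions of `name` whose record lists no dependencies."""
    return sorted(
        rec["version"]
        for rec in packages.values()
        if rec["name"] == name and not rec.get("depends")
    )


packages = json.loads(REPODATA_SNIPPET)
# Only the 12.4.99 record is dependency-free, so it can satisfy a bare
# "cuda-nvcc" requirement without pulling in or conflicting with anything.
print(unconstrained_records(packages, "cuda-nvcc"))
```

Any version this reports is one the solver can pick without affecting the rest of the environment.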

If cuda-nvcc has a range of GCC versions that it needs, I think it should be capturing those somehow in the dependencies for the cuda-nvcc metapackage. Do we need to patch that repodata? What's the standard for fixing metadata on the nvidia channel?

bdice commented 2 months ago

@msarahan Were you able to try these suggestions? I experimented along similar lines when I reproduced CI conda builds locally and did not have much luck. Convincing the solver to behave as I expected seems to be more complicated than I would have anticipated. This PR may require more hands-on work.

edit: I added some comments above, hope that helps point in the right direction.

msarahan commented 2 months ago

Here's what I'm seeing:

conda create --dry-run -n blah -c conda-forge -c nvidia "cuda-version>=12.5" cuda-nvcc "clangdev>=18"

This makes me think that something is wrong with the constraints for the cuda-nvcc packages. I went ahead and created the environment and looked through the JSON files in conda-meta for the environment. cuda-nvcc-dev_linux-64-12.5.82-ha770c72_0.json shows that the constraint on gcc_impl_linux-64 is:

"constrains": [
    "gcc_impl_linux-64 >=6,<14.0a0"
  ],

which explains why the goofy early GCC versions were allowed, unless you add the additional gcc version constraint. Same story for the undesirable GCC 13 install. So I'll return to my earlier point/question: what is imposing the functional constraint on GCC? Is it ast_canopy, or is it nvcc itself? If ast_canopy is more sensitive to GCC version than nvcc, then the constraint in this recipe is fine (but I'd probably add a comment about it). If NVCC's GCC bounds are too broad, then that's a problem in that recipe, and it needs to be changed and repodata-hotfixed (or else the too-broad bounds will continue causing issues).
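To make the effect of that bound concrete, here is a minimal Python sketch that checks a few GCC version strings against the `constrains` entry quoted above, and then against the narrower `>=10,<13` bound used elsewhere in this thread. This is not conda's version-matching logic: `release_tuple` and `satisfies` are hypothetical helpers that handle only `>=` and `<` clauses and drop pre-release suffixes such as `a0`, which is safe only when the candidates are final releases. The candidate versions are illustrative strings, not a listing of what any channel ships.

```python
import json
import re

# The `constrains` entry quoted above from the conda-meta record.
RECORD = json.loads('{"constrains": ["gcc_impl_linux-64 >=6,<14.0a0"]}')

OPS = {
    ">=": lambda a, b: a >= b,
    "<": lambda a, b: a < b,
}


def release_tuple(version, width=3):
    """Numeric release segments only, e.g. '14.0a0' -> (14, 0, 0)."""
    matches = (re.match(r"\d+", part) for part in version.split("."))
    parts = [int(m.group()) for m in matches if m]
    return tuple(parts + [0] * (width - len(parts)))[:width]


def satisfies(version, spec):
    """True if `version` meets every comma-separated clause in `spec`."""
    candidate = release_tuple(version)
    for clause in spec.split(","):
        m = re.fullmatch(r"(>=|<)(\d\S*)", clause.strip())
        if m is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        op, bound = m.groups()
        if not OPS[op](candidate, release_tuple(bound)):
            return False
    return True


name, nvcc_spec = RECORD["constrains"][0].split(" ", 1)
recipe_spec = ">=10,<13"  # the extra bound discussed in this thread

for gcc in ["7.5.0", "12.4.0", "13.3.0", "14.1.0"]:  # illustrative versions
    by_nvcc = satisfies(gcc, nvcc_spec)
    by_both = by_nvcc and satisfies(gcc, recipe_spec)
    print(f"{name} {gcc}: nvcc bound={by_nvcc}, with recipe bound={by_both}")
```

Under these assumptions, the nvcc bound alone admits both a GCC 7 and a GCC 13 candidate, while adding the recipe bound leaves only the GCC 12 candidate.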

As for why cuda-nvcc 12.4 from the nvidia channel keeps showing up, it has no runtime constraints. It can satisfy the cuda-nvcc dependency in meta.yaml without having any effect on anything else. If anything about cuda-nvcc 12.5 is somehow incompatible with the other cuda packages or gcc packages, then going back to the cuda-nvcc 12.4 package from nvidia channel is an easy way out.

Per @bdice's comments above, I think it works ok, but maybe I don't understand well enough what a valid environment is in this situation.

conda create --dry-run -n ast_canopy_test2 -c conda-forge -c nvidia "cuda-version>=12.5" cuda-nvcc "gcc>=10,<13" "clangdev>=18"

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge 
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-2_gnu 
  _sysroot_linux-64~ conda-forge/noarch::_sysroot_linux-64_curr_repodata_hack-3-h69a702a_16 
  binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7 
  binutils_linux-64  conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_0 
  clang              conda-forge/linux-64::clang-18.1.8-default_h9e3a008_1 
  clang-18           conda-forge/linux-64::clang-18-18.1.8-default_hf981a13_1 
  clang-format       conda-forge/linux-64::clang-format-18.1.8-default_hf981a13_1 
  clang-format-18    conda-forge/linux-64::clang-format-18-18.1.8-default_hf981a13_1 
  clang-tools        conda-forge/linux-64::clang-tools-18.1.8-default_hf981a13_1 
  clangdev           conda-forge/linux-64::clangdev-18.1.8-default_hf981a13_1 
  clangxx            conda-forge/linux-64::clangxx-18.1.8-default_h3d5eb1d_1 
  cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.5.39-ha770c72_0 
  cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.5.82-ha770c72_0 
  cuda-crt-tools     conda-forge/linux-64::cuda-crt-tools-12.5.82-ha770c72_0 
  cuda-cudart        conda-forge/linux-64::cuda-cudart-12.5.82-he02047a_0 
  cuda-cudart-dev    conda-forge/linux-64::cuda-cudart-dev-12.5.82-he02047a_0 
  cuda-cudart-dev_l~ conda-forge/noarch::cuda-cudart-dev_linux-64-12.5.82-h85509e4_0 
  cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.5.82-he02047a_0 
  cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.5.82-h85509e4_0 
  cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.5.82-h85509e4_0 
  cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.5.82-h85509e4_0 
  cuda-nvcc          conda-forge/linux-64::cuda-nvcc-12.5.82-hcdd1206_0 
  cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.5.82-ha770c72_0 
  cuda-nvcc-impl     conda-forge/linux-64::cuda-nvcc-impl-12.5.82-hd3aeb46_0 
  cuda-nvcc-tools    conda-forge/linux-64::cuda-nvcc-tools-12.5.82-hd3aeb46_0 
  cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.5.82-h8a487aa_0 
  cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.5.82-ha770c72_0 
  cuda-nvvm-impl     conda-forge/linux-64::cuda-nvvm-impl-12.5.82-h59595ed_0 
  cuda-nvvm-tools    conda-forge/linux-64::cuda-nvvm-tools-12.5.82-h59595ed_0 
  cuda-version       conda-forge/noarch::cuda-version-12.5-hd4f0392_3 
  gcc                conda-forge/linux-64::gcc-12.4.0-h236703b_0 
  gcc_impl_linux-64  conda-forge/linux-64::gcc_impl_linux-64-12.4.0-hb2e57f8_0 
  gcc_linux-64       conda-forge/linux-64::gcc_linux-64-12.4.0-h6b7512a_0 
  gxx_impl_linux-64  conda-forge/linux-64::gxx_impl_linux-64-12.4.0-h557a472_0 
  gxx_linux-64       conda-forge/linux-64::gxx_linux-64-12.4.0-h8489865_0 
  icu                conda-forge/linux-64::icu-75.1-he02047a_0 
  kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-h4a8ded7_16 
  ld_impl_linux-64   conda-forge/linux-64::ld_impl_linux-64-2.40-hf3520f5_7 
  libclang           conda-forge/linux-64::libclang-18.1.8-default_hf981a13_1 
  libclang-cpp       conda-forge/linux-64::libclang-cpp-18.1.8-default_hf981a13_1 
  libclang-cpp18.1   conda-forge/linux-64::libclang-cpp18.1-18.1.8-default_hf981a13_1 
  libclang13         conda-forge/linux-64::libclang13-18.1.8-default_h9def88c_1 
  libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-12.4.0-ha4f9413_100 
  libgcc-ng          conda-forge/linux-64::libgcc-ng-14.1.0-h77fa898_0 
  libgomp            conda-forge/linux-64::libgomp-14.1.0-h77fa898_0 
  libiconv           conda-forge/linux-64::libiconv-1.17-hd590300_2 
  libllvm18          conda-forge/linux-64::libllvm18-18.1.8-h8b73ec9_1 
  libsanitizer       conda-forge/linux-64::libsanitizer-12.4.0-h46f95d5_0 
  libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-12.4.0-ha4f9413_100 
  libstdcxx-ng       conda-forge/linux-64::libstdcxx-ng-14.1.0-hc0a3c3a_0 
  libxml2            conda-forge/linux-64::libxml2-2.12.7-he7c6b58_4 
  libzlib            conda-forge/linux-64::libzlib-1.3.1-h4ab18f5_1 
  llvm-tools         conda-forge/linux-64::llvm-tools-18.1.8-h8b73ec9_1 
  llvmdev            conda-forge/linux-64::llvmdev-18.1.8-h8b73ec9_1 
  sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h4a8ded7_16 
  tzdata             conda-forge/noarch::tzdata-2024a-h0c530f3_0 
  xz                 conda-forge/linux-64::xz-5.2.6-h166bdaf_0 
  zstd               conda-forge/linux-64::zstd-1.5.6-ha6fb4c9_0 
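
A plan like the one above is easier to audit with a small script than by eye. The Python sketch below parses a few lines copied from that output and reports which channel each package comes from and which GCC version was selected. `PlanEntry` and `parse_plan` are hypothetical names, and the parser assumes the two-column `name channel/subdir::name-version-build` layout shown above.

```python
from collections import namedtuple

PlanEntry = namedtuple("PlanEntry", "channel subdir name version build")

# A few lines copied from the dry-run output above.
PLAN = """\
  cuda-nvcc          conda-forge/linux-64::cuda-nvcc-12.5.82-hcdd1206_0
  cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.5.82-ha770c72_0
  cuda-version       conda-forge/noarch::cuda-version-12.5-hd4f0392_3
  gcc                conda-forge/linux-64::gcc-12.4.0-h236703b_0
  libgcc-ng          conda-forge/linux-64::libgcc-ng-14.1.0-h77fa898_0
"""


def parse_plan(text):
    """Parse 'name channel/subdir::name-version-build' lines into entries."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2 or "::" not in fields[1]:
            continue  # skip blank lines and anything that is not a plan row
        source, dist = fields[1].split("::", 1)
        channel, subdir = source.rsplit("/", 1)
        # The first column may be truncated with '~', so take the full
        # package name from the dist string instead.
        name, version, build = dist.rsplit("-", 2)
        entries.append(PlanEntry(channel, subdir, name, version, build))
    return entries


entries = parse_plan(PLAN)
from_nvidia = [e.name for e in entries if e.channel == "nvidia"]
gcc = next(e for e in entries if e.name == "gcc")
print("packages from nvidia:", from_nvidia)
print("gcc version:", gcc.version)
```

For the sample rows, no package comes from the nvidia channel and the selected gcc is 12.4.0, matching the full listing.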

bdice commented 2 months ago

If NVCC's GCC bounds are too broad, then that's a problem in that recipe, and it needs to be changed and repodata-hotfixed (or else the too-broad bounds will continue causing issues).

Yes, this is the line of questioning we need to explore. The GCC bounds on the cuda-nvcc-dev_linux-64 package are determined by the CUDA Host Compiler Support Policy. The cuda-nvcc package is behaving as it should, according to CUDA compiler support policies. However, something is causing the fallback to GCC 7, when it should prefer a newer GCC. I added the bounds on gcc >=10 in this recipe just to force the solver away from that undesired solution. The fallback to GCC 7 suggests to me that there has been a change in GCC packaging that the solver decides is worth falling back for. I think we need to figure out if this is a GCC packaging bug causing the fallback, or at least understand why it is choosing GCC 7. Once we diagnose that, we could try to escalate that with conda-forge, or we could just narrow the cuda-nvcc bounds on gcc and use a repodata hotfix (e.g. if conda-forge is not interested in patching old compilers), or we could just say the scope of the problem is narrow enough that we only want to impose a narrower gcc constraint on the packages in this repo for now.

As for why cuda-nvcc 12.4 from the nvidia channel keeps showing up, it has no runtime constraints. It can satisfy the cuda-nvcc dependency in meta.yaml without having any effect on anything else. If anything about cuda-nvcc 12.5 is somehow incompatible with the other cuda packages or gcc packages, then going back to the cuda-nvcc 12.4 package from nvidia channel is an easy way out.

This is also a problem. We don't want to force users to use CUDA 12.5+, but I don't think we have a way to make earlier versions work as intended, so long as the nvidia channel is present (which is also a requirement). We want to be able to use this package alongside RAPIDS (which uses conda-forge packages and CUDA 12.0-12.5) and/or PyTorch (which uses nvidia packages and CUDA 12.1). Our goal is to come up with a set of run dependencies like conda create --dry-run -n blah -c conda-forge -c nvidia "cuda-version=12.4" cuda-nvcc "clangdev>=18" that solves with CUDA 12.4 (or anything older than 12.5) from conda-forge (since the nvidia packages are not useful prior to 12.5). However, I don't see a solution for this.
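
One way to explore that goal is to sweep the same dry-run command over several CUDA versions and compare the resulting plans. The Python sketch below only builds the command lines and does not invoke conda. `dry_run_command` is a hypothetical helper, the flags and specs are the ones used in the command above, and the version list is an assumption drawn from the ranges mentioned in this comment.

```python
import shlex


def dry_run_command(cuda_version, env_name="blah"):
    """Build the dry-run solve command for one pinned CUDA version."""
    return [
        "conda", "create", "--dry-run", "-n", env_name,
        "-c", "conda-forge", "-c", "nvidia",
        f"cuda-version={cuda_version}", "cuda-nvcc", "clangdev>=18",
    ]


# Assumed sample of versions; adjust to whatever needs to be supported.
CUDA_VERSIONS = ["12.0", "12.1", "12.4", "12.5"]

commands = [dry_run_command(v) for v in CUDA_VERSIONS]
for cmd in commands:
    # shlex.join quotes arguments such as "clangdev>=18" for safe pasting.
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, and the resulting plan checked for whether cuda-nvcc resolves from conda-forge at the pinned version.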