Open isVoid opened 3 months ago
And now the aarch64 instance passes, but the x86 instance fails. x86 still pulls:
cuda-nvcc 12.4.131 0 nvidia
Self note: the current conda test instance is only testing 1 conda package among the 3.9, 3.10, 3.11 builds.
The runtime use of compilers was always kind of messy in conda. Sorry about that. I don't think you should put cuda-nvcc_{{target_platform}} in your run dependencies. cuda-nvcc should suffice, but the version constraint does seem important. In the nvidia channel repodata, you can see that the v12.4 cuda-nvcc has no dependencies at all:
"cuda-nvcc-12.4.99-0.tar.bz2": {
"build": "0",
"build_number": 0,
"depends": [],
"md5": "4ae1a7f76505916c3781ce5a9e35cab6",
"name": "cuda-nvcc",
"sha256": "409b5a6edb16e2d75129eaea525dcd7db6a9a42942f3d238ed5d69e7cdf0829a",
"size": 65643693,
"subdir": "linux-64",
"timestamp": 1709083273542,
"version": "12.4.99"
},
"cuda-nvcc-12.5.40-0.tar.bz2": {
"build": "0",
"build_number": 0,
"depends": [
"cuda-nvcc_linux-64 12.5.40.*",
"gcc_linux-64",
"gxx_linux-64"
],
"license": "LicenseRef-NVIDIA-End-User-License-Agreement",
"md5": "24b9715663c23f0bb8e2565ba192d69f",
"name": "cuda-nvcc",
"sha256": "fc93ac4e9dc5091508d468064dd361e07ae3e62f1c26e46c0bc6f96935a992d0",
"size": 16632,
"subdir": "linux-64",
"timestamp": 1713410246105,
"version": "12.5.40"
},
This means that it won't conflict with any other cuda packages, and it's free to fill in a constraint opportunistically. I don't know the new solver as well as the old one, but the old one explicitly tried to minimize the number of packages that need to be changed. I suspect that kind of criterion is at play here.
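The difference between the two records can also be checked mechanically. Here is a minimal Python sketch, assuming the records have already been loaded into a dict shaped like the excerpt above (in a full repodata.json they sit under the top-level "packages" key). The records are trimmed to the relevant fields, and the function name `unconstrained_builds` is made up for illustration.

```python
# Records trimmed from the nvidia channel repodata excerpt above.
PACKAGES = {
    "cuda-nvcc-12.4.99-0.tar.bz2": {
        "name": "cuda-nvcc",
        "version": "12.4.99",
        "depends": [],
    },
    "cuda-nvcc-12.5.40-0.tar.bz2": {
        "name": "cuda-nvcc",
        "version": "12.5.40",
        "depends": [
            "cuda-nvcc_linux-64 12.5.40.*",
            "gcc_linux-64",
            "gxx_linux-64",
        ],
    },
}


def unconstrained_builds(packages, name):
    """Return filenames of `name` builds that declare no dependencies."""
    return sorted(
        filename
        for filename, record in packages.items()
        if record["name"] == name and not record.get("depends")
    )


print(unconstrained_builds(PACKAGES, "cuda-nvcc"))
```

Any build this returns can be dropped into an environment without pulling in or constraining anything else, which is the property described above.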
If cuda-nvcc has a range of GCC versions that it needs, I think it should be capturing those somehow in the dependencies for the cuda-nvcc metapackage. Do we need to patch that repodata? What's the standard for fixing metadata on the nvidia channel?
@msarahan Were you able to try these suggestions? I experimented along similar lines when I reproduced CI conda builds locally and did not have much luck. Convincing the solver to behave as I expected seems to be more complicated than I would have anticipated. This PR may require more hands-on work.
edit: I added some comments above, hope that helps point in the right direction.
Here's what I'm seeing:
conda create --dry-run -n blah -c conda-forge -c nvidia "cuda-version>=12.5" cuda-nvcc "clangdev>=18"
This makes me think that something is wrong with the constraints for the cuda-nvcc packages. I went ahead and created the environment and looked through the JSON files in conda-meta for the environment. cuda-nvcc-dev_linux-64-12.5.82-ha770c72_0.json shows that the constraint on gcc_impl_linux-64 is:
"constrains": [
"gcc_impl_linux-64 >=6,<14.0a0"
],
which explains why the goofy early GCC versions were allowed, unless you add the additional gcc version constraint. Same story for the undesirable GCC 13 install. So I'll return to my earlier point/question: what is imposing the functional constraint on GCC? Is it ast_canopy, or is it nvcc itself? If ast_canopy is more sensitive to GCC version than nvcc, then the constraint in this recipe is fine (but I'd probably add a comment about it). If NVCC's GCC bounds are too broad, then that's a problem in that recipe, and it needs to be changed and repodata-hotfixed (or else the too-broad bounds will continue causing issues).
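For anyone who wants to repeat the conda-meta inspection without opening files by hand, here is a rough Python sketch. It assumes each JSON record under conda-meta carries "name" and, optionally, "constrains" keys, as the file quoted above does. The helper name `collect_constrains` is hypothetical.

```python
import json
from pathlib import Path


def collect_constrains(prefix):
    """Map each installed package name to its "constrains" list.

    Reads every JSON record under <prefix>/conda-meta; packages with no
    "constrains" entries are skipped.
    """
    found = {}
    for path in sorted(Path(prefix, "conda-meta").glob("*.json")):
        record = json.loads(path.read_text())
        constrains = record.get("constrains") or []
        if constrains:
            found[record["name"]] = constrains
    return found
```

Calling `collect_constrains(prefix)` with the environment's prefix directory lists every package that imposes a constraint, which makes it easier to see where a GCC bound comes from.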
As for why cuda-nvcc 12.4 from the nvidia channel keeps showing up, it has no runtime constraints. It can satisfy the cuda-nvcc dependency in meta.yaml without having any effect on anything else. If anything about cuda-nvcc 12.5 is somehow incompatible with the other cuda packages or gcc packages, then going back to the cuda-nvcc 12.4 package from nvidia channel is an easy way out.
Per @bdice's comments above, I think it works ok, but maybe I don't understand well enough what a valid environment is in this situation.
conda create --dry-run -n ast_canopy_test2 -c conda-forge -c nvidia "cuda-version>=12.5" cuda-nvcc "gcc>=10,<13" "clangdev>=18"
_libgcc_mutex conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
_openmp_mutex conda-forge/linux-64::_openmp_mutex-4.5-2_gnu
_sysroot_linux-64~ conda-forge/noarch::_sysroot_linux-64_curr_repodata_hack-3-h69a702a_16
binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_0
clang conda-forge/linux-64::clang-18.1.8-default_h9e3a008_1
clang-18 conda-forge/linux-64::clang-18-18.1.8-default_hf981a13_1
clang-format conda-forge/linux-64::clang-format-18.1.8-default_hf981a13_1
clang-format-18 conda-forge/linux-64::clang-format-18-18.1.8-default_hf981a13_1
clang-tools conda-forge/linux-64::clang-tools-18.1.8-default_hf981a13_1
clangdev conda-forge/linux-64::clangdev-18.1.8-default_hf981a13_1
clangxx conda-forge/linux-64::clangxx-18.1.8-default_h3d5eb1d_1
cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.5.39-ha770c72_0
cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.5.82-ha770c72_0
cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.5.82-ha770c72_0
cuda-cudart conda-forge/linux-64::cuda-cudart-12.5.82-he02047a_0
cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.5.82-he02047a_0
cuda-cudart-dev_l~ conda-forge/noarch::cuda-cudart-dev_linux-64-12.5.82-h85509e4_0
cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.5.82-he02047a_0
cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.5.82-h85509e4_0
cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.5.82-h85509e4_0
cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.5.82-h85509e4_0
cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.5.82-hcdd1206_0
cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.5.82-ha770c72_0
cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.5.82-hd3aeb46_0
cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.5.82-hd3aeb46_0
cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.5.82-h8a487aa_0
cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.5.82-ha770c72_0
cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.5.82-h59595ed_0
cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.5.82-h59595ed_0
cuda-version conda-forge/noarch::cuda-version-12.5-hd4f0392_3
gcc conda-forge/linux-64::gcc-12.4.0-h236703b_0
gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-12.4.0-hb2e57f8_0
gcc_linux-64 conda-forge/linux-64::gcc_linux-64-12.4.0-h6b7512a_0
gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-12.4.0-h557a472_0
gxx_linux-64 conda-forge/linux-64::gxx_linux-64-12.4.0-h8489865_0
icu conda-forge/linux-64::icu-75.1-he02047a_0
kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-h4a8ded7_16
ld_impl_linux-64 conda-forge/linux-64::ld_impl_linux-64-2.40-hf3520f5_7
libclang conda-forge/linux-64::libclang-18.1.8-default_hf981a13_1
libclang-cpp conda-forge/linux-64::libclang-cpp-18.1.8-default_hf981a13_1
libclang-cpp18.1 conda-forge/linux-64::libclang-cpp18.1-18.1.8-default_hf981a13_1
libclang13 conda-forge/linux-64::libclang13-18.1.8-default_h9def88c_1
libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-12.4.0-ha4f9413_100
libgcc-ng conda-forge/linux-64::libgcc-ng-14.1.0-h77fa898_0
libgomp conda-forge/linux-64::libgomp-14.1.0-h77fa898_0
libiconv conda-forge/linux-64::libiconv-1.17-hd590300_2
libllvm18 conda-forge/linux-64::libllvm18-18.1.8-h8b73ec9_1
libsanitizer conda-forge/linux-64::libsanitizer-12.4.0-h46f95d5_0
libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-12.4.0-ha4f9413_100
libstdcxx-ng conda-forge/linux-64::libstdcxx-ng-14.1.0-hc0a3c3a_0
libxml2 conda-forge/linux-64::libxml2-2.12.7-he7c6b58_4
libzlib conda-forge/linux-64::libzlib-1.3.1-h4ab18f5_1
llvm-tools conda-forge/linux-64::llvm-tools-18.1.8-h8b73ec9_1
llvmdev conda-forge/linux-64::llvmdev-18.1.8-h8b73ec9_1
sysroot_linux-64 conda-forge/noarch::sysroot_linux-64-2.17-h4a8ded7_16
tzdata conda-forge/noarch::tzdata-2024a-h0c530f3_0
xz conda-forge/linux-64::xz-5.2.6-h166bdaf_0
zstd conda-forge/linux-64::zstd-1.5.6-ha6fb4c9_0
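A listing like this one can be checked for channel mixing with a few lines of Python. This is only a sketch: it assumes every line has the `name channel/subdir::name-version-build` shape seen above, embeds a handful of lines copied from the listing, and uses a made-up helper name, `parse_link_line`.

```python
def parse_link_line(line):
    """Split one 'name channel/subdir::dist' line from a solver listing."""
    # The first column may be truncated with '~', so take the name from dist.
    _, spec = line.split()
    channel_subdir, dist = spec.split("::")
    channel, subdir = channel_subdir.rsplit("/", 1)
    name, version, build = dist.rsplit("-", 2)
    return {
        "name": name,
        "version": version,
        "build": build,
        "channel": channel,
        "subdir": subdir,
    }


# A few lines copied from the listing above.
LISTING = """\
cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.5.82-hcdd1206_0
cuda-version conda-forge/noarch::cuda-version-12.5-hd4f0392_3
gcc conda-forge/linux-64::gcc-12.4.0-h236703b_0
_sysroot_linux-64~ conda-forge/noarch::_sysroot_linux-64_curr_repodata_hack-3-h69a702a_16
"""

records = [parse_link_line(line) for line in LISTING.splitlines() if line.strip()]
# Names of packages that did not come from conda-forge.
non_forge = [r["name"] for r in records if r["channel"] != "conda-forge"]
print(non_forge)
```

An empty `non_forge` list means nothing in the sampled lines was pulled from the nvidia channel.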
If NVCC's GCC bounds are too broad, then that's a problem in that recipe, and it needs to be changed and repodata-hotfixed (or else the too-broad bounds will continue causing issues).
Yes, this is the line of questioning we need to explore. The GCC bounds on the cuda-nvcc-dev_linux-64 package are determined by the CUDA Host Compiler Support Policy. The cuda-nvcc package is behaving as it should, according to CUDA compiler support policies. However, something is causing the fallback to GCC 7, when it should prefer a newer GCC. I added the bounds on gcc >=10 in this recipe just to force the solver away from that undesired solution. The fallback to GCC 7 suggests to me that there has been a change in GCC packaging that the solver decides is worth falling back for. I think we need to figure out if this is a GCC packaging bug causing the fallback, or at least understand why it is choosing GCC 7. Once we diagnose that, we could try to escalate that with conda-forge, or we could just narrow the cuda-nvcc bounds on gcc and use a repodata hotfix (e.g. if conda-forge is not interested in patching old compilers), or we could just say the scope of the problem is narrow enough that we only want to impose a narrower gcc constraint on the packages in this repo for now.
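To make the effect of the two bounds concrete, here is a small Python sketch that tests a few GCC versions against the nvcc constraint and against the narrower bound used in the dry run. It is a deliberate simplification, not conda's real version ordering: pre-release suffixes such as "a0" are dropped and versions are compared as integer tuples. The helper names and the list of candidate versions are illustrative assumptions. It also assumes the gcc and gcc_impl_linux-64 packages track the same version.

```python
import re


def parse_version(text):
    """Turn '12.4.0' or '14.0a0' into a tuple of leading integers.

    Simplification: a suffix such as 'a0' is dropped, so '14.0a0'
    becomes (14, 0). This is NOT conda's full version ordering.
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def satisfies(version, spec):
    """Check a version against a comma-separated spec like '>=6,<14.0a0'."""
    v = parse_version(version)
    for clause in spec.split(","):
        op, bound = re.match(r"(>=|<=|>|<)(.+)", clause.strip()).groups()
        b = parse_version(bound)
        ok = {">=": v >= b, "<=": v <= b, ">": v > b, "<": v < b}[op]
        if not ok:
            return False
    return True


NVCC_CONSTRAINT = ">=6,<14.0a0"  # from cuda-nvcc-dev_linux-64 12.5.82
RECIPE_CONSTRAINT = ">=10,<13"   # the extra gcc bound used in the dry run

# Illustrative GCC versions; not a list of what conda-forge actually ships.
for gcc in ["7.5.0", "12.4.0", "13.3.0", "14.1.0"]:
    allowed_by_nvcc = satisfies(gcc, NVCC_CONSTRAINT)
    allowed_by_both = allowed_by_nvcc and satisfies(gcc, RECIPE_CONSTRAINT)
    print(gcc, allowed_by_nvcc, allowed_by_both)
```

Under these assumptions the nvcc bound alone admits both a GCC 7 and a GCC 13 build, while adding the recipe bound leaves only the GCC 12 candidate.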
As for why cuda-nvcc 12.4 from the nvidia channel keeps showing up, it has no runtime constraints. It can satisfy the cuda-nvcc dependency in meta.yaml without having any effect on anything else. If anything about cuda-nvcc 12.5 is somehow incompatible with the other cuda packages or gcc packages, then going back to the cuda-nvcc 12.4 package from nvidia channel is an easy way out.
This is also a problem. We don't want to force users to use CUDA 12.5+, but I don't think we have a way to make earlier versions work as intended, so long as the nvidia channel is present (which is also a requirement). We want to be able to use this package alongside RAPIDS (which uses conda-forge packages and CUDA 12.0-12.5) and/or PyTorch (which uses nvidia packages and CUDA 12.1). Our goal is to come up with a set of run dependencies like
conda create --dry-run -n blah -c conda-forge -c nvidia "cuda-version=12.4" cuda-nvcc "clangdev>=18"
that solves with CUDA 12.4 (or anything older than 12.5) from conda-forge (since the nvidia packages are not useful prior to 12.5). However, I don't see a solution for this.
closes #53