jakirkham commented 1 year ago

With CUDA 11 support in conda-forge, the cudatoolkit package provided both the CUDA libraries and a way to express what CUDA version a particular package was built with.

As part of adding CUDA 12, packages are restructured to split out each CUDA library. This means each CUDA library will be a separate dependency. However this restructuring results in some information not being captured:

What CUDA compiler version a package was built with?
Which CUDA versions a package would support (what CUDA driver version requirements might exist)?
How a particular CUDA library is tied to a specific CUDA version?

Additionally there is a question about how CUDA version tracking support ties into CUDA compatibility

Raising this issue so we can discuss and decide how we want to leverage this package

jakirkham commented 1 year ago

cc @adibbley @bdice @robertmaynard (since we were discussing this last week)

cc @kkraus14 (in case you have thoughts here)

Also please feel free to include others

jakirkham commented 1 year ago

cc @raydouglass @ajschmidt8

jakirkham commented 1 year ago

A related question is whether we want an option to pull in dependencies that align with a particular CUDA Toolkit release. IOW cuda-version=12.0 would ensure any CUDA libraries installed would be aligned with with CUDA Toolkit 12.0

leofang commented 1 year ago

What CUDA compiler version a package was built with?

I hope this is not of concern to cuda-version. Doesn't conda-build have a way to keep track of the used version of {{ compiler("XXX") }} (regardless if XXX is c, cxx, cuda, or @bdice's cuda11/cuda12)? I hope to keep the status quo. We need to use this information to compute the package hash.

Which CUDA versions a package would support (what CUDA driver version requirements might exist)?

+1 on this:

if a package needs a minimal (or a range of) driver version, pin at __cuda
if a package needs a minimal (or a range of) runtime version, pin at cuda-version

How a particular CUDA library is tied to a specific CUDA version?

I would think this is equivalent to the above question.

Additionally there is a question about how CUDA version tracking support ties into CUDA compatibility

ditto, I would think this is equivalent to the above question.

A related question is whether we want an option to pull in dependencies that align with a particular CUDA Toolkit release. IOW cuda-version=12.0 would ensure any CUDA libraries installed would be aligned with with CUDA Toolkit 12.0

I would hope this continues to be the job of a meta package (cudatoolkit or cuda-toolkit), as it is today.

bdice commented 1 year ago

Yup, I think I align with @leofang on pretty much everything above.

At the high level, I was expecting that every CUDA package would depend on an exact pinning like cuda-version==12.0.

IOW cuda-version=12.0 would ensure any CUDA libraries installed would be aligned with with CUDA Toolkit 12.0

Yes, users can constrain their installed CUDA packages (e.g. math libraries) with conda install [...CUDA packages...] cuda-version==12.0, or downstream packages can require cuda-version>=12.1 for CUDA runtime requirements. As @leofang said above, the driver side of things would be handled by __cuda metapackage requirements.

cuda-version brings some of the same benefits as the versioning of the metapackage like cuda-toolkit=12.0, but without requiring all the toolkit packages to be installed. It enforces consistency -- the CUDA Toolkit is disaggregated into independent packages now, but I am under the impression that the packages should come from the same CUDA Toolkit release and not be mixed-and-matched. Is that accurate? I assume you shouldn't be able to install 12.0 compilers, 12.1 math libraries, and 12.2 nvJitLink -- and that there would be no real use case for such a situation. Making all the CUDA packages depend on a particular cuda-version constrains all CUDA packages in a given conda environment to be released as part of the same CUDA Toolkit version, which sounds like a good thing.

leofang commented 1 year ago

cuda-version brings some of the same benefits as the versioning of the metapackage like cuda-toolkit=12.0, but without requiring all the toolkit packages to be installed.

Actually, I think this can be nicely addressed:

Add cuda-version ==X.Y to cuda-toolkit's run requirement
Add all runtime libraries to cuda-toolkit's run_constrained requirement

The (anticipated) effect is:

By just installing cuda-toolkit, which is empty, there's no disk/network usage
Once any CUDA component is installed, either explicitly by the user or implicitly by other packages' requirements, the compatibility check among cuda-version and cuda-toolkit is enforced by the conda solver

The only missing piece I haven't figured out is a command or something for users to express "I want to actually install everything that cuda-toolkit would offer in the old days," without listing them explicitly like conda install A B C D E ... Z 😅 But perhaps it's not that important?

leofang commented 1 year ago

but I am under the impression that the packages should come from the same CUDA Toolkit release and not be mixed-and-matched. Is that accurate?

I would agree. I've asked a math lib team before, and they don't do any mix-and-match tests so far, so there's always a risk that this doesn't work out of box. So,

Making all the CUDA packages depend on a particular cuda-version constrains all CUDA packages in a given conda environment to be released as part of the same CUDA Toolkit version, which sounds like a good thing.

Yes, I'd argue that let's start from a tighter constraint first, make sure it works for the whole community, and once the GPU CI is up, we conduct studies and decide if we can relax some of the above assumptions. Going from tight to loose is always safer than the opposite direction 🙂

leofang commented 1 year ago

I should add that I made an implicit assumption above that any CUDA component released at CTK X.Y will have

run:
  - cuda-version >=X.Y,<X+1

i.e. I set its lower bound to the CTK version that it comes with (which is a tighter requirement than what minor version compatibility would guarantee).

leofang commented 1 year ago

Additionally there is a question about how CUDA version tracking support ties into CUDA compatibility

ditto, I would think this is equivalent to the above question.

Sorry, I spoke too fast, we need cuda-version of version X.Y to depend on __cuda >=X.0. That a CTK requires a minimal driver version has been always the case, what's different here is that we relax to X.0 to declare CTK ver X has minor version compatibility (=new CTK, old driver, both within the same major release), as opposed to setting __cuda >=X.Y.

jakirkham commented 1 year ago

So would we want to change this?

https://github.com/conda-forge/cuda-version-feedstock/blob/df29c50bde844c13ebc2f4dc5ec072fcbd830611/recipe/meta.yaml#L17-L19

kkraus14 commented 1 year ago

My 2c, ultimately when it comes to CUDA packages users will want a few different behaviors:

The ability to get a specific shipped full toolkit at a given version, i.e. 12.0 or 12.1.
- This would presumably be handled by either the cuda-toolkit or cuda package?
The ability to get specific packages at a given toolkit version, i.e. 12.0 or 12.1.
- This would presumably be handled by a combination of this cuda-version package along with some type of run_constrained?
The ability to get specific packages not tied to a given toolkit version, i.e. I may want to get Thrust 2.0.1
- I'm unclear if we want to allow this for only specific packages that have historically and/or explicitly supported building and running on different CUDA versions like Thrust, CUB, and libcu++ or more generally for all of the cuda packages.
Not to allow solving to an incompatible environment, i.e. libcusolver depends on libcublas where if they both require the same minor / patch version, there should be appropriate pinning to enforce such.

I'm okay with the proposal of making all of the packages tied to a given toolkit version as long as Thrust, CUB, and libcu++ are excluded. It would be nice if we still gave a path for users to request "Thrust version shipped with CUDA Toolkit 12.0" though.

leofang commented 1 year ago

So would we want to change this?

I guess not? This could allow CPU users to install it. (So, I wasn't being accurate enough to just say "depend on", I should have said "setting run_constrained to be", which is what we do now. 🙂)

jakirkham commented 1 year ago

Ah was more meaning whether this should become (though maybe this doesn't matter much)

  requirements: 
    run_constrained: 
-     - __cuda >={{ major_version }}
+     - __cuda >={{ major_version }}.0

Though yeah run_constrained for CPU users (also us for building) and older Conda's are considerations for run_constrained

leofang commented 1 year ago

@kkraus14

The ability to get a specific shipped full toolkit at a given version, i.e. 12.0 or 12.1.

Yes, as mentioned above I haven't figured this out.

The ability to get specific packages at a given toolkit version, i.e. 12.0 or 12.1.

I hope this is addressed by my proposal above.

The ability to get specific packages not tied to a given toolkit version, i.e. I may want to get Thrust 2.0.1

I am not certain about this. Perhaps, if we don't list cuda-version to any of its requirement, this would be possible? Or if we adopt a relaxed scheme for it (see below)?

I'm okay with the proposal of making all of the packages tied to a given toolkit version as long as Thrust, CUB, and libcu++ are excluded.

Yeah it sounds like you want them to not have cuda-version listed in run/run_constrained? How about a more relaxed scheme for these CCCL components, such as

- run_constrained:
  - cuda-version >=X.0

for the CCCL components released in any CTK ver X or X+1? IIUC this aligns with CCCL's (presumed) goal.

leofang commented 1 year ago

Ah was more meaning whether this should become (though maybe this doesn't matter much)

@jakirkham I didn't know the patch has any semantic difference, if it has then this is the right patch I'd think.

kkraus14 commented 1 year ago

Yeah it sounds like you want them to not have cuda-version listed in run/run_constrained? How about a more relaxed scheme for these CCCL components, such as
- run_constrained:
  - cuda-version >=X.0
for the CCCL components released in any CTK ver X or X+1? IIUC this aligns with CCCL's (presumed) goal.

The only challenge for this is how do I say "I want to install Thrust that shipped with CUDA Toolkit 12.0. The cuda-version constraint would still be specified with newer versions.

bdice commented 1 year ago

cuda-version brings some of the same benefits as the versioning of the metapackage like cuda-toolkit=12.0, but without requiring all the toolkit packages to be installed.

Actually, I think this can be nicely addressed:

Add cuda-version ==X.Y to cuda-toolkit's run requirement

Add all runtime libraries to cuda-toolkit's run_constrained requirement

The (anticipated) effect is:

By just installing cuda-toolkit, which is empty, there's no disk/network usage

[...] The only missing piece I haven't figured out is a command or something for users to express "I want to actually install everything that cuda-toolkit would offer in the old days," without listing them explicitly like conda install A B C D E ... Z 😅 But perhaps it's not that important?

We should keep cuda-toolkit as a "full" metapackage, not a "network install on-demand." Users should be able to install cuda-toolkit and build software from source that leverages the full CUDA toolkit. If we make cuda-toolkit only a constraint package, then users only get the real CUDA compilers/libraries/etc. by installing them manually or installing other packages that depend on them. That's out of alignment with what it means to "install the CUDA toolkit" with any other package manager or from the CUDA Downloads page.

bdice commented 1 year ago

I should add that I made an implicit assumption above that any CUDA component released at CTK X.Y will have
run:
  - cuda-version >=X.Y,<X+1
i.e. I set its lower bound to the CTK version that it comes with (which is a tighter requirement than what minor version compatibility would guarantee).

I think we want the CUDA component packages released with CTK X.Y to pin to cuda-version==X.Y exactly (not a range). Otherwise the cuda-version package isn't able to constrain correctly. If a user pins to cuda-version==12.1, packages for cuda-cudart from 12.0 should not satisfy that constraint.

Users downstream would be able to specify a dependency on cuda-version>=X.Y if they require that runtime version.

Sorry, I spoke too fast, we need cuda-version of version X.Y to depend on __cuda >=X.0. [...]That a CTK requires a minimal driver version has been always the case, what's different here is that we relax to X.0 to declare CTK ver X has minor version compatibility (=new CTK, old driver, both within the same major release), as opposed to setting __cuda >=X.Y.

We actually need no dependency on __cuda in the CUDA toolkit packages, from my understanding. The CUDA toolkit should be installable without having a CUDA driver, right? In other words, we should be able to install cuda-nvcc and related packages without a GPU present.

edit: More specifically, I saw @jakirkham point to run_constrained: __cuda>={{ major_version }} in cuda-version. That is problematic if you have a system with an older GPU (doesn't support CUDA 12) but want to build CUDA 12 packages. CPU-only machines would satisfy this because they have no __cuda and thus no constraint -- but ideally CPU-only and old-GPU machines would both be usable for building (not running) with CUDA 12.

__cuda definition: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-virtual.html

In the past, I've recommended using __cuda to help constrain whether CPU/GPU packages are fetched. https://github.com/openmm/openmm/issues/3059#issuecomment-797023166

__cuda also has important implications for CI systems building without GPUs: https://github.com/conda-forge/conda-forge-ci-setup-feedstock/pull/144/

leofang commented 1 year ago

We should keep cuda-toolkit as a "full" metapackage, not a "network install on-demand."

Sure, this would save me from having to answer this question:

The only missing piece I haven't figured out is a command or something for users to express "I want to actually install everything that cuda-toolkit would offer in the old days," without listing them explicitly like conda install A B C D E ... Z 😅 But perhaps it's not that important?

with the trade-off that all components are installed. As long as we agree this is the right design and document it. I am no problem.

leofang commented 1 year ago

I think we want the CUDA component packages released with CTK X.Y to pin to cuda-version==X.Y exactly (not a range). Otherwise the cuda-version package isn't able to constrain correctly. If a user pins to cuda-version==12.1, packages for cuda-cudart from 12.0 should not satisfy that constraint.

Users downstream would be able to specify a dependency on cuda-version>=X.Y if they require that runtime version.

@bdice, I don't understand now 😅 Say cuBLAS from CTK 12.1 tightly pins cuda-version==12.1. How am I going to allow downstream users or package maintainers to take advantage of minor version compatibility and declare any cuBLAS from CTK 12 is fine? I thought you agreed that

if a package needs a minimal (or a range of) runtime version, pin at cuda-version

bdice commented 1 year ago

How am I going to allow downstream users or package maintainers to take advantage of minor version compatibility and declare any cuBLAS from CTK 12 is fine?

I think you could pin freely on cuBLAS and let cuda-version handle the constraints?

run:
  - cuda-cublas
  - cuda-version>=12.0,<13

bdice commented 1 year ago

I thought you agreed that

if a package needs a minimal (or a range of) runtime version, pin at cuda-version

I was thinking you meant this for downstream packages -- CUDA Toolkit packages should be exactly pinned to cuda-version==X.Y, and downstream consumers of the CUDA Toolkit can pin library names (e.g. cuda-cublas) and allowable ranges of cuda-version (e.g. cuda-version>=12.0,<13).

leofang commented 1 year ago

Ah OK thanks for clarifying, Bradley. So you're saying if we make a super tight pinning for CUDA components, cuda-version becomes the version selector that used to be served by the old cudatoolkit? So, instead of

conda install cupy rmm cudatoolkit=12.3

we would do

conda install cupy rmm cuda-version=12.3

If so I am on board. Speaking from the component author viewpoint, this would allow me to express "this version of cuSOLVER needs exactly this version of cuBLAS" easily.

edit: More specifically, I saw @jakirkham point to run_constrained: __cuda>={{ major_version }} in cuda-version. That is problematic if you have a system with an older GPU (doesn't support CUDA 12) but want to build CUDA 12 packages.

I am not sure this is OK.

Don't we still need libcuda from CTK 12 if driver API is used in a package? If we have an old GPU and we're building a package for new CTK, I'd think we need a compatible driver. Or, do we plan to shift this responsibility to cuda-compat alone and not worry about this? Update: see https://github.com/conda-forge/cuda-version-feedstock/issues/1#issuecomment-1462856372.
We really need to ensure when users install GPU packages, they have a compatible driver. By not putting this constraint somewhere, we lose this ability.

bdice commented 1 year ago

I wanted to reiterate the behaviors @kkraus14 raised above, with context from this continued discussion so far. I think these outline a good set of core behaviors and establish expectations that we can (and should) meet. There may be needs above and beyond these, but I think it's unlikely that other needs would conflict with the goals Keith laid out. It sounds like we're starting to narrow in on our expectations, so I hope this can clarify/corroborate previous statements (it shouldn't conflict with the consensus we've built above, I hope).

My 2c, ultimately when it comes to CUDA packages users will want a few different behaviors:

The ability to get a specific shipped full toolkit at a given version, i.e. 12.0 or 12.1.

This would presumably be handled by either the cuda-toolkit or cuda package?

The full toolkit should ship as cuda-toolkit with X.Y versioning (and potentially patches? Let's handle that separately).

The ability to get specific packages at a given toolkit version, i.e. 12.0 or 12.1.

This would presumably be handled by a combination of this cuda-version package along with some type of run_constrained?

The ability to get specific packages not tied to a given toolkit version, i.e. I may want to get Thrust 2.0.1

I'm unclear if we want to allow this for only specific packages that have historically and/or explicitly supported building and running on different CUDA versions like Thrust, CUB, and libcu++ or more generally for all of the cuda packages.

Not to allow solving to an incompatible environment, i.e. libcusolver depends on libcublas where if they both require the same minor / patch version, there should be appropriate pinning to enforce such.

Users should be able to install like cuda-version==12.1 cuda-cudart cuda-cublas and get 12.1 packages for cudart and cuBLAS. All of the above goals should be met by making the CUDA component packages depend on cuda-version==X.Y. That pinning should be applied to:

compilers
runtime libraries
math libraries (including dev/static), nvJitLink
headers like cuda-profiler-api
tools like nsight
(other notable categories?)

Exceptions to the cuda-version pinning might include libraries that float more freely or may have broader compatibility like:

cuda-cccl
cuda-python
(other notable exceptions?)

leofang commented 1 year ago

Don't we still need libcuda from CTK 12 if driver API is used in a package?

I forgot we ship the libcuda stub somewhere, so this is moot. Only the 2nd question still persists.

bdice commented 1 year ago

We really need to ensure when users install GPU packages, they have a compatible driver. By not putting this constraint somewhere, we lose this ability.

In my understanding, the status quo is that downstream packages must declare their dependence on the driver (via the __cuda virtual package) if they want a constraint for where the package can be installed/run. Examples:

edit: see below, this is inaccurate

~Driver version compatibility doesn't seem to be handled by cudatoolkit in CUDA 11 -- and I think that's the status quo we want in CUDA 12. Downstream packages are responsible for knowing and declaring their (potentially distinct) runtime (cuda-toolkit and component packages) and driver (__cuda virtual package) requirements.~

leofang commented 1 year ago

This doesn't seem to be handled by cudatoolkit in CUDA 11 -- and I think that's the status quo we want in CUDA 12

No, it's not true: https://github.com/conda-forge/cudatoolkit-feedstock/blob/531e4594992258568fe187bc5c4e40d8c9c57b27/recipe/meta.yaml#L576-L582 and it's dangerous to omit. I'd rather set up the constraint correctly for users (following the same spirit as the above discussion), and figure out if there's a way to relax for certain package maintainers (or if it's even needed, given this is the status quo).

bdice commented 1 year ago

Hmm. 🤔 You're right. (Unfortunately that doesn't handle the "old GPU" case -- if you have any GPU installed, its driver must be compatible with the toolkit you install, even if it's only for building and not running...). I can change my position in light of that, and support keeping the run_constrained on __cuda somewhere (probably cuda-version). Downstream packages may still want to specify __cuda dependencies in certain cases (CPU/GPU split packages or particular driver requirements for example).

leofang commented 1 year ago

The ability to get specific packages not tied to a given toolkit version, i.e. I may want to get Thrust 2.0.1

I think this is the last question that is unresolved by the above discussion, before we can write up a summary (yay!) In Bradley's approach (tightly pinning cuda-version) this would simply be not possible, and we must adopt a relaxed scheme / exception that both Bradley and I touched about (here and here).

@kkraus14 any comment on this?

tools like nsight

btw I think Nsight (System/Compute) can be relaxed too. They actually release it outside of CTK, and I've tested the same Nsight version works for both CTK 11/12 just fine. I think the CTK bundle is just for convenience, but we can certainly ask around to confirm.

kkraus14 commented 1 year ago

If someone has an old GPU driver, say their GPU is older and doesn't support newer drivers, I think the default behavior we should give is to constrain them to things that run on their system.

If they want to temporarily bypass that, they can do so using the CONDA_OVERRIDE_CUDA environment variable as documented here: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-virtual.html

If this isn't clear enough to this corner case of users, then I think we should treat this as a documentation problem as opposed to a solver constraint problem.

leofang commented 1 year ago

UPDATE: The latest summary of this cuda-version package can be found here: https://github.com/conda-forge/cuda-version-feedstock/blob/main/recipe/README.md

I took a stab at summarizing the above discussion. Let me know if I miss (or am wrong about) anything, and I'll edit in place. Thanks!

Summary

In-scope for `cuda-version`

Questions/demands listed below are addressed:

Which CUDA versions a package would support?
How a particular CUDA library is tied to a specific CUDA version?
How to pull in dependencies that align with a particular CUDA Toolkit release?
What CUDA compiler version a package was built with?
The ability to get a specific shipped full toolkit at a given version, i.e. 12.0 or 12.1.
The ability to get specific packages at a given toolkit version, i.e. 12.0 or 12.1.
Not to allow solving to an incompatible environment

Out-of-scope for `cuda-version`

What CUDA driver version requirements might exist?

Driver requirement should be set up using the __cuda virtual package. For package maintainers/users who wish to overwrite (at their own risk), use the environment variable CONDA_OVERRIDE_CUDA.

How CUDA version tracking support ties into CUDA compatibility

For CUDA components from CTK, we currently disallow mix-n-match. This constraint is something we will revisit in the future.
For downstream packages depending on CUDA, use cuda-version >=X.0,<X+1 to allow minor version compatibility (see below).

The ability to get specific packages not tied to a given toolkit version, i.e. I may want to get Thrust 2.0.1

Unclear if possible under the "tight constraint" scheme that we plan to adopt.

Scheme

This feedstock (cuda-version):
- Set up the driver requirement using __cuda for a given CTK major version X (see https://github.com/conda-forge/cuda-version-feedstock/issues/1#issuecomment-1462752780).
nvcc compiler:
- Pin a range of cuda-version, with the lower bound being the CTK version X.Y that it comes with
CTK components:
- Pin at the exact cuda-version corresponding to the CTK version X.Y that they come with. Exceptions should be discussed on a case-by-case basis.
- This applies to all flavors (lib, lib-dev and lib-static)
Downstream packages:
- Depend on needed CTK components without version constraints, and set the version constraint on CUDA by depending only on cuda-version (plus __cuda, if applicable)
Downstream users:
- Use cuda-version as the CTK version selector to set up a consistent environment

Examples

For maintainers of nvcc

Example: nvcc from the CTK version X.Y

- run_exports:
  - cuda-version >=X.Y,<X+1

For maintainers of all CTK components

Example: cuBLAS from the CTK version X.Y

- host:
  - cuda-version X.Y
- run:
  - {{ pin_compatible("cuda-version", max_pin="x.x") }}

Example: cuSOLVER depends on cuBLAS from the same CTK version X.Y

- host:
  - cuda-version X.Y
  - libcublas
- run:
  - {{ pin_compatible("cuda-version", max_pin="x.x") }}
  - {{ pin_compatible("libcublas", max_pin="x.x") }}

For maintainers of GPU packages depending on CUDA

Example: A package depends on cuBLAS and supports CUDA minor version compatibility

- run:
  - cuda-version >=X.0,<X+1
  - libcublas

For users of GPU packages

Example: Set up an environment compatible with CTK 12.1

conda install -c conda-forge cupy rmm "cuda-version=12.1"

Example: Set up an environment compatible with CTK 12 (of any minor version)

conda install -c conda-forge cupy rmm "cuda-version=12"

Example: Just set up a legit CUDA environment

conda install -c conda-forge cupy rmm

jakirkham commented 1 year ago

Thanks all for sharing your thoughts here! 🙏

Generally this seems reasonable

Will focus on one point that is unaddressed

What CUDA compiler version a package was built with?

This should be queried from {{ compiler("cuda") }} or equivalent.

So at build time of a package, the compiler version would be set. Currently this is done as a global pin (with the option to override for a particular package if needed). Likely we would continue this for CUDA 12.

Once the package was built (using the current model), cudatoolkit would be added as a run dependency. However this would need to change in the future as cudatoolkit would not be added as a dependency of a package.

This matters as a package may be able to use a new CUDA feature (for example the CUDA stream-ordered allocator added in 11.2) only if it was built with a new enough compiler version. However to get that package built with that compiler version, somehow a user must specify that dependency.

So how do we encode this information for users of packages? Essentially this would be adding some kind of runtime dependency on packages built with cuda-nvcc, but what should it be. Here are some options:

cuda-version (same as used by CTK packages)
A second/different metapackage, like cuda-nvcc-version or similar
cuda-cudart (extra as it is statically linked by cuda-nvcc currently, but maybe that could change)
__cuda (users could affect this with CONDA_OVERRIDE_CUDA as noted above)
?

Thoughts on any of these (or others)?

robertmaynard commented 1 year ago

If someone has an old GPU driver, say their GPU is older and doesn't support newer drivers, I think the default behavior we should give is to constrain them to things that run on their system.

If they want to temporarily bypass that, they can do so using the CONDA_OVERRIDE_CUDA environment variable as documented here: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-virtual.html

If this isn't clear enough to this corner case of users, then I think we should treat this as a documentation problem as opposed to a solver constraint problem.

This won't break CI machines which have either no GPU driver or an old one ( and link to stubs/libcuda.so )?

leofang commented 1 year ago

Thanks, @jakirkham, this is a very good (and subtle) point.

So how do we encode this information for users of packages? Essentially this would be adding some kind of runtime dependency on packages built with cuda-nvcc, but what should it be. Here are some options:

cuda-version (same as used by CTK packages)

I would think the new compiler package should continue doing run_exports but use this, so for nvcc from CTK ver X.Y we do

- run_exports:
  - cuda-version >=X.Y,<X+1

Seems to me it'd match the current behavior that you linked to.

leofang commented 1 year ago

This won't break CI machines which have either no GPU driver or an old one ( and link to stubs/libcuda.so )?

@robertmaynard as discussed above, we'd only list __cuda under run_constrained, meaning

On CPU-only machines, __cuda isn't set, so there's simply no conflict/problem
On machines with an old driver, __cuda is set to a lower value, so we need CONDA_OVERRIDE_CUDA to overwrite it and let the (new) libcuda stub kick in.

jakirkham commented 1 year ago

Thanks all! 🙏

So it sounds like we have reached consensus. Though please let me know if I've missed something

Next steps would be updating the CUDA Toolkit libraries (notably cublas) to follow this approach. Anything else we should do?

kkraus14 commented 1 year ago

What was the conclusion / where did we land for packages like Thrust, cub, and libcu++ that have versions not tied to a specific CUDA Toolkit release?

leofang commented 1 year ago

@kkraus14 Maybe you've missed it, I left a question for you:

The ability to get specific packages not tied to a given toolkit version, i.e. I may want to get Thrust 2.0.1

I think this is the last question that is unresolved by the above discussion, before we can write up a summary (yay!) In Bradley's approach (tightly pinning cuda-version) this would simply be not possible, and we must adopt a relaxed scheme / exception that both Bradley and I touched about (here and here).

@kkraus14 any comment on this?

Would it be OK to you that we relax the pinning to cuda-version a bit (in anyway -- but which way?) for CCCL?

leofang commented 1 year ago

Anything else we should do?

@jakirkham Could you elaborate a bit how would {{ compiler("cuda") }} be set up for CTK 12? I understand there's nvcc-feedstock, and somehow {{ compiler("cuda") }} magically maps to the nvcc wrapper there, but it's unclear to me how this mapping works and how we'd map it to cuda-nvcc (but only for CTK >=12, and preserves the status quo for CTK <12).

kkraus14 commented 1 year ago

Would it be OK to you that we relax the pinning to cuda-version a bit (in anyway -- but which way?) for CCCL?

I think ideally from my perspective we could have a cuda-cccl package that is a metapackage around Thrust, cub, and libcu++ and the cuda-cccl package uses cuda-version similar to other packages like cublas. This way if someone wants specifically what's in CUDA 12.0 or 12.1 they can use the cuda-cccl package. The tradeoff being they can only get all three packages as opposed to just one of them, but that's a very reasonable tradeoff from my perspective, especially considering they're header only (at least as of now).

Then for the individual packages like Thrust, cub, and libcudacxx they can have a more relaxed pinning that allows them to float forward.

The discussion about whether cuda-cccl can be a metapackage around Thrust, cub, and libcudacxx packages is happening here: https://github.com/conda-forge/staged-recipes/pull/21953

leofang commented 1 year ago

I guess you want to keep the freedom of using thrust/cub in a mix-n-match fashion? If so, any chance cuda-cccl cannot be a real package that depends on cuda-version and also has this setup?

- run_constrained:
  - thrust <0.0a0
  - cub <0.0a0
  - libcudacxx <0.0a0  # assuming this is on conda-forge

This would make sure that the new cuda-cccl is mutually exclusive to thrust/cub; that is, they cannot coexist in the same environment.

kkraus14 commented 1 year ago

I guess you want to keep the freedom of using thrust/cub in a mix-n-match fashion?

Yes. There's no reason that we shouldn't be able to package things like thrust, cub, and others that have release schedules disjointed from the cuda toolkit.

If so, any chance cuda-cccl cannot be a real package that depends on cuda-version and also has this setup?
- run_constrained:
  - thrust <0.0a0
  - cub <0.0a0
  - libcudacxx <0.0a0  # assuming this is on conda-forge
This would make sure that the new cuda-cccl is mutually exclusive to thrust/cub; that is, they cannot coexist in the same environment.

The problem with this is someone might have an environment with two libraries, say LibraryA and LibraryB, both of which depend on Thrust in run because they expose Thrust in their public interfaces. LibraryA builds with the Thrust package, LibraryB builds with the cuda-cccl package. We end up in an unsolvable situation.

jakirkham commented 1 year ago

Anything else we should do?

@jakirkham Could you elaborate a bit how would {{ compiler("cuda") }} be set up for CTK 12? I understand there's nvcc-feedstock, and somehow {{ compiler("cuda") }} magically maps to the nvcc wrapper there, but it's unclear to me how this mapping works and how we'd map it to cuda-nvcc (but only for CTK >=12, and preserves the status quo for CTK <12).

Sure currently we specify the cuda_compiler in conda-forge-pinning. Since we would want different ones for different CUDA versions, we would add it to zip_keys alongside cuda_compiler_version and expand cuda_compiler to multiple values (with the older compiler repeated for old CUDA versions). This shouldn't be too hard and could be handled as part of a CUDA 12 migrator.

If we add these packages also for CUDA 11 versions (this would require some careful work), we could potentially simplify this structure in the future.

jakirkham commented 1 year ago

Also would propose we solve the Thrust/CUB/libcudacxx/CCCL story separately. There's a few things going on there that need to be addressed. Should add see this as the next thing to solve after wrapping up this conversation about cuda-version (hence the interest in wrapping this one up if we have reached consensus ;)

leofang commented 1 year ago

(hence the interest in wrapping this one up if we have reached consensus ;)

I updated my summary at https://github.com/conda-forge/cuda-version-feedstock/issues/1#issuecomment-1462998398. I agree if there's no more question (other than CCCL) we close this issue and start working on the libraries.

bdice commented 1 year ago

One minor clarification to @leofang's summary:

Set up the driver requirement using __cuda for a given CTK version X.Y

The driver requirement should be __cuda >={{ major_version }}, like >=12 (without the .Y), to match the cudatoolkit status quo and utilize CUDA major version compatibility.

As for cuda-cccl / Thrust / CUB / libcu++ discussions, I agree that we should discuss that in a separate thread from the overall versioning/dependency strategy as an exceptional case. A metapackage may not be the right long term solution because those repos will soon be combined into a monorepo and released on a unified cycle.

leofang commented 1 year ago

The driver requirement should be __cuda >={{ major_version }}, like >=12 (without the .Y), to match the cudatoolkit status quo and utilize CUDA major version compatibility.

Yes, this was what's implied, see John K's https://github.com/conda-forge/cuda-version-feedstock/issues/1#issuecomment-1462752780 above (btw I still think >=X is just the same as >=X.0), but let me add to the summary for clarity.

jakirkham commented 1 year ago

btw I still think >=X is just the same as >=X.0

Yeah that sounds right

bdice commented 1 year ago

Clarification: should all parts of a package (e.g. libcublas, libcublas-dev, libcublas-static) depend on cuda-version, or just the main package (e.g. libcublas)?

If we make all parts depend on cuda-version (which is what I expect), then does it make sense to have pinnings like libcublas-dev having a run: libcublas >={{ version }} dependency, or should that be replaced by a tightly-constrained pin_subpackage? Is it even possible to install newer library versions and older dev packages if both depend on a specific pinning of cuda-version?

I asked this for cuda-cudart too, which has additional platform-specific packages like cuda-cudart-dev_{{ target_platform }} -- do the same rules apply there?

leofang commented 1 year ago

The only concern I could think of is whether it'd break the story that we build with higher CTK version and allow it to run with lower CTK version (both within the same major release), as discussed in https://github.com/conda-forge/cuda-version-feedstock/issues/1#issuecomment-1462833464.

But if the -dev/-static packages don't export its cuda-version constraint, it should be fine? (Even if it's exported, we can always undo it with ignore_run_export{_from}, but it's more work, more error prone, and less intuitive.)

Another thought is I feel the -dev/-static packages, which are only used at compile time, should follow the compiler constraint that we set, which currently is

nvcc compiler: Pin a range of cuda-version, with the lower bound being the CTK version X.Y that it comes with

conda-forge / cuda-version-feedstock

`cuda-version`: How do we want to use this? #1

Summary

In-scope for `cuda-version`

Out-of-scope for `cuda-version`

Scheme

Examples

For maintainers of nvcc

For maintainers of all CTK components

For maintainers of GPU packages depending on CUDA

For users of GPU packages

conda-forge / cuda-version-feedstock

`cuda-version`: How do we want to use this? #1

Summary

In-scope for cuda-version

Out-of-scope for cuda-version

Scheme

Examples

For maintainers of nvcc

For maintainers of all CTK components

For maintainers of GPU packages depending on CUDA

For users of GPU packages

In-scope for `cuda-version`

Out-of-scope for `cuda-version`