[BUG]: CUB device_transform breaks nvc++ -stdpar

dkolsen-pgi commented 1 month ago

Is this a duplicate?

[x] I confirmed there appear to be no duplicate issues for this bug and that I agree to the Code of Conduct

Type of Bug

Compile-time Error

Component

CUB

Describe the bug

PR #2086 breaks stdexec example nvexec.launch when compiled with NVC++. Compilation fails with unhelpful errors such as error: namespace "cooperative_groups" has no member "thread_block_tile". @ericniebler

PR #2086 added two new files to the CUB headers. One of them, cub/device/dispatch/dispatch_transform.cuh, which is indirectly included from cub/cub.cuh, contains #include <cooperative_groups.h>. The header <cooperative_groups.h> is entirely wrapped by an #if defined(__cplusplus) && defined(__CUDACC__) block. When compiling with nvc++ -stdpar=gpu, the macro __CUDACC__ is not defined, so <cooperative_groups.h> is a no-op. Subsequent attempts to use stuff from the cooperative_groups namespace fail with undefined identifiers.

This doesn't break NVC++'s stdpar parallel algorithms yet because nothing in the parallel algorithm implementation includes cub/cub.cuh or cub/device/device_transform.cuh. But that will change if thrust::transform is changed to use the new CUB transform algorithms. I would like to get this fixed before that happens, when the impact of this bug is still small.

I don't know the correct way to fix this. Some possibilities are:

Change <cooperative_groups.h> to work with nvc++ -stdpar. (CUB would still need to deal with the issue as long as a CUDA Toolkit without the cooperative groups change is still supported.)
Change CUB to not use cooperative groups.
Change CUB to not use cooperative groups when being compiled with 'nvc++ -stdpar'.
Change cub/cub.cuh to not include <cub/device/device_transform.cuh>. Any code that wants to use the new CUB transform algorithms needs to include <cub/device/device_transform.cuh> explicitly. (This then pushes the problem to Thrust, which would need to adopt option 2 or 3.)

All the options have tradeoffs, and I don't know how best to balance those tradeoffs.

How to Reproduce

Though first noticed by stdexec example nvexec.launch, which includes <cub/cub.cuh>, it can be reproduced with a much smaller test, with NVC++ that uses the latest main branch of CCCL.

$ cat cg.cpp
#include <cub/cub.cuh>
int main() { }
$ nvc++ --c++20 -stdpar cg.cpp
"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 65: error: namespace "cooperative_groups" has no member "thread_block_tile"
  struct _async_copy_group_supported<cooperative_groups::thread_block_tile<Sz, TyPar>>
                                                         ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 65: error: expected a ">"
  struct _async_copy_group_supported<cooperative_groups::thread_block_tile<Sz, TyPar>>
                                                                          ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 65: error: expected a ";"
  struct _async_copy_group_supported<cooperative_groups::thread_block_tile<Sz, TyPar>>
                                                                                    ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 68: error: namespace "cooperative_groups" has no member "coalesced_group"
  struct _async_copy_group_supported<cooperative_groups::coalesced_group> : public _CG_STL_NAMESPACE::true_type {};
                                                         ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 70: error: namespace "cooperative_groups" has no member "thread_block"
  struct _async_copy_group_supported<cooperative_groups::thread_block> : public _CG_STL_NAMESPACE::true_type {};
                                                         ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 70: error: class "cooperative_groups::__v1::details::_async_copy_group_supported<<error-type>>" has already been defined (previous definition at line 68)
  struct _async_copy_group_supported<cooperative_groups::thread_block> : public _CG_STL_NAMESPACE::true_type {};
         ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 80: error: namespace "cooperative_groups" has no member "thread_block_tile"
  struct _async_copy_optimize_tile<cooperative_groups::thread_block_tile<1, TyPar>>
                                                       ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 80: error: expected a ">"
  struct _async_copy_optimize_tile<cooperative_groups::thread_block_tile<1, TyPar>>
                                                                        ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 80: error: expected a ";"
  struct _async_copy_optimize_tile<cooperative_groups::thread_block_tile<1, TyPar>>
                                                                                 ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 84: error: namespace "cooperative_groups" has no member "thread_block_tile"
  struct _async_copy_optimize_tile<cooperative_groups::thread_block_tile<Sz, TyPar>>
                                                       ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 84: error: expected a ">"
  struct _async_copy_optimize_tile<cooperative_groups::thread_block_tile<Sz, TyPar>>
                                                                        ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 84: error: expected a ";"
  struct _async_copy_optimize_tile<cooperative_groups::thread_block_tile<Sz, TyPar>>
                                                                                  ^

12 errors detected in the compilation of "cg.cpp".

Expected behavior

It should be possible to use CUB with nvc++ -stdpar without errors.

Reproduction link

No response

Operating System

No response

nvidia-smi output

No response

NVCC version

No response

bernhardmgruber commented 1 month ago

Hi! I am sorry this causes a breakage for nvc++. I didn't know that cooperative_groups are not supported on nvc++. I hope we can detect such a breakage sooner, e.g. when nvc++ CI jobs land #1488.

Since I am leaving for parental leave very soon, the only quick solution I see is

Change cub/cub.cuh to not include <cub/device/device_transform.cuh>.

and then figure out how we can proceed later.

bernhardmgruber commented 1 month ago

Discussed with @jrhemstad, who is going to follow-up on this for the short term.

bernhardmgruber commented 1 week ago

I discussed this briefly with @jrhemstad yesterday and we would like to fix cooperative groups in the long run (option 1). However, this may still take a while. In the meantime, once #2396 is merged, we can disable the ublkcp kernel that uses cooperative groups when compiling with nvc++ (option 3). The prefetch implementation should work with nvc++ and also deliver solid runtime improvements.

bernhardmgruber commented 1 week ago

I could reproduce and workaround the issue by disabling CG and the ublkcp kernel:

~/cccl $ cat cg.cpp 
#include <cub/cub.cuh>
int main() {}
~/cccl $ nvc++ -Icub -Ithrust -Ilibcudacxx/include --c++20 -stdpar cg.cpp
~/cccl $

That's the extent to which I could test CUB with nvc++.

NVIDIA / cccl