NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/
Other
1.25k stars 160 forks source link

[BUG]: CUB device_transform breaks nvc++ -stdpar #2402

Closed dkolsen-pgi closed 5 days ago

dkolsen-pgi commented 1 month ago

Is this a duplicate?

Type of Bug

Compile-time Error

Component

CUB

Describe the bug

PR #2086 breaks stdexec example nvexec.launch when compiled with NVC++. Compilation fails with unhelpful errors such as error: namespace "cooperative_groups" has no member "thread_block_tile". @ericniebler

PR #2086 added two new files to the CUB headers. One of them, cub/device/dispatch/dispatch_transform.cuh, which is indirectly included from cub/cub.cuh, contains #include <cooperative_groups.h>. The header <cooperative_groups.h> is entirely wrapped by an #if defined(__cplusplus) && defined(__CUDACC__) block. When compiling with nvc++ -stdpar=gpu, the macro __CUDACC__ is not defined, so <cooperative_groups.h> is a no-op. Subsequent attempts to use stuff from the cooperative_groups namespace fail with undefined identifiers.

This doesn't break NVC++'s stdpar parallel algorithms yet because nothing in the parallel algorithm implementation includes cub/cub.cuh or cub/device/device_transform.cuh. But that will change if thrust::transform is changed to use the new CUB transform algorithms. I would like to get this fixed before that happens, when the impact of this bug is still small.

I don't know the correct way to fix this. Some possibilities are:

  1. Change <cooperative_groups.h> to work with nvc++ -stdpar. (CUB would still need to deal with the issue as long as a CUDA Toolkit without the cooperative groups change is still supported.)
  2. Change CUB to not use cooperative groups.
  3. Change CUB to not use cooperative groups when being compiled with 'nvc++ -stdpar'.
  4. Change cub/cub.cuh to not include <cub/device/device_transform.cuh>. Any code that wants to use the new CUB transform algorithms needs to include <cub/device/device_transform.cuh> explicitly. (This then pushes the problem to Thrust, which would need to adopt option 2 or 3.)

All the options have tradeoffs, and I don't know how best to balance those tradeoffs.

How to Reproduce

Though first noticed by stdexec example nvexec.launch, which includes <cub/cub.cuh>, it can be reproduced with a much smaller test, with NVC++ that uses the latest main branch of CCCL.

$ cat cg.cpp
#include <cub/cub.cuh>
int main() { }
$ nvc++ --c++20 -stdpar cg.cpp
"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 65: error: namespace "cooperative_groups" has no member "thread_block_tile"
  struct _async_copy_group_supported<cooperative_groups::thread_block_tile<Sz, TyPar>>
                                                         ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 65: error: expected a ">"
  struct _async_copy_group_supported<cooperative_groups::thread_block_tile<Sz, TyPar>>
                                                                          ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 65: error: expected a ";"
  struct _async_copy_group_supported<cooperative_groups::thread_block_tile<Sz, TyPar>>
                                                                                    ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 68: error: namespace "cooperative_groups" has no member "coalesced_group"
  struct _async_copy_group_supported<cooperative_groups::coalesced_group> : public _CG_STL_NAMESPACE::true_type {};
                                                         ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 70: error: namespace "cooperative_groups" has no member "thread_block"
  struct _async_copy_group_supported<cooperative_groups::thread_block> : public _CG_STL_NAMESPACE::true_type {};
                                                         ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 70: error: class "cooperative_groups::__v1::details::_async_copy_group_supported<<error-type>>" has already been defined (previous definition at line 68)
  struct _async_copy_group_supported<cooperative_groups::thread_block> : public _CG_STL_NAMESPACE::true_type {};
         ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 80: error: namespace "cooperative_groups" has no member "thread_block_tile"
  struct _async_copy_optimize_tile<cooperative_groups::thread_block_tile<1, TyPar>>
                                                       ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 80: error: expected a ">"
  struct _async_copy_optimize_tile<cooperative_groups::thread_block_tile<1, TyPar>>
                                                                        ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 80: error: expected a ";"
  struct _async_copy_optimize_tile<cooperative_groups::thread_block_tile<1, TyPar>>
                                                                                 ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 84: error: namespace "cooperative_groups" has no member "thread_block_tile"
  struct _async_copy_optimize_tile<cooperative_groups::thread_block_tile<Sz, TyPar>>
                                                       ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 84: error: expected a ">"
  struct _async_copy_optimize_tile<cooperative_groups::thread_block_tile<Sz, TyPar>>
                                                                        ^

"/home/dolsen/work/pgi/dev/nv/Linux_x86_64/mine/cuda/11.8/include/cooperative_groups/details/async.h", line 84: error: expected a ";"
  struct _async_copy_optimize_tile<cooperative_groups::thread_block_tile<Sz, TyPar>>
                                                                                  ^

12 errors detected in the compilation of "cg.cpp".

Expected behavior

It should be possible to use CUB with nvc++ -stdpar without errors.

Reproduction link

No response

Operating System

No response

nvidia-smi output

No response

NVCC version

No response

bernhardmgruber commented 1 month ago

Hi! I am sorry this causes a breakage for nvc++. I didn't know that cooperative_groups are not supported on nvc++. I hope we can detect such a breakage sooner, e.g. when nvc++ CI jobs land #1488.

Since I am leaving for parental leave very soon, the only quick solution I see is

  1. Change cub/cub.cuh to not include <cub/device/device_transform.cuh>.

and then figure out how we can proceed later.

bernhardmgruber commented 1 month ago

Discussed with @jrhemstad, who is going to follow-up on this for the short term.

bernhardmgruber commented 1 week ago

I discussed this briefly with @jrhemstad yesterday and we would like to fix cooperative groups in the long run (option 1). However, this may still take a while. In the meantime, once #2396 is merged, we can disable the ublkcp kernel that uses cooperative groups when compiling with nvc++ (option 3). The prefetch implementation should work with nvc++ and also deliver solid runtime improvements.

bernhardmgruber commented 1 week ago

I could reproduce and workaround the issue by disabling CG and the ublkcp kernel:

~/cccl $ cat cg.cpp 
#include <cub/cub.cuh>
int main() {}
~/cccl $ nvc++ -Icub -Ithrust -Ilibcudacxx/include --c++20 -stdpar cg.cpp
~/cccl $

That's the extent to which I could test CUB with nvc++.