Closed dkolsen-pgi closed 5 days ago
Hi! I am sorry this causes a breakage for nvc++. I didn't know that cooperative_groups is not supported by nvc++.
I hope we can detect such breakages sooner, e.g. once the nvc++ CI jobs land (#1488).
Since I am leaving for parental leave very soon, the only quick solution I see is to
- change cub/cub.cuh to not include <cub/device/device_transform.cuh>,
and then figure out how we can proceed later.
Discussed with @jrhemstad, who is going to follow up on this for the short term.
I discussed this briefly with @jrhemstad yesterday and we would like to fix cooperative groups in the long run (option 1). However, this may still take a while. In the meantime, once #2396 is merged, we can disable the ublkcp kernel that uses cooperative groups when compiling with nvc++ (option 3). The prefetch implementation should work with nvc++ and also deliver solid runtime improvements.
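The short-term fallback in option 3 can be pictured as a compile-time choice between the two kernels. The following is only a minimal sketch under stated assumptions: demo_algorithm, demo_cg_available and demo_select_algorithm are hypothetical names made up for illustration and are not CUB's actual dispatch code. The preprocessor condition simply mirrors the guard that wraps <cooperative_groups.h>.

```cpp
// Hypothetical names throughout; this is not CUB's real dispatch layer.
enum class demo_algorithm { prefetch, ublkcp };

// Mirrors the guard around <cooperative_groups.h>: the header only provides
// declarations when __CUDACC__ is defined, which is not the case for
// nvc++ -stdpar=gpu.
constexpr bool demo_cg_available() {
#if defined(__cplusplus) && defined(__CUDACC__)
  return true;
#else
  return false;
#endif
}

// Pick the cooperative-groups based kernel only when the header is usable,
// otherwise fall back to the prefetch implementation.
constexpr demo_algorithm demo_select_algorithm(bool cg_available) {
  return cg_available ? demo_algorithm::ublkcp : demo_algorithm::prefetch;
}
```

A caller would evaluate demo_select_algorithm(demo_cg_available()) once and instantiate only the chosen kernel, so that code naming cooperative_groups is never compiled when the header is a no-op.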
I could reproduce and work around the issue by disabling CG and the ublkcp kernel:
~/cccl $ cat cg.cpp
#include <cub/cub.cuh>
int main() {}
~/cccl $ nvc++ -Icub -Ithrust -Ilibcudacxx/include --c++20 -stdpar cg.cpp
~/cccl $
That's the extent to which I could test CUB with nvc++.
Is this a duplicate?
Type of Bug
Compile-time Error
Component
CUB
Describe the bug
PR #2086 breaks the stdexec example nvexec.launch when compiled with NVC++. Compilation fails with unhelpful errors such as
error: namespace "cooperative_groups" has no member "thread_block_tile"
@ericniebler
PR #2086 added two new files to the CUB headers. One of them, cub/device/dispatch/dispatch_transform.cuh, which is indirectly included from cub/cub.cuh, contains #include <cooperative_groups.h>. The header <cooperative_groups.h> is entirely wrapped by an #if defined(__cplusplus) && defined(__CUDACC__) block. When compiling with nvc++ -stdpar=gpu, the macro __CUDACC__ is not defined, so <cooperative_groups.h> is a no-op. Subsequent attempts to use stuff from the cooperative_groups namespace fail with undefined identifiers.
This doesn't break NVC++'s stdpar parallel algorithms yet because nothing in the parallel algorithm implementation includes cub/cub.cuh or cub/device/device_transform.cuh. But that will change if thrust::transform is changed to use the new CUB transform algorithms. I would like to get this fixed before that happens, when the impact of this bug is still small.
I don't know the correct way to fix this. Some possibilities are:
- Change <cooperative_groups.h> to work with nvc++ -stdpar. (CUB would still need to deal with the issue as long as a CUDA Toolkit without the cooperative groups change is still supported.)
- Change cub/cub.cuh to not include <cub/device/device_transform.cuh>. Any code that wants to use the new CUB transform algorithms needs to include <cub/device/device_transform.cuh> explicitly. (This then pushes the problem to Thrust, which would need to adopt option 2 or 3.)
All the options have tradeoffs, and I don't know how best to balance those tradeoffs.
How to Reproduce
Though first noticed by the stdexec example nvexec.launch, which includes <cub/cub.cuh>, it can be reproduced with a much smaller test, using NVC++ with the latest main branch of CCCL.
Expected behavior
It should be possible to use CUB with nvc++ -stdpar without errors.
Reproduction link
No response
Operating System
No response
nvidia-smi output
No response
NVCC version
No response