Open lkskstlr opened 2 years ago
Hello, @lkskstlr! Thank you for your feedback. Unfortunately, your code snippet is insufficient to reproduce this error. I've extracted the CUB-related parts into the following code:
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <cub/block/block_reduce.cuh>
#include <cstdio>
#include <iostream>

template<int ThreadsPerBlock,
         int ItemsPerThread>
__global__ void kernel(float *data) {
  typedef cub::BlockReduce<double, ThreadsPerBlock, cub::BlockReduceAlgorithm::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY> BlockReduce;
  __shared__ typename BlockReduce::TempStorage temp_storage;

  double aggregates[ItemsPerThread];

  // One block-wide sum per chunk of ThreadsPerBlock elements.
  // The returned aggregate is only valid in thread 0.
  for (int idx = 0; idx < ItemsPerThread; idx++) {
    aggregates[idx] = BlockReduce(temp_storage).Sum(data[idx * ThreadsPerBlock + threadIdx.x]);
    __syncthreads(); // Needed due to temp_storage reuse
  }

  if (threadIdx.x == 0) {
    printf("%d\n", (int) sizeof(typename BlockReduce::TempStorage));
    for (int idx = 0; idx < ItemsPerThread; idx++) {
      printf("agg=%f\n", aggregates[idx]);
    }
  }
}

constexpr int items_per_thread = 7;
constexpr int threads_per_block = 512;
constexpr int elements = items_per_thread * threads_per_block;

// Parent kernel: a single thread launches the child kernel (CDP)
__global__ void launcher(float *data) {
  if (threadIdx.x == 0) {
    kernel<threads_per_block, items_per_thread><<<1, threads_per_block>>>(data);
    if (cudaGetLastError() != cudaSuccess) {
      printf("CUDA Error!");
    }
    if (cudaDeviceSynchronize() != cudaSuccess) {
      printf("CUDA Error!");
    }
  }
  __syncthreads();
}

int main(void) {
  thrust::device_vector<float> in(elements);
  thrust::sequence(in.begin(), in.end());

  launcher<<<1, 96>>>(thrust::raw_pointer_cast(in.data()));
  launcher<<<1, 96>>>(thrust::raw_pointer_cast(in.data()));
  launcher<<<1, 96>>>(thrust::raw_pointer_cast(in.data()));

  // Wait for all kernels to finish and flush the device-side printf buffer
  cudaDeviceSynchronize();
}
Please let me know whether it reproduces the described issue on your setup. It seems to work fine on mine. If the code above doesn't represent your case, please feel free to update it here. If it does represent your case, I believe the issue is in the parts of the code you've omitted.
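For anyone running the repro, the `agg=` lines can be checked against a host-side reference. The following is only a sketch, assuming the input is filled by `thrust::sequence` as above (so `data[i] = i`) and that chunk `idx` covers elements `idx * threads_per_block` through `(idx + 1) * threads_per_block - 1`. `expected_aggregates` is a hypothetical helper name, not part of CUB or Thrust.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side helper (not part of CUB or Thrust).
// Assumes data[i] == i, which is what thrust::sequence produces in the repro.
// Returns the sum each BlockReduce call should yield in thread 0.
std::vector<double> expected_aggregates(int items_per_thread, int threads_per_block) {
    assert(items_per_thread > 0 && threads_per_block > 0);
    std::vector<double> sums(items_per_thread, 0.0);
    for (int idx = 0; idx < items_per_thread; ++idx) {
        for (int t = 0; t < threads_per_block; ++t) {
            // Same indexing as the kernel: data[idx * ThreadsPerBlock + threadIdx.x]
            sums[idx] += static_cast<double>(idx * threads_per_block + t);
        }
    }
    return sums;
}
```

With `items_per_thread = 7` and `threads_per_block = 512`, the first expected value is 130816 and each following value is larger by 512 * 512 = 262144, so a correct run should end with 1703680 for the last chunk.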
@lkskstlr Any updates for this? See @senior-zero's questions above.
@allisonvacanti sorry, I missed the answer on this issue. I will see if I can still reproduce the bug. Thanks for following up :)
Dear Maintainers,
thank you for the awesome library, I really like it :)
I have a strange launch failure when using
cub::BlockReduce<double, TPB, cub::BlockReduceAlgorithm::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY> BlockReduce
together with CUDA Dynamic Parallelism (CDP). When I comment out all cub code in the kernel, the error does not appear.
The kernel code is roughly:
The caller is also a kernel, with the following structure:
The outer kernel is launched with only 1 block, like:
For the following TPB_CALC_RES I get:
I am running on Ubuntu 18.04, Nvidia driver 455.23.05, CUDA 11.1 and an RTX 2080 Super. I use separable compilation. Here is my cmake output:
Any help would be much appreciated :)
From the docs it is also not 100% clear to me if dynamic parallelism and block-wide directives are supported but I couldn't find any particular info on that.
Have a nice day, Lukas