NVIDIA / cccl

CUDA Core Compute Libraries
Other
1.13k stars 132 forks source link

Launch failure potentially based on `cub::BlockReduce<double, TPB, cub::BlockReduceAlgorithm::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY> BlockReduce` #882

Open lkskstlr opened 2 years ago

lkskstlr commented 2 years ago

Dear Maintainers,

thank you for the awesome library, I really like it :)

I have a strange launch failure when using cub::BlockReduce<double, TPB, cub::BlockReduceAlgorithm::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY> BlockReduce together with CUDA Dynamic Parallelism (CDP). When I uncomment all cub code from the Kernel, the error does not appear.

The Kernel code is roughly

// some cu file

// Child kernel launched via CUDA Dynamic Parallelism (CDP) from cctInitKernel.
// Performs 7 successive block-wide reductions (one shared TempStorage, reused
// with a barrier between uses), then thread 0 atomically accumulates the 7
// block aggregates into in_out->outputs and prints them.
// NOTE(review): `private_output` is defined in the omitted section — presumably
// a per-thread double[7] of partial results; confirm against the full source.
template<typename Scalar, int TPB>
__global__ void calcResKernel(
    CudaCoarseTrackerDeviceData<Scalar> data,
    CCTcalcResInOut *in_out
) {
  // Raking, commutative-only block reduction over doubles; TPB must equal
  // the actual blockDim.x of the launch for the reduction to be correct.
  typedef cub::BlockReduce<double, TPB, cub::BlockReduceAlgorithm::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY> BlockReduce;
  __shared__ typename BlockReduce::TempStorage temp_storage;

  // a lot of code I cannot share

double aggregates[7];
  for (int idx = 0; idx < 7; idx++) {
    // Sum() returns the valid aggregate only on thread 0; other threads'
    // aggregates[idx] are unspecified — fine, since only thread 0 reads them.
    aggregates[idx] = BlockReduce(temp_storage).Sum(private_output[idx]);
    __syncthreads(); // Needed due to temp_storage reuse
  }

  if (threadIdx.x == 0) {
    for (int idx = 0; idx < 7; idx++) {
      // atomicAdd on double requires SM60+ hardware.
      const double old = atomicAdd(in_out->outputs + idx, aggregates[idx]);
      printf("idx = %2d, old=%f, agg=%f\n", idx, old, aggregates[idx]);
    }
  }
};

// Explicit instantiations for the block sizes used at runtime (separable
// compilation: the definition lives in this .cu file, so each TPB variant
// must be instantiated here). TPB = 512 is the failing configuration.
template __global__ void calcResKernel<float, 32>(CudaCoarseTrackerDeviceData<float> data, CCTcalcResInOut *in_out);
template __global__ void calcResKernel<float, 64>(CudaCoarseTrackerDeviceData<float> data, CCTcalcResInOut *in_out);
template __global__ void calcResKernel<float, 96>(CudaCoarseTrackerDeviceData<float> data, CCTcalcResInOut *in_out);
template __global__ void calcResKernel<float, 128>(CudaCoarseTrackerDeviceData<float> data, CCTcalcResInOut *in_out);
template __global__ void calcResKernel<float, 256>(CudaCoarseTrackerDeviceData<float> data, CCTcalcResInOut *in_out);
template __global__ void calcResKernel<float, 512>(CudaCoarseTrackerDeviceData<float> data, CCTcalcResInOut *in_out);

The caller is also a Kernel of the following structure

// Parent kernel: thread 0 of the (single) block launches calcResKernel via
// dynamic parallelism, checks the launch status, waits for the child, then
// post-processes its output. Launched as <<<1, 96>>> per the report below.
// NOTE(review): `nl`, `DIV_UP`, `cucheck_dev`, and `updateResInDataOPt` are
// defined in the omitted code — behavior inferred from names, not verified.
template<typename Scalar, int TPB_CALC_RES, int TPB_CALC_GS>
__global__ void cctInitKernel(CudaCoarseTrackerDeviceData<Scalar> data) {
  int bidx = blockIdx.x;
  int tidx = threadIdx.x;
  int i = bidx * blockDim.x + tidx;

// code i cannot share

    if (i == 0) {
      // code i cannot share
      printf("LAUNCH CONFIG: %d, %d\n", DIV_UP(nl, TPB_CALC_RES), TPB_CALC_RES);
      // CDP child launch; a launch-configuration error (e.g. the reported
      // "invalid configuration argument" at TPB_CALC_RES=512) surfaces in
      // the cudaGetLastError() on the next line, before the child runs.
      calcResKernel<Scalar, TPB_CALC_RES><<<DIV_UP(nl, TPB_CALC_RES), TPB_CALC_RES>>>(data, data.calcResInOut);
      cucheck_dev(cudaGetLastError());
      // Execution errors inside the child surface at this sync point.
      cucheck_dev(cudaDeviceSynchronize());
      updateResInDataOPt(data);
    }
    __syncthreads();

}

The outer Kernel is launched with only 1 block like

cctInitKernel<float, 512, 512><<<1, 96>>>(data);

For the following TPB_CALC_RES I get

128: Works
256: Works
512: invalid configuration argument in cudaGetLastError()

I am running on Ubuntu 18.04, Nvidia driver 455.23.05, CUDA 11.1 and an RTX 2080 super. I use separable compilation. Here is my cmake output:

-- The CXX compiler identification is GNU 7.5.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/lib/ccache/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Eigen headers in: /usr/local/include/eigen3
-- Found Boost: /usr/include (found version "1.65.1")  
-- The CUDA compiler identification is NVIDIA 11.1.74
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /usr/local/cuda/include (found version "11.1.74") 
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Release mode (!= not Debug). Adding -O3 to NVCC FLAGS
-- Configuring done
-- Generating done

Any help would be much appreciated :)

From the docs it is also not 100% clear to me if dynamic parallelism and block-wide directives are supported but I couldn't find any particular info on that.

Have a nice day, Lukas

gevtushenko commented 2 years ago

Hello, @lkskstlr! Thank you for your feedback. Unfortunately, your code snippet is insufficient to reproduce this error. I've extracted CUB related parts in the following code:

#include <thrust/device_vector.h>
#include <cub/block/block_reduce.cuh>
#include <iostream>

// Minimal CDP reproducer: performs ItemsPerThread successive block-wide sums
// over `data` (one shared TempStorage, reused with a barrier between uses),
// then thread 0 prints the TempStorage size and every aggregate.
// Preconditions: blockDim.x == ThreadsPerBlock and `data` holds at least
// ItemsPerThread * ThreadsPerBlock floats.
template<int ThreadsPerBlock,
         int ItemsPerThread>
__global__ void kernel(float *data) {
    typedef cub::BlockReduce<double, ThreadsPerBlock, cub::BlockReduceAlgorithm::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    double aggregates[ItemsPerThread];
    // Bound by ItemsPerThread (was hard-coded 7): keeps the loop in sync with
    // the aggregates[] size — the literal 7 reads/writes out of bounds for any
    // instantiation with ItemsPerThread < 7.
    for (int idx = 0; idx < ItemsPerThread; idx++) {
        // Sum() is valid only on thread 0; only thread 0 reads aggregates[].
        aggregates[idx] = BlockReduce(temp_storage).Sum(data[idx * ThreadsPerBlock + threadIdx.x]);
        __syncthreads(); // Needed due to temp_storage reuse
    }

    if (threadIdx.x == 0) {
        // `typename` required: TempStorage is a dependent type inside this
        // template (some front ends accept the omission, others reject it).
        printf("%d\n", (int) sizeof(typename BlockReduce::TempStorage));
        for (int idx = 0; idx < ItemsPerThread; idx++) {
            printf("agg=%f\n", aggregates[idx]);
        }
    }
}

// Launch configuration matching the failing case from the report:
// 7 reductions per block of 512 threads, one float per (item, thread) slot.
constexpr int items_per_thread = 7;
constexpr int threads_per_block = 512;
constexpr int elements = items_per_thread * threads_per_block;

// Parent kernel: thread 0 launches the child via dynamic parallelism and
// reports both the launch status and the child's execution status. The
// reported failure ("invalid configuration argument") would appear at the
// cudaGetLastError() check, before the child ever runs.
__global__ void launcher(float *data) {
    if (threadIdx.x == 0) {
        kernel<threads_per_block, items_per_thread><<<1, threads_per_block>>>(data);

        // Report the actual error string (the bare "CUDA Error!" message
        // discarded the cudaError_t this reproducer exists to diagnose).
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            printf("CUDA launch error: %s\n", cudaGetErrorString(err));
        }

        // Execution errors in the child surface at this synchronization point.
        err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            printf("CUDA sync error: %s\n", cudaGetErrorString(err));
        }
    }
    __syncthreads();
}

// Host driver: fills a device vector with 0,1,2,... and runs the parent
// kernel three times with the issue's launch configuration (<<<1, 96>>>).
int main(void) {
    thrust::device_vector<float> in(elements);
    thrust::sequence(in.begin(), in.end());

    launcher<<<1, 96>>>(thrust::raw_pointer_cast(in.data()));
    launcher<<<1, 96>>>(thrust::raw_pointer_cast(in.data()));
    launcher<<<1, 96>>>(thrust::raw_pointer_cast(in.data()));

    // Without this sync the process can exit before the asynchronous kernels
    // (and their buffered device-side printf output) complete; it also
    // surfaces any launch or execution error from the three launches.
    if (cudaDeviceSynchronize() != cudaSuccess) {
        std::cerr << "CUDA error: "
                  << cudaGetErrorString(cudaGetLastError()) << '\n';
        return 1;
    }

    return 0;
}

Please, let me know if it reproduces the described issue on your setup. It seems to work fine on mine. If the code above doesn't represent your case, please, feel free to update it here. If this code represents your case, I believe that the issue is in the code parts you've omitted.

alliepiper commented 2 years ago

@lkskstlr Any updates for this? See @senior-zero's questions above.

lkskstlr commented 2 years ago

@allisonvacanti sorry, I missed the answer on this issue. I will see if I can still reproduce the bug. Thanks for following up :)