[BUG]: Invalid __global__ read when calling thurst sort

Is this a duplicate?

[X] I confirmed there appear to be no duplicate issues for this bug and that I agree to the Code of Conduct

Type of Bug

Silent Failure

Component

Thrust

Describe the bug

Compute Sanitizer is flagging a thrust sort operation as having a Invalid __global__ read of size 16 bytes when sorting the the tuple of thrust::tuple<int32_t, int32_t, uint64_t, int8_t>. This specific combination of data types appears to trigger the issue as removing an element or changing the int8_t to a int32_t causes a clean compute sanitizer run.

Compute Sanitizer Output

========= COMPUTE-SANITIZER
========= Invalid __global__ read of size 16 bytes
=========     at 0x14c0 in /home/lpierce/scratch/projects/ccclTest/buile/_deps/cccl-src/cub/cub/detail/uninitialized_copy.cuh:62:void cub::CUB_200600_700_800_NS::detail::uninitialized_copy_single<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>, const thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char> &, (int)0>(T1 *, T2 &&)
=========     by thread (161,0,0) in block (0,0,0)
=========     Address 0x7f815c001e28 is misaligned
=========     and is inside the nearest allocation at 0x7f815c000000 of size 24,000,000 bytes
=========     Device Frame:/home/lpierce/scratch/projects/ccclTest/buile/_deps/cccl-src/cub/cub/block/block_exchange.cuh:543:void cub::CUB_200600_700_800_NS::BlockExchange<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>, (int)256, (int)2, (bool)0, (int)1, (int)1, (int)0>::WarpStripedToBlocked<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>>(const thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char> (&)[2], T1 (&)[2], cub::CUB_200600_700_800_NS::Int2Type<(int)0>) [0x14c0]
=========     Device Frame:/home/lpierce/scratch/projects/ccclTest/buile/_deps/cccl-src/cub/cub/block/block_exchange.cuh:993:void cub::CUB_200600_700_800_NS::BlockExchange<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>, (int)256, (int)2, (bool)0, (int)1, (int)1, (int)0>::WarpStripedToBlocked<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>>(const thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char> (&)[2], T1 (&)[2]) [0x14c0]
=========     Device Frame:/home/lpierce/scratch/projects/ccclTest/buile/_deps/cccl-src/cub/cub/block/block_load.cuh:980:void cub::CUB_200600_700_800_NS::BlockLoad<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>, (int)256, (int)2, (cub::CUB_200600_700_800_NS::BlockLoadAlgorithm)4, (int)1, (int)1, (int)0>::LoadInternal<(cub::CUB_200600_700_800_NS::BlockLoadAlgorithm)4, (int)0>::Load<cub::CUB_200600_700_800_NS::CacheModifiedInputIterator<(cub::CUB_200600_700_800_NS::CacheLoadModifier)0, thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>, long>>(T1, thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char> (&)[2]) [0x14c0]
=========     Device Frame:/home/lpierce/scratch/projects/ccclTest/buile/_deps/cccl-src/cub/cub/block/block_load.cuh:1122:void cub::CUB_200600_700_800_NS::BlockLoad<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>, (int)256, (int)2, (cub::CUB_200600_700_800_NS::BlockLoadAlgorithm)4, (int)1, (int)1, (int)0>::Load<cub::CUB_200600_700_800_NS::CacheModifiedInputIterator<(cub::CUB_200600_700_800_NS::CacheLoadModifier)0, thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>, long>>(T1, thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char> (&)[2]) [0x13f0]
=========     Device Frame:/home/lpierce/scratch/projects/ccclTest/buile/_deps/cccl-src/cub/cub/agent/agent_merge_sort.cuh:197:void cub::CUB_200600_700_800_NS::AgentBlockSort<cub::CUB_200600_700_800_NS::AgentMergeSortPolicy<(int)256, (int)2, (cub::CUB_200600_700_800_NS::BlockLoadAlgorithm)4, (cub::CUB_200600_700_800_NS::CacheLoadModifier)0, (cub::CUB_200600_700_800_NS::BlockStoreAlgorithm)4>, thrust::THRUST_200600_700_800_NS::detail::normal_iterator<thrust::THRUST_200600_700_800_NS::device_ptr<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>>>, cub::CUB_200600_700_800_NS::NullType *, thrust::THRUST_200600_700_800_NS::detail::normal_iterator<thrust::THRUST_200600_700_800_NS::device_ptr<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>>>, cub::CUB_200600_700_800_NS::NullType *, long, SortPred, thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>, cub::CUB_200600_700_800_NS::NullType>::consume_tile<(bool)0>(long, int) [0x13f0]
=========     Device Frame:/home/lpierce/scratch/projects/ccclTest/buile/_deps/cccl-src/cub/cub/agent/agent_merge_sort.cuh:163:cub::CUB_200600_700_800_NS::AgentBlockSort<cub::CUB_200600_700_800_NS::AgentMergeSortPolicy<(int)256, (int)2, (cub::CUB_200600_700_800_NS::BlockLoadAlgorithm)4, (cub::CUB_200600_700_800_NS::CacheLoadModifier)0, (cub::CUB_200600_700_800_NS::BlockStoreAlgorithm)4>, thrust::THRUST_200600_700_800_NS::detail::normal_iterator<thrust::THRUST_200600_700_800_NS::device_ptr<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>>>, cub::CUB_200600_700_800_NS::NullType *, thrust::THRUST_200600_700_800_NS::detail::normal_iterator<thrust::THRUST_200600_700_800_NS::device_ptr<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>>>, cub::CUB_200600_700_800_NS::NullType *, long, SortPred, thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>, cub::CUB_200600_700_800_NS::NullType>::Process() [0xc0]
=========     Device Frame:/home/lpierce/scratch/projects/ccclTest/buile/_deps/cccl-src/cub/cub/device/dispatch/dispatch_merge_sort.cuh:209:void cub::CUB_200600_700_800_NS::DeviceMergeSortBlockSortKernel<cub::CUB_200600_700_800_NS::DeviceMergeSortPolicy<thrust::THRUST_200600_700_800_NS::detail::normal_iterator<thrust::THRUST_200600_700_800_NS::device_ptr<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>>>>::Policy600, thrust::THRUST_200600_700_800_NS::detail::normal_iterator<thrust::THRUST_200600_700_800_NS::device_ptr<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>>>, cub::CUB_200600_700_800_NS::NullType *, thrust::THRUST_200600_700_800_NS::detail::normal_iterator<thrust::THRUST_200600_700_800_NS::device_ptr<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>>>, cub::CUB_200600_700_800_NS::NullType *, long, SortPred, thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>, cub::CUB_200600_700_800_NS::NullType>(bool, T2, T3, T4, T5, T6, T8 *, T9 *, T7, cub::CUB_200600_700_800_NS::detail::vsmem_t) [0x10]
=========     Saved host backtrace up to driver entry point at kernel launch time

How to Reproduce

Source

#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

using ComplexTuple = thrust::tuple<int32_t, int32_t, uint64_t, int8_t>;
struct SortPred {
    __device__ __host__ __forceinline__
    bool operator()(const ComplexTuple& a, const ComplexTuple& b){
        return thrust::get<0>(a) < thrust::get<0>(b);
    }
};

int main() {
    constexpr uint32_t N = 1000000;
    thrust::host_vector<ComplexTuple> elements;
    elements.reserve(N);

    for(uint32_t i = 0; i < N; i++) {
        ComplexTuple px{
                N - i, i, N - i, static_cast<int8_t>(i % 2)
        };
        elements.push_back(px);
    }

    thrust::device_vector<ComplexTuple> d_elements(elements);
    thrust::sort(d_elements.begin(), d_elements.end(), SortPred{});
    std::cout << "Done.\n";
    return 0;
}

CMake:

cmake_minimum_required(VERSION 3.10)

set(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc)
set(CMAKE_CUDA_ARCHITECTURES 70 80)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)

include(cmake/CPM.cmake)

CPMAddPackage(
    NAME CCCL
    GITHUB_REPOSITORY nvidia/cccl
    GIT_TAG main
)

project(MyCudaProject LANGUAGES CXX CUDA)

# Find CUDA
find_package(CUDA REQUIRED)

include(cmake/Modules/CULITHO_ConfigureCUDA.cmake)
include(cmake/thirdparty/CULITHO_GetLibcudacxx.cmake)

# Set CUDA architecture (optional, adjust according to your GPU)
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_80,code=sm_80)

# Add your CUDA files
set(CUDA_SOURCE_FILES
    main.cu
)

# Add executable
add_executable(testExec ${CUDA_SOURCE_FILES})
target_link_libraries(testExec PRIVATE 
    CCCL::CCCL
)
target_include_directories(testExec SYSTEM PRIVATE ${CUDA_INCLUDE_DIRS})

To reproduce:

 /usr/local/cuda-12.2/bin/compute-sanitizer --show-backtrace yes --tool memcheck --leak-check full ./testExec

Expected behavior

Compute Sanitizer report to be clean.

Reproduction link

No response

Operating System

Ubuntu 20.04.6

nvidia-smi output

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:C1:00.0 Off |                    0 |
| N/A   32C    P0              62W / 300W |    576MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1864      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A   3386859      C   .../projects/ccclTest/buile/./testExec      548MiB |
+---------------------------------------------------------------------------------------+

NVCC version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

NVIDIA / cccl

[BUG]: Invalid global read when calling thurst sort #2224

Is this a duplicate?

Type of Bug

Component

Describe the bug

How to Reproduce

Expected behavior

Reproduction link

Operating System

nvidia-smi output

NVCC version

NVIDIA / cccl

[BUG]: Invalid __global__ read when calling thurst sort #2224

Is this a duplicate?

Type of Bug

Component

Describe the bug

How to Reproduce

Expected behavior

Reproduction link

Operating System

nvidia-smi output

NVCC version

[BUG]: Invalid global read when calling thurst sort #2224