Compute Sanitizer is flagging a thrust sort operation as having a Invalid __global__ read of size 16 bytes when sorting the the tuple of thrust::tuple<int32_t, int32_t, uint64_t, int8_t>. This specific combination of data types appears to trigger the issue as removing an element or changing the int8_t to a int32_t causes a clean compute sanitizer run.
Compute Sanitizer Output
========= COMPUTE-SANITIZER
========= Invalid __global__ read of size 16 bytes
========= at 0x14c0 in /home/lpierce/scratch/projects/ccclTest/buile/_deps/cccl-src/cub/cub/detail/uninitialized_copy.cuh:62:void cub::CUB_200600_700_800_NS::detail::uninitialized_copy_single<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>, const thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char> &, (int)0>(T1 *, T2 &&)
========= by thread (161,0,0) in block (0,0,0)
========= Address 0x7f815c001e28 is misaligned
========= and is inside the nearest allocation at 0x7f815c000000 of size 24,000,000 bytes
========= Device Frame:/home/lpierce/scratch/projects/ccclTest/buile/_deps/cccl-src/cub/cub/block/block_exchange.cuh:543:void cub::CUB_200600_700_800_NS::BlockExchange<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>, (int)256, (int)2, (bool)0, (int)1, (int)1, (int)0>::WarpStripedToBlocked<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>>(const thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char> (&)[2], T1 (&)[2], cub::CUB_200600_700_800_NS::Int2Type<(int)0>) [0x14c0]
========= Device Frame:/home/lpierce/scratch/projects/ccclTest/buile/_deps/cccl-src/cub/cub/block/block_exchange.cuh:993:void cub::CUB_200600_700_800_NS::BlockExchange<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>, (int)256, (int)2, (bool)0, (int)1, (int)1, (int)0>::WarpStripedToBlocked<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>>(const thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char> (&)[2], T1 (&)[2]) [0x14c0]
========= Device Frame:/home/lpierce/scratch/projects/ccclTest/buile/_deps/cccl-src/cub/cub/block/block_load.cuh:980:void cub::CUB_200600_700_800_NS::BlockLoad<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>, (int)256, (int)2, (cub::CUB_200600_700_800_NS::BlockLoadAlgorithm)4, (int)1, (int)1, (int)0>::LoadInternal<(cub::CUB_200600_700_800_NS::BlockLoadAlgorithm)4, (int)0>::Load<cub::CUB_200600_700_800_NS::CacheModifiedInputIterator<(cub::CUB_200600_700_800_NS::CacheLoadModifier)0, thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>, long>>(T1, thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char> (&)[2]) [0x14c0]
========= Device Frame:/home/lpierce/scratch/projects/ccclTest/buile/_deps/cccl-src/cub/cub/block/block_load.cuh:1122:void cub::CUB_200600_700_800_NS::BlockLoad<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>, (int)256, (int)2, (cub::CUB_200600_700_800_NS::BlockLoadAlgorithm)4, (int)1, (int)1, (int)0>::Load<cub::CUB_200600_700_800_NS::CacheModifiedInputIterator<(cub::CUB_200600_700_800_NS::CacheLoadModifier)0, thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>, long>>(T1, thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char> (&)[2]) [0x13f0]
========= Device Frame:/home/lpierce/scratch/projects/ccclTest/buile/_deps/cccl-src/cub/cub/agent/agent_merge_sort.cuh:197:void cub::CUB_200600_700_800_NS::AgentBlockSort<cub::CUB_200600_700_800_NS::AgentMergeSortPolicy<(int)256, (int)2, (cub::CUB_200600_700_800_NS::BlockLoadAlgorithm)4, (cub::CUB_200600_700_800_NS::CacheLoadModifier)0, (cub::CUB_200600_700_800_NS::BlockStoreAlgorithm)4>, thrust::THRUST_200600_700_800_NS::detail::normal_iterator<thrust::THRUST_200600_700_800_NS::device_ptr<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>>>, cub::CUB_200600_700_800_NS::NullType *, thrust::THRUST_200600_700_800_NS::detail::normal_iterator<thrust::THRUST_200600_700_800_NS::device_ptr<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>>>, cub::CUB_200600_700_800_NS::NullType *, long, SortPred, thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>, cub::CUB_200600_700_800_NS::NullType>::consume_tile<(bool)0>(long, int) [0x13f0]
========= Device Frame:/home/lpierce/scratch/projects/ccclTest/buile/_deps/cccl-src/cub/cub/agent/agent_merge_sort.cuh:163:cub::CUB_200600_700_800_NS::AgentBlockSort<cub::CUB_200600_700_800_NS::AgentMergeSortPolicy<(int)256, (int)2, (cub::CUB_200600_700_800_NS::BlockLoadAlgorithm)4, (cub::CUB_200600_700_800_NS::CacheLoadModifier)0, (cub::CUB_200600_700_800_NS::BlockStoreAlgorithm)4>, thrust::THRUST_200600_700_800_NS::detail::normal_iterator<thrust::THRUST_200600_700_800_NS::device_ptr<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>>>, cub::CUB_200600_700_800_NS::NullType *, thrust::THRUST_200600_700_800_NS::detail::normal_iterator<thrust::THRUST_200600_700_800_NS::device_ptr<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>>>, cub::CUB_200600_700_800_NS::NullType *, long, SortPred, thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>, cub::CUB_200600_700_800_NS::NullType>::Process() [0xc0]
========= Device Frame:/home/lpierce/scratch/projects/ccclTest/buile/_deps/cccl-src/cub/cub/device/dispatch/dispatch_merge_sort.cuh:209:void cub::CUB_200600_700_800_NS::DeviceMergeSortBlockSortKernel<cub::CUB_200600_700_800_NS::DeviceMergeSortPolicy<thrust::THRUST_200600_700_800_NS::detail::normal_iterator<thrust::THRUST_200600_700_800_NS::device_ptr<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>>>>::Policy600, thrust::THRUST_200600_700_800_NS::detail::normal_iterator<thrust::THRUST_200600_700_800_NS::device_ptr<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>>>, cub::CUB_200600_700_800_NS::NullType *, thrust::THRUST_200600_700_800_NS::detail::normal_iterator<thrust::THRUST_200600_700_800_NS::device_ptr<thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>>>, cub::CUB_200600_700_800_NS::NullType *, long, SortPred, thrust::THRUST_200600_700_800_NS::tuple<int, int, unsigned long, signed char>, cub::CUB_200600_700_800_NS::NullType>(bool, T2, T3, T4, T5, T6, T8 *, T9 *, T7, cub::CUB_200600_700_800_NS::detail::vsmem_t) [0x10]
========= Saved host backtrace up to driver entry point at kernel launch time
How to Reproduce
Source
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
using ComplexTuple = thrust::tuple<int32_t, int32_t, uint64_t, int8_t>;
struct SortPred {
__device__ __host__ __forceinline__
bool operator()(const ComplexTuple& a, const ComplexTuple& b){
return thrust::get<0>(a) < thrust::get<0>(b);
}
};
int main() {
constexpr uint32_t N = 1000000;
thrust::host_vector<ComplexTuple> elements;
elements.reserve(N);
for(uint32_t i = 0; i < N; i++) {
ComplexTuple px{
N - i, i, N - i, static_cast<int8_t>(i % 2)
};
elements.push_back(px);
}
thrust::device_vector<ComplexTuple> d_elements(elements);
thrust::sort(d_elements.begin(), d_elements.end(), SortPred{});
std::cout << "Done.\n";
return 0;
}
CMake:
cmake_minimum_required(VERSION 3.10)
set(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc)
set(CMAKE_CUDA_ARCHITECTURES 70 80)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)
include(cmake/CPM.cmake)
CPMAddPackage(
NAME CCCL
GITHUB_REPOSITORY nvidia/cccl
GIT_TAG main
)
project(MyCudaProject LANGUAGES CXX CUDA)
# Find CUDA
find_package(CUDA REQUIRED)
include(cmake/Modules/CULITHO_ConfigureCUDA.cmake)
include(cmake/thirdparty/CULITHO_GetLibcudacxx.cmake)
# Set CUDA architecture (optional, adjust according to your GPU)
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_80,code=sm_80)
# Add your CUDA files
set(CUDA_SOURCE_FILES
main.cu
)
# Add executable
add_executable(testExec ${CUDA_SOURCE_FILES})
target_link_libraries(testExec PRIVATE
CCCL::CCCL
)
target_include_directories(testExec SYSTEM PRIVATE ${CUDA_INCLUDE_DIRS})
To reproduce:
/usr/local/cuda-12.2/bin/compute-sanitizer --show-backtrace yes --tool memcheck --leak-check full ./testExec
Expected behavior
Compute Sanitizer report to be clean.
Reproduction link
No response
Operating System
Ubuntu 20.04.6
nvidia-smi output
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:C1:00.0 Off | 0 |
| N/A 32C P0 62W / 300W | 576MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1864 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 3386859 C .../projects/ccclTest/buile/./testExec 548MiB |
+---------------------------------------------------------------------------------------+
NVCC version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0
This might be a compute sanitizer bug. Building with the same cuda version and running with the compute-sanitizer packaged in cuda-12.5 also generates a clean report.
Is this a duplicate?
Type of Bug
Silent Failure
Component
Thrust
Describe the bug
Compute Sanitizer is flagging a thrust sort operation as having a
Invalid __global__ read of size 16 bytes
when sorting the the tuple ofthrust::tuple<int32_t, int32_t, uint64_t, int8_t>
. This specific combination of data types appears to trigger the issue as removing an element or changing the int8_t to a int32_t causes a clean compute sanitizer run.Compute Sanitizer Output
How to Reproduce
Source
CMake:
To reproduce:
Expected behavior
Compute Sanitizer report to be clean.
Reproduction link
No response
Operating System
Ubuntu 20.04.6
nvidia-smi output
NVCC version