As a follow-on to https://github.com/NVIDIA/spark-rapids/issues/6983, we ran the q95 query at 30TB with the fix from this PR (https://github.com/rapidsai/cudf/pull/12079) and ended up failing during a couple of the later joins: an inner join and a left semi join.
In both of those cases we are hitting instances of the overflowing strided loop issue in cuco's static_multimap::pair_count and static_map::insert (see the compute-sanitizer output below). It looks like cuDF could work around this by using int64_t as the type in its counting_transform_iterator (as I did in this proof-of-concept), but it is not clear whether that is the right solution. This issue is for our tracking; the fix will be in cuDF or cuCollections.
The only current workaround is to increase our shuffle partitions (for example, 400 partitions worked without issues).
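To make the failure mode concrete, here is a minimal sketch of an overflowing strided loop and of the 64-bit arithmetic the proof-of-concept switches to. This is an illustration only, not cuco's actual pair_count kernel; the kernel names and parameters below are made up:

```cuda
#include <cstdint>

// Hypothetical sketch of the overflowing strided loop (our names, not
// cuco's). On a grid as large as the one in the trace below, `stride` is
// already ~1.9e9, so the 32-bit increment can pass INT_MAX and wrap
// negative, and the next load reads far below the allocation (matching
// the "bytes before the nearest allocation" report).
__global__ void count_matches_32bit(int const* rows, int num_rows,
                                    unsigned long long* n_matched)
{
  int idx    = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;           // ~1.9e9 on this grid
  for (int i = idx; i < num_rows; i += stride) { // i += stride can overflow: UB
    if (rows[i] != 0) { atomicAdd(n_matched, 1ULL); }
  }
}

// Shape of the proof-of-concept workaround: do the same index arithmetic
// in 64 bits (analogous to templating cuDF's counting_transform_iterator
// on int64_t, so the iterator arithmetic inside cuco can no longer wrap).
__global__ void count_matches_64bit(int const* rows, std::int64_t num_rows,
                                    unsigned long long* n_matched)
{
  auto idx    = static_cast<std::int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  auto stride = static_cast<std::int64_t>(gridDim.x) * blockDim.x;
  for (auto i = idx; i < num_rows; i += stride) {
    if (rows[i] != 0) { atomicAdd(n_matched, 1ULL); }
  }
}
```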
Inner join:
========= Invalid __global__ read of size 4 bytes
========= at 0x500 in void cuco::detail::pair_count<(unsigned int)128, (unsigned int)2, (bool)0, thrust::transform_iterator<cudf::detail::make_pair_function<cudf::row_hasher<cudf::detail::default_hash, cudf::nullate::DYNAMIC>, int>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, cuda::__4::atomic<unsigned long, (cuda::std::__4::__detail::thread_scope)1>, cuco::static_multimap<unsigned int, int, (cuda::std::__4::__detail::thread_scope)1, rmm::mr::stream_allocator_adaptor<default_allocator<char>>, cuco::double_hashing<(unsigned int)2, cudf::detail::MurmurHash3_32<unsigned int>, cudf::detail::MurmurHash3_32<unsigned int>>>::device_view, cudf::detail::pair_equality<cudf::row_equality_comparator<cudf::nullate::DYNAMIC>>>(T4, T4, T5 *, T6, T7)
========= by thread (64,0,0) in block (14773391,0,0)
========= Address 0xbcd89fe80 is out of bounds
========= and is 1603745152 bytes before the nearest allocation at 0xc2d213400 of size 256 bytes
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x22da7a]
========= in /usr/lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame: [0x3deb04b]
========= in /tmp/cudf1750694697214535636.so
========= Host Frame: [0x3e28798]
========= in /tmp/cudf1750694697214535636.so
========= Host Frame:unsigned long cuco::static_multimap<unsigned int, int, (cuda::std::__4::__detail::thread_scope)1, rmm::mr::stream_allocator_adaptor<default_allocator<char> >, cuco::double_hashing<2u, cudf::detail::MurmurHash3_32<unsigned int>, cudf::detail::MurmurHash3_32<unsigned int> > >::pair_count<thrust::transform_iterator<cudf::detail::make_pair_function<cudf::row_hasher<cudf::detail::MurmurHash3_32, cudf::nullate::DYNAMIC>, int>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, cudf::detail::pair_equality<cudf::row_equality_comparator<cudf::nullate::DYNAMIC> > >(thrust::transform_iterator<cudf::detail::make_pair_function<cudf::row_hasher<cudf::detail::MurmurHash3_32, cudf::nullate::DYNAMIC>, int>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, thrust::transform_iterator<cudf::detail::make_pair_function<cudf::row_hasher<cudf::detail::MurmurHash3_32, cudf::nullate::DYNAMIC>, int>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, cudf::detail::pair_equality<cudf::row_equality_comparator<cudf::nullate::DYNAMIC> >, CUstream_st*) const [0x1e69a52]
========= in /tmp/cudf1750694697214535636.so
========= Host Frame:unsigned long cudf::detail::_GLOBAL__N__dbc92c90_12_hash_join_cu_cd66f71b::compute_join_output_size<cudf::detail::join_kind>(cudf::table_device_view, cudf::detail::_GLOBAL__N__dbc92c90_12_hash_join_cu_cd66f71b::compute_join_output_size<cudf::detail::join_kind>, cuco::static_multimap<unsigned int, int, cuda::std::__4::__detail::thread_scope, rmm::mr::stream_allocator_adaptor<default_allocator<char>>, cudf::table_device_view::double_hashing<unsigned int=2, cudf::detail::MurmurHash3_32<unsigned int>, cudf::detail::MurmurHash3_32>> const &, bool, cudf::null_equality, cuda::std::__4::__detail::thread_scope::cuda_stream_view) [0x1e69fcc]
========= in /tmp/cudf1750694697214535636.so
========= Host Frame:std::pair<std::unique_ptr<rmm::device_uvector<int>, std::default_delete<rmm::device_uvector>>, std::default_delete<rmm::device_uvector>> cudf::detail::_GLOBAL__N__dbc92c90_12_hash_join_cu_cd66f71b::probe_join_hash_table<cudf::detail::join_kind>(cudf::table_device_view, std::pair<std::unique_ptr<rmm::device_uvector<int>, std::default_delete<rmm::device_uvector>>, std::default_delete<rmm::device_uvector>>, cuco::static_multimap<unsigned int, int, cuda::std::__4::__detail::thread_scope, std::unique_ptr::mr::stream_allocator_adaptor<default_allocator<char>>, cudf::table_device_view::double_hashing<unsigned int=2, cudf::detail::MurmurHash3_32<unsigned int>, cudf::detail::MurmurHash3_32>> const &, bool, cudf::null_equality, std::optional<unsigned long>, std::unique_ptr::cuda_stream_view, cuda::std::__4::__detail::thread_scope::device_memory_resource*) [0x1e6f64f]
========= in /tmp/cudf1750694697214535636.so
========= Host Frame:std::pair<std::unique_ptr<rmm::device_uvector<int>, std::default_delete<rmm::device_uvector<int> > >, std::unique_ptr<rmm::device_uvector<int>, std::default_delete<rmm::device_uvector<int> > > > cudf::detail::hash_join<cudf::detail::MurmurHash3_32<unsigned int> >::probe_join_indices<(cudf::detail::join_kind)0>(cudf::table_view const&, std::optional<unsigned long>, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) const [0x1e6f7f2]
========= in /tmp/cudf1750694697214535636.so
========= Host Frame:std::pair<std::unique_ptr<rmm::device_uvector<int>, std::default_delete<rmm::device_uvector<int> > >, std::unique_ptr<rmm::device_uvector<int>, std::default_delete<rmm::device_uvector<int> > > > cudf::detail::hash_join<cudf::detail::MurmurHash3_32<unsigned int> >::compute_hash_join<(cudf::detail::join_kind)0>(cudf::table_view const&, std::optional<unsigned long>, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) const [0x1e6face]
========= in /tmp/cudf1750694697214535636.so
========= Host Frame:cudf::hash_join::inner_join(cudf::table_view const&, std::optional<unsigned long>, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) const [0x1e679e3]
========= in /tmp/cudf1750694697214535636.so
========= Host Frame:cudf::detail::inner_join(cudf::table_view const &, cudf::table_view const &, cudf::null_equality, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1e70633]
========= in /tmp/cudf1750694697214535636.so
========= Host Frame:cudf::inner_join(cudf::table_view const &, cudf::table_view const &, cudf::null_equality, rmm::mr::device_memory_resource*) [0x1e70c5c]
========= in /tmp/cudf1750694697214535636.so
========= Host Frame:Java_ai_rapids_cudf_Table_innerJoinGatherMaps [0x14ec5e3]
========= in /tmp/cudf1750694697214535636.so
========= Host Frame: [0x254ac96a7]
========= in
=========
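A quick check of the coordinates in the trace is consistent with 32-bit overflow (this arithmetic is ours, assuming the 128 threads per block shown in the kernel's template arguments): thread (64,0,0) of block (14773391,0,0) has linear id 14773391 x 128 + 64 = 1,890,994,112, and the grid-wide stride is at least 14,773,392 x 128 = 1,890,994,176. Adding one such stride to any 32-bit index above roughly 256 million exceeds INT_MAX (2,147,483,647) and wraps negative, which would land reads below the allocation, matching the address being reported before the nearest allocation rather than after it.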
Left semi: