[ENHANCEMENT]: Enable `packed_cas` codepath using 16B CAS on sm_90+ architectures

sleeepyjack commented 1 month ago

The `packed_cas` update routine performs better than `back_to_back_cas` and `cas_dependent_write`.

On sm_90 and higher, we have hardware support for 16B atomic CAS, which we currently don't use.

16B atomicCAS was introduced with CUDA 12.3 (see docs).

Idea: Add a dedicated codepath for sm_90+ by adding something like

NV_IF_TARGET(some_target_that_means_sm_90_or_higher,
             (atomicCAS(...);), // 16B CAS
             (/* pre-sm_90 code path */));
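A minimal sketch of what such a dispatch could look like, assuming `NV_PROVIDES_SM_90` from `<nv/target>` is a suitable target and that the toolkit ships the 16B `atomicCAS` template overload. The `slot16` type and the `try_packed_cas` helper are hypothetical names made up for illustration, not cuCollections API, and the fallback branch is only a stand-in for the existing pre-sm_90 code paths:

```cuda
#include <nv/target>

// Hypothetical 16-byte slot: an 8B key packed next to an 8B value.
// The 16B atomicCAS overload needs sizeof(T) == 16, alignof(T) >= 16,
// and a trivially copyable T.
struct alignas(16) slot16 {
  unsigned long long key;
  unsigned long long value;
};

static_assert(sizeof(slot16) == 16, "slot must be exactly 16 bytes");

// Tries to swap `expected` for `desired` in `*slot`.
// Returns true if the slot was updated.
__device__ bool try_packed_cas(slot16* slot, slot16 expected, slot16 desired)
{
  NV_IF_ELSE_TARGET(
    NV_PROVIDES_SM_90,
    (// sm_90+: key and value are compared and swapped in one 16B CAS.
     slot16 const old = atomicCAS(slot, expected, desired);
     return old.key == expected.key && old.value == expected.value;),
    (// pre-sm_90 stand-in (assumption): 8B CAS on the key, then a
     // dependent write of the value. Unlike the branch above, this
     // does not compare `expected.value` and is not a single atomic step.
     unsigned long long const old_key =
       atomicCAS(&slot->key, expected.key, desired.key);
     if (old_key != expected.key) { return false; }
     atomicExch(&slot->value, desired.value);
     return true;));
}
```

Under nvcc, only the branch matching the architecture being compiled is kept, so the 16B overload is never instantiated for pre-sm_90 targets.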

Convince CCCL to expose cuda::atomic_ref::compare_exchange_* for 16B types ;)


PointKernel commented 1 month ago

> Convince CCCL to expose cuda::atomic_ref::compare_exchange_* for 16B types

+1

sleeepyjack commented 1 month ago

> Convince CCCL to expose cuda::atomic_ref::compare_exchange_* for 16B types
