NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

[QST] Atomic addition for cute::half_t #1847

Closed · jpli02 closed this issue 3 weeks ago

jpli02 commented 3 weeks ago

Hi,

I am trying to use a cute tensor to store the sum over different row blocks of a tensor. Different threads may write to the same location when performing the accumulated sum.

Given cute tensors A and B (both of cute::half_t type), I want to accumulate the sum into A. As the following code shows, I add B into A; how do I avoid the race condition?

    #pragma unroll
    for (int idx = 0; idx < size(A); ++idx) {
        A(idx) += B(idx);   // racy: other threads may be updating the same A(idx)
    }
    A.data() = A.data() + (-kBlockN);   // move the data pointer back by kBlockN elements

Is there an atomicAdd supported for cute::half_t? I cannot call atomicAdd directly, since it doesn't accept half_t.

Thank you so much!
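
For concreteness, here is a sketch of the kind of update I am after. It assumes cute::half_t is layout-compatible with CUDA's __half and leans on the atomicAdd(__half*, __half) overload (available on sm_70 and newer); I am not sure this is the intended approach, hence the question.

    // Sketch only: serialize concurrent writers to A(idx) with a hardware atomic.
    // Requires <cuda_fp16.h>; assumes cute::half_t and __half share the same
    // 16-bit representation. atomicAdd(__half*, __half) needs sm_70+.
    #pragma unroll
    for (int idx = 0; idx < size(A); ++idx) {
        atomicAdd(reinterpret_cast<__half*>(&A(idx)),
                  *reinterpret_cast<__half const*>(&B(idx)));
    }
    A.data() = A.data() + (-kBlockN);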

thakkarV commented 3 weeks ago

From https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-atom:

Atomic operation with vector type:

    atom{.sem}{.scope}{.global}.add{.level::cache_hint}.vec_32_bit.f32                  d, [a], b{, cache-policy};
    atom{.sem}{.scope}{.global}.op.noftz{.level::cache_hint}.vec_16_bit.half_word_type  d, [a], b{, cache-policy};
    atom{.sem}{.scope}{.global}.op.noftz{.level::cache_hint}.vec_32_bit.packed_type     d, [a], b{, cache-policy};

    .sem =               { .relaxed, .acquire, .release, .acq_rel };
    .scope =             { .cta, .cluster, .gpu, .sys };
    .op =                { .add, .min, .max };
    .half_word_type =    { .f16, .bf16 };
    .packed_type =       { .f16x2, .bf16x2 };
    .vec_16_bit =        { .v2, .v4, .v8 }
    .vec_32_bit =        { .v2, .v4 };
    .level::cache_hint = { .L2::cache_hint }
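
For reference, a minimal sketch of wrapping the packed .f16x2 form above from CUDA C++. The helper name is illustrative and not a cute/CUTLASS API; red is the non-returning counterpart of atom, and the .noftz fp16 add requires sm_70 or newer.

    #include <cuda_fp16.h>

    // Illustrative wrapper: atomically add a packed pair of fp16 values to a
    // 4-byte-aligned global address, discarding the previous value.
    __device__ void red_add_f16x2(__half2* gmem_ptr, __half2 val) {
        unsigned int v = *reinterpret_cast<unsigned int*>(&val);
        asm volatile("red.global.add.noftz.f16x2 [%0], %1;\n"
                     :
                     : "l"(gmem_ptr), "r"(v)
                     : "memory");
    }

Recent CUDA toolkits also expose atomicAdd(__half2*, __half2), and atomicAdd(__half*, __half) on sm_70+, which should lower to the same family of instructions; two adjacent cute::half_t elements can therefore be reinterpreted as a __half2 and accumulated without hand-written PTX.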