Closed al42and closed 2 years ago
@al42and there are restrictions on when global_atomic_add_f32 can be used, so the compiler can't generate it by default. You can either get it, as you indicate, via a call to atomicAddNoRet, or by adding -munsafe-fp-atomics to the compiler options.
Thanks for the reply, @b-sumner!
So, atomicAddNoRet can be relied upon and is not going away in the near future, despite being deprecated?
@al42and correct, it is not going away in the near future.
@b-sumner, Thank you for confirming!
One more question, if allowed by NDA: should atomicAddNoRet be used on all hardware, or would you recommend using it only for gfx908, which has limited hardware atomics? Does it offer any benefits on gfx90a over plain atomicAdd? For gfx906, both seem to get compiled to a CAS loop, so there is no difference there.
@al42and I'd suggest using atomicAddNoRet() only on gfx908. On gfx90a only, unsafeAtomicAdd() can be used instead (and it supports a return value). There is a double overload in addition to the float overload. But I still encourage -munsafe-fp-atomics so you can use the standard atomicAdd().
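The per-architecture advice above could be wrapped in a single helper, roughly as sketched below. This is only an illustration under assumptions: the helper name `atomicAddIgnoreResult` is hypothetical, the `__gfx908__` and `__gfx90a__` macros are assumed to be predefined by the compiler for those targets, and `unsafeAtomicAdd()` is assumed to be available in the installed ROCm version. The host-only branch is a non-atomic stand-in so that the sketch also compiles without ROCm.

```cpp
#if defined(__HIPCC__)
#include <hip/hip_runtime.h>
#else
// Assumption for illustration: allow the sketch to compile without ROCm.
#define __device__
#endif

// Hypothetical helper (not part of HIP): float atomic add whose result is discarded.
__device__ inline void atomicAddIgnoreResult(float* addr, float value)
{
#if defined(__HIP_DEVICE_COMPILE__) && defined(__gfx908__)
    // Deprecated, but reported in this thread to emit global_atomic_add_f32 on gfx908.
    atomicAddNoRet(addr, value);
#elif defined(__HIP_DEVICE_COMPILE__) && defined(__gfx90a__)
    // Suggested for gfx90a only; the return value is deliberately dropped here.
    unsafeAtomicAdd(addr, value);
#elif defined(__HIPCC__)
    // Other targets (e.g. gfx906): standard atomicAdd, compiled to a CAS loop.
    atomicAdd(addr, value);
#else
    // Host stand-in, NOT atomic; present only so the sketch builds without HIP.
    *addr += value;
#endif
}
```

Keeping the choice in one place means call sites never see the return value, so a later switch to -munsafe-fp-atomics with plain atomicAdd() would not change their behavior.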
But I still encourage -munsafe-fp-atomics so you can use the standard atomicAdd().
The problem with this solution for me is that I'm working on a pretty large codebase. Currently, this option can be enabled just fine because the return value from atomicAdd is not used anywhere.
But introducing an option that changes the behavior of a common function in a major way (a different return value), globally but only on certain hardware (MI100), is very dangerous long-term. We might introduce a new kernel or add a library that relies on the standard-compliant behavior of atomicAdd, and then have a hard time finding the bug.
Note that if the return value is used, then the MI100 no-return atomic add instruction won't be generated with -munsafe-fp-atomics.
Note that if the return value is used, then the MI100 no-return atomic add instruction won't be generated with -munsafe-fp-atomics.
Oh, that's great news!
My questions are answered, but I think it would be good if the topic of FP atomic support were covered in more detail in the docs. I saw some scattered mentions that they are not supported on AMD hardware, but any deeper info (e.g., that the "noret" version exists and that -munsafe-fp-atomics can improve things dramatically) is hard to discover.
Thanks, I'll pass this along.
there are restrictions on when global_atomic_add_f32 can be used, so the compiler can't generate it by default.
@b-sumner can you please help clarify things a bit further? The atomics support and intrinsics are unfortunately frustratingly undocumented by AMD.
Do I understand correctly that atomicAddNoRet() is not just "noret" but also has non-IEEE 754-compliant behavior? If so, it would probably be best if it were called unsafeAtomicAddNoRet(), as it differs from atomicAdd not only in that it does not return a value.
In addition, you suggest that unsafeAtomicAdd() does not work on gfx908. Would it not make more sense to allow unsafeAtomicAdd() on all uarchs and let the compiler infer whether the return value is used (and, if it is not, emit __ockl_atomic_add_noret_f32)?
@pszi1ard, the documentation issue is known and steps are being taken to improve it.
Regarding IEEE 754 compliance, note that C++20 states that the floating-point environment for atomic arithmetic operations on floating-point types may be different from the calling thread’s floating-point environment. You should already be aware that GPU hardware floating-point atomics frequently flush subnormal values to 0 and may have other differences.
But the main issue here is that, on the devices that support them, non-shared-memory atomic floating-point adds are implemented in the device L2 cache, and if the pointed-to memory is not cacheable, the add may have no effect. The compiler has no control over where the pointer is pointing, so it is up to the developer to assert that they accept this behavior, either by using atomicAddNoRet (gfx908) or unsafeAtomicAdd (gfx90a), or by compiling with -munsafe-fp-atomics.
Hello!
I would like to inquire about the state of the atomicAddNoRet function. It gives our code (GROMACS) a 2x speed-up in one of the kernels when running on MI100 (gfx908), compared to a plain atomicAdd (which gets compiled into a CAS loop). So, I would really like to keep using the noret version, since the return value is ignored anyway.
However, atomicAddNoRet is marked as deprecated, and a plain atomicAdd is suggested instead (with no indication of possible performance degradation, by the way!). Could you please advise on what function should be used? I also considered using the __ockl_atomic_add_noret_f32 intrinsic directly, but it's also not documented.
We are using ROCm 4.5.2 and hipSYCL for our code. However, the problem is easily demonstrated with plain HIP (ROCm 4.5.2 and 5.0.0 tested):
Examining the test-hip-amdgcn-amd-amdhsa-gfx908.s file, we see that _Z15atomicAddKernelPf contains a loop of global_atomic_cmpswap, while _Z20atomicAddNoRetKernelPf only has one nice little global_atomic_add_f32 call.
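For readers unfamiliar with the term, the compare-and-swap (CAS) loop mentioned above can be illustrated on the host in standard C++. This is only an analogy, not the code the HIP compiler generates: the function name `casLoopAdd` is hypothetical, and `std::atomic<float>` stands in for the global-memory location.

```cpp
#include <atomic>

// Hypothetical host-side analogue of a CAS-based float atomic add.
// Returns the value stored before the addition, as atomicAdd does.
inline float casLoopAdd(std::atomic<float>& target, float value)
{
    float expected = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak writes the current value into
    // `expected`, so the sum is recomputed from fresh data on each retry.
    while (!target.compare_exchange_weak(expected, expected + value,
                                         std::memory_order_relaxed)) {
    }
    return expected;
}
```

Each retry costs another round trip to memory when several threads contend for the same address, which is consistent with the speed-up reported for the single-instruction variant.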