Status of atomicAddNoRet

al42and commented 2 years ago

Hello!

I would like to inquire about the state of the atomicAddNoRet function. It gives our code (GROMACS) a 2x speed-up in one of the kernels when running on MI100 (gfx908), compared to a plain atomicAdd (which gets compiled into a CAS-loop). So, I would really like to keep using the noret version, since the return value is anyway ignored.

However, atomicAddNoRet is marked as deprecated, and a plain atomicAdd is suggested instead (with no indications of possible performance degradation, by the way!). Could you please advise on what function should be used? I also considered using the __ockl_atomic_add_noret_f32 intrinsic directly, but it's also not documented.

We are using with ROCm 4.5.2 and hipSYCL for our code. However, the problem is easily demonstrated with the plain HIP (ROCm 4.5.2 and 5.0.0 tested):

#include "hip/hip_runtime.h"

__global__ void atomicAddKernel(float *__restrict__ a) {
  // Return value is clearly unused.
  atomicAdd(a, 1);
}

__global__ void atomicAddNoRetKernel(float *__restrict__ a) {
  // 'atomicAddNoRet' is deprecated: use atomicAdd instead
  atomicAddNoRet(a, 1);
}

int main() { return 0; }

$ hipcc --version
HIP version: 4.4.21432-f9dccde4
AMD clang version 13.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-4.5.2 21432 9bbd96fd1936641cd47defd8022edafd063019d5)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-4.5.2/llvm/bin
$ hipcc --offload-arch=gfx908 -O3 test.cpp -save-temps -Wno-deprecated-declarations

Examining the test-hip-amdgcn-amd-amdhsa-gfx908.s file, we see that _Z15atomicAddKernelPf contains a loop of global_atomic_cmpswap, while _Z20atomicAddNoRetKernelPf only has one nice little global_atomic_add_f32 call.

b-sumner commented 2 years ago

@al42and there are restrictions on when global_atomic_add_f32 can be used, so the compiler can't generate it by default. You can either get it, as you indicate, via a call to atomicAddNoRet, or by adding -munsafe-fp-atomics to the compiler options.

al42and commented 2 years ago

Thanks for the reply, @b-sumner!

So, atomicAddNoRet can be relied upon and is not going away in the near future despite being deprecated?

b-sumner commented 2 years ago

@al42and correct, it is not going away in the near future.

al42and commented 2 years ago

@b-sumner, Thank you for confirming!

One more question, if allowed by NDA: shall atomicAddNoRet be used on all hardware, or would you recommend only using it for gfx908, which has limited hardware atomics? Does it offer any benefits on gfx90a over plain atomicAdd? For gfx906, both seem to get compiled to CAS loop, so no difference here.

b-sumner commented 2 years ago

@al42and I'd suggest using atomicAddNoRet() only on gfx908. On gfx90a, only, unsafeAtomicAdd() can be used instead (and supports a return value). There is a double overload in addition to the float overload. But I still encourage -munsafe-fp-atomics so you can use the standard atomicAdd().

al42and commented 2 years ago

But I still encourage -munsafe-fp-atomics so you can use the standard atomicAdd().

The problem with this solution for me is that I'm working on a pretty large codebase. Currently, this option can be enabled just fine because the return value from atomicAdd is not used anywhere.

But introducing an option that alters a major way (different return value) the behavior of a common function globally but only on certain hardware (MI100) is very dangerous long-term. We might introduce a new kernel or add a library that relies on a standard-compliant behavior of atomicAdd, and then have a hard time finding the bug.

b-sumner commented 2 years ago

Note that If the return value is used, then the MI-100 no-return atomic add instruction won't be generated with -munsafe-fp-atomics.

al42and commented 2 years ago

Note that If the return value is used, then the MI-100 no-return atomic add instruction won't be generated with -munsafe-fp-atomics.

Oh, that's great news!

My questions are answered, but I think it might be good if the topic of FP atomic support was more elaborated in the docs. I saw some scattered mentions that they are not supported on AMD hardware, but any deeper info (e.g., that the "noret" version exists and that -munsafe-fp-atomics can improve things dramatically) is hard to discover.

b-sumner commented 2 years ago

Thanks, I'll pass this along.

pszi1ard commented 2 years ago

there are restrictions on when global_atomic_add_f32 can be used, so the compiler can't generate it by default.

@b-sumner can you please help with clarifying things a bit further, the atomics support and intrinsics are unfortunately frustratingly undocumented by AMD. Do I understand correctly that atomicAddNoRet() is not just "noret" but it also has non IEEE 754-compliant behavior? If so, it would be probably best if it is called unsafeAtomicAddNoRet() as it differs from atomicAdd not only in that it does not return.

In addition, you suggest that unsafeAtomicAdd() does not work on gfx908. Would it not make more sense to allow unsafeAtomicAdd() on all uarch and let the compiler infer whether the return value is used or not (and if it is not emit __ockl_atomic_add_noret_f32)?

b-sumner commented 2 years ago

@pszi1ard, the documentation issue is known and steps are being taken to improve it.

Regarding IEEE 754 compilance, note that C++20 states that he floating-point environment for atomic arithmetic operations on floating-point may be different than the calling thread’s floating-point environment. You should already be aware that GPU hardware floating point atomics frequently flush subnormal values to 0 and may have other differences.

But the main issue here is that for the devices that support them, non shared memory atomic floating point add is implemented in the device L2 cache and if the pointed-to memory is not cacheable, the add may have no effect. The compiler has no control over where the pointer is pointing, so it is up to the developer to assert that they accept this behavior either by using atomicAddNoRet (gfx908) or unsafeAtomicAdd (gfx90a) or use -munsafe_fp_atomics.

ROCm / hipamd

Status of atomicAddNoRet #19