Open Quuxplusone opened 6 years ago
| | |
|------|------|
| Bugzilla Link | PR37716 |
| Status | NEW |
| Importance | P enhancement |
| Reported by | Mathias Stearn (redbeard0531@gmail.com) |
| Reported on | 2018-06-06 13:12:33 -0700 |
| Last modified on | 2021-05-19 14:08:27 -0700 |
| Version | trunk |
| Hardware | PC Linux |
| CC | bigcheesegs@gmail.com, bruno.cardoso@gmail.com, comexk@gmail.com, dgregor@apple.com, ditaliano@apple.com, hfinkel@anl.gov, llvm-bugs@lists.llvm.org, llvm-bugzilla@jfbastien.com, llvm-dev@redking.me.uk, post+llvm@ralfj.de |
| Fixed by commit(s) | |
| Attachments | |
| Blocks | |
| Blocked by | |
| See also | PR46719 |
We already lower consecutive fences (https://reviews.llvm.org/D29314) so the request is not unreasonable. If somebody wants to take a stab at this, it should follow the same pattern I used there (i.e. a peephole in instcombine, or instSimplify as this doesn't introduce new instructions).
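For illustration, the source-level shape of the redundancy that peephole removes (a sketch; the actual transform runs on LLVM IR `fence` instructions):

```c++
#include <atomic>

void redundant_fences() {
    std::atomic_thread_fence(std::memory_order_acquire);
    // An adjacent fence with no intervening memory operation is redundant;
    // the peephole keeps only a single acquire fence.
    std::atomic_thread_fence(std::memory_order_acquire);
}
```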
FWIW, the ARM backend has a pass for strength-reducing atomic intrinsics as well, although I think these transformations should be target-independent.
We can definitely do this (see http://wg21.link/n4455), but is this something that you've seen happen in real code? I totally believe it can happen, but I'm not super interested in optimizing code which never happens :-)
Also, consider http://wg21.link/P0062 when doing any of this.
This seems to be affecting performance in some real Rust code. The code has an atomic pointer field where the least significant bit is used to store a flag. The pointer value can change, but all writers are guaranteed to preserve the value of the lowest bit. Thus, the function that reads that flag can use a relaxed atomic load. However, the author instead opted to use a non-atomic load, even though that's UB, because "simple benchmarks show up to a 10% slowdown using a Relaxed atomic load on x86". This is probably because multiple redundant loads are not being merged.
Details at: https://internals.rust-lang.org/t/bit-wise-reasoning-for-atomic-accesses/8853
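A minimal sketch of that pattern (the original is Rust; rendered here in C++ with hypothetical names): writers may change the pointer bits but always preserve bit 0, so the flag reader needs only atomicity, not ordering:

```c++
#include <atomic>
#include <cstdint>

// Bit 0 of ptr_and_flag stores a flag; every writer preserves it.
std::atomic<std::uintptr_t> ptr_and_flag;

bool flag_is_set() {
    // A relaxed load suffices: no ordering is required, only atomicity.
    // The reported slowdown suggests repeated loads like this are not
    // merged the way ordinary non-atomic loads would be.
    return (ptr_and_flag.load(std::memory_order_relaxed) & 1) != 0;
}
```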
Also, although this isn't real code, I can easily imagine this coming up with "init once"-type loads. For example, in this C++ code:
```c++
extern int calculate_magic_number();

static int multiply_by_magic_number(int a) {
    static int magic = calculate_magic_number();
    return a * magic;
}

int get_val() {
    return multiply_by_magic_number(2) + multiply_by_magic_number(3);
}
```
`multiply_by_magic_number` uses an atomic load to test whether `magic` was already initialized. This is then inlined twice into `get_val`. Ideally, we would only need one load of the guard variable, at least in the fast path where it has already been initialized.
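A sketch of the lowering after inlining (hypothetical `guard`/`magic` globals standing in for the compiler-generated guard variable and the cached result, slow path elided): the second acquire load of the guard is the one we would like to fold into the first:

```c++
#include <atomic>

extern std::atomic<int> guard; // stand-in for the compiler-generated guard
extern int magic;              // the lazily initialized value
void init_magic();             // slow path: runs the initializer once

int get_val() {
    // Inlined copy #1:
    if (guard.load(std::memory_order_acquire) == 0)
        init_magic();
    int r1 = 2 * magic;
    // Inlined copy #2 performs a second acquire load of the same guard;
    // on the fast path it is redundant with the first.
    if (guard.load(std::memory_order_acquire) == 0)
        init_magic();
    int r2 = 3 * magic;
    return r1 + r2;
}
```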
Other missing optimizations on atomics:
Dead store elimination:
```c++
#include <atomic>

void test() {
    std::atomic<int> dummy;
    dummy.store(1, std::memory_order_relaxed);
}
```

```asm
test():
        mov DWORD PTR [rsp-4], 1
        ret
```
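Since `dummy` is a local that never escapes, the store is dead; the expected codegen would be simply:

```asm
test():
        ret
```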
Removing unused loads:
```c++
#include <atomic>

void test2(std::atomic<int> &a) {
    a.load(std::memory_order_relaxed);
}
```

```asm
test2(std::atomic<int>&):
        mov eax, DWORD PTR [rdi]
        ret
```
In some cases it might be necessary to leave a fence or compiler barrier, but
the load instruction itself can still usually be removed. Relaxed loads can be
completely removed.
> Relaxed loads can be completely removed.
Is that true? The relaxed load might read from a relaxed store that comes
after a release fence, and there might be an acquire fence on our side later --
and that way, a relaxed operation *can* take part in synchronization.
Hmm… I thought it was true. From an implementation perspective it seems like
it should be true; it seems unlikely that removing an unused load (of non-MMIO
memory) could actually change behavior. But from a spec perspective, after
trying to come up with a proof that it’s true, I’ve instead convinced myself
that it’s not! There’s one edge case where the semantics of
atomic_thread_fence are slightly too weak to allow removing relaxed loads.
Suppose that thread A atomically stores a new value to a variable, using either
a release write or a weaker write preceded by a release fence. Thread B
performs a relaxed load from the variable, but ignores the result (which is why
we want to remove the load); it then performs an acquire fence. *If* the load
observes the new value, then thread A’s release operation synchronizes-with
thread B’s fence.
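As a sketch, the two-thread shape of that scenario:

```c++
#include <atomic>

std::atomic<int> x{0};

void thread_A() {
    x.store(1, std::memory_order_release);
}

void thread_B() {
    x.load(std::memory_order_relaxed);  // result ignored; removal candidate
    std::atomic_thread_fence(std::memory_order_acquire);
    // If the load observed 1, A's release store synchronizes-with the
    // fence, making everything sequenced-before the store visible here.
}
```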
The program cannot directly determine whether or not the load observes the new
value, because it ignores the loaded value. Therefore, in a given execution,
if the load was not required to observe the new value, we can just say it
didn’t. In that case, the fence doesn’t create a synchronization requirement,
so removing the load is definitely safe.
That leaves cases where the load *was* required to observe the new value.
There are two reasons why that might happen:
1. Because some other synchronization has caused thread A’s store to become a
visible side-effect in thread B. For example, there could be a store-release
in thread A after the store in question, paired with a load-acquire in thread B
before the load in question (see the sketch after these two cases).
In that case, that other synchronization already provides a superset of what we
want from thread B’s fence. (If we switch from release/acquire to seq_cst,
things get a little more complicated, but removing the load should still be
okay.)
2. Because of read-read coherence: there was a previous atomic load which
happened-before the one at issue and already observed the new value.
*If* that previous atomic load was also on thread B, then it can pair with
thread B’s fence, as a substitute for the load we want to remove, again
resulting in a superset of the synchronization we need.
But if it was on a different thread, it can’t. According to both the C++ and
LLVM specs, an atomic read can only pair with a fence if it is sequenced-before
the fence, which requires being on the same thread. It’s not enough to just
happen-before the fence.
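A sketch of case 1 (with a hypothetical `flag` providing the other synchronization; the cross-thread flavor of case 2 is the three-thread example below):

```c++
#include <atomic>

std::atomic<int> x{0}, flag{0};

void thread_A() {
    x.store(1, std::memory_order_relaxed);    // the store in question
    flag.store(1, std::memory_order_release); // store-release after it
}

void thread_B() {
    while (flag.load(std::memory_order_acquire) == 0) {} // load-acquire first
    // flag's release/acquire pairing already makes x's store a visible
    // side effect here, a superset of what the fence pairing would give,
    // so this unused load can be removed:
    x.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
}
```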
C++17 32.9.4:
> An atomic operation A that is a release operation on an atomic object M synchronizes with an acquire fence B if there exists some atomic operation X on M such that X is sequenced before B and reads the value written by A or a value written by any side effect in the release sequence headed by A.
LLVM:
> A fence A which has (at least) release ordering semantics synchronizes with a fence B with (at least) acquire ordering semantics if and only if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M (either directly or through some side effect of a sequence headed by X), Y is sequenced before B, and Y observes M.
Therefore, in the following program, if `r_sc == 1`, thread A has synchronized-with thread B, but that wouldn't be true if the unused load were removed:

```c++
// Thread A:
x.store(1, memory_order_release);

// Thread B:
int r_sc = side_channel.load(memory_order_acquire);
// Unused relaxed load:
x.load(memory_order_relaxed);
// …that might synchronize with this fence:
atomic_thread_fence(memory_order_acquire);

// Thread C:
int r_x = x.load(memory_order_relaxed);
side_channel.store(r_x, memory_order_release);
```
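For reference, a self-contained rendering of that litmus test (a sketch: `data` is a plain variable added to witness the synchronization; a model checker such as CDSChecker is needed to actually explore the weak behaviors):

```c++
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, side_channel{0};
int data = 0;  // non-atomic; visible to B only through synchronization

void thread_a() {
    data = 42;
    x.store(1, std::memory_order_release);
}

void thread_c() {
    int r_x = x.load(std::memory_order_relaxed);
    side_channel.store(r_x, std::memory_order_release);
}

void thread_b() {
    int r_sc = side_channel.load(std::memory_order_acquire);
    x.load(std::memory_order_relaxed);  // the unused load
    std::atomic_thread_fence(std::memory_order_acquire);
    if (r_sc == 1) {
        // C read x == 1, and C's read happens-before B's relaxed load
        // (via side_channel), so by read-read coherence that load also
        // reads 1 and A's store synchronizes-with the fence.
        assert(data == 42);  // guaranteed only while the unused load exists
    }
}

int main() {
    std::thread a(thread_a), c(thread_c), b(thread_b);
    a.join(); c.join(); b.join();
}
```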
I verified this using CDSChecker with the following test case:
https://gist.github.com/comex/4bdb6c40bd31cea52f860b3bb057f0d2
This is annoying. Relaxed atomics can be seen as a sort of ‘zero-overhead
abstraction’ that expose the atomicity guarantees afforded to all accesses by
the underlying architecture. Not being able to perform optimizations makes
them less suitable for that role. Maybe we need a new, weaker ordering…
On the other hand, good news: Clang actually does remove useless atomic local
variables, as in my first example in comment 4. When I was testing it in
Compiler Explorer I accidentally picked GCC. *facepalm*
Interesting, I didn't think coming up with a counterexample would be quite so hard. E.g. in this paper they just say:
> In contrast, eliminating relaxed or acquire reads is not generally sound because it may remove synchronization.