Open Quuxplusone opened 6 years ago
| | |
|------|------|
| Bugzilla Link | PR37716 |
| Status | NEW |
| Importance | P enhancement |
| Reported by | Mathias Stearn (redbeard0531@gmail.com) |
| Reported on | 2018-06-06 13:12:33 -0700 |
| Last modified on | 2021-05-19 14:08:27 -0700 |
| Version | trunk |
| Hardware | PC Linux |
| CC | bigcheesegs@gmail.com, bruno.cardoso@gmail.com, comexk@gmail.com, dgregor@apple.com, ditaliano@apple.com, hfinkel@anl.gov, llvm-bugs@lists.llvm.org, llvm-bugzilla@jfbastien.com, llvm-dev@redking.me.uk, post+llvm@ralfj.de |
| Fixed by commit(s) | |
| Attachments | |
| Blocks | |
| Blocked by | |
| See also | PR46719 |
We already lower consecutive fences (https://reviews.llvm.org/D29314) so the request is not unreasonable. If somebody wants to take a stab at this, it should follow the same pattern I used there (i.e. a peephole in instcombine, or instSimplify as this doesn't introduce new instructions).
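For illustration, the source-level shape of the redundancy that peephole removes (a sketch; the actual transform runs on LLVM IR `fence` instructions):

```c++
#include <atomic>

void redundant_fences() {
    std::atomic_thread_fence(std::memory_order_acquire);
    // An adjacent fence with no intervening memory operation is redundant;
    // the peephole keeps only a single acquire fence.
    std::atomic_thread_fence(std::memory_order_acquire);
}
```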
FWIW, the ARM backend has a pass for strength-reducing atomic intrinsics as well, although I think these transformations should be target-independent.
We can definitely do this (see http://wg21.link/n4455), but is this something that you've seen happen in real code? I totally believe it can happen, but I'm not super interested in optimizing code which never happens :-)
Also, consider http://wg21.link/P0062 when doing any of this.
This seems to be affecting performance in some real Rust code. The code has an atomic pointer field where the least significant bit is used to store a flag. The pointer value can change, but all writers are guaranteed to preserve the value of the lowest bit. Thus, the function that reads that flag can use a relaxed atomic load. However, the author instead opted to use a non-atomic load, even though that's UB, because "simple benchmarks show up to a 10% slowdown using a Relaxed atomic load on x86". This is probably because multiple redundant loads are not being merged.
Details at: https://internals.rust-lang.org/t/bit-wise-reasoning-for-atomic-accesses/8853
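A minimal sketch of that pattern (the original is Rust; rendered here in C++ with hypothetical names): writers may change the pointer bits but always preserve bit 0, so the flag reader needs only atomicity, not ordering:

```c++
#include <atomic>
#include <cstdint>

// Bit 0 of ptr_and_flag stores a flag; every writer preserves it.
std::atomic<std::uintptr_t> ptr_and_flag;

bool flag_is_set() {
    // A relaxed load suffices: no ordering is required, only atomicity.
    // The reported slowdown suggests repeated loads like this are not
    // merged the way ordinary non-atomic loads would be.
    return (ptr_and_flag.load(std::memory_order_relaxed) & 1) != 0;
}
```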
Also, although this isn't real code, I can easily imagine this coming up with "init once"-type loads. For example, in this C++ code:
```c++
extern int calculate_magic_number();

static int multiply_by_magic_number(int a) {
    static int magic = calculate_magic_number();
    return a * magic;
}

int get_val() {
    return multiply_by_magic_number(2) + multiply_by_magic_number(3);
}
```
`multiply_by_magic_number` uses an atomic load to test whether `magic` was already initialized. This is then inlined twice into `get_val`. Ideally, we would only need one load of the guard variable, at least in the fast path where it has already been initialized.
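A sketch of the lowering after inlining (hypothetical `guard`/`magic` globals standing in for the compiler-generated guard variable and the cached result, slow path elided): the second acquire load of the guard is the one we would like to fold into the first:

```c++
#include <atomic>

extern std::atomic<int> guard; // stand-in for the compiler-generated guard
extern int magic;              // the lazily initialized value
void init_magic();             // slow path: runs the initializer once

int get_val() {
    // Inlined copy #1:
    if (guard.load(std::memory_order_acquire) == 0)
        init_magic();
    int r1 = 2 * magic;
    // Inlined copy #2 performs a second acquire load of the same guard;
    // on the fast path it is redundant with the first.
    if (guard.load(std::memory_order_acquire) == 0)
        init_magic();
    int r2 = 3 * magic;
    return r1 + r2;
}
```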
Other missing optimizations on atomics:
Dead store elimination:
```c++
#include <atomic>

void test() {
    std::atomic<int> dummy;
    dummy.store(1, std::memory_order_relaxed);
}
```

```asm
test():
        mov DWORD PTR [rsp-4], 1
        ret
```
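Since `dummy` is a local that never escapes, the store is dead; the expected codegen would be simply:

```asm
test():
        ret
```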
Removing unused loads:
```c++
#include <atomic>

void test2(std::atomic<int> &a) {
    a.load(std::memory_order_relaxed);
}
```

```asm
test2(std::atomic<int>&):
        mov eax, DWORD PTR [rdi]
        ret
```
In some cases it might be necessary to leave a fence or compiler barrier, but
the load instruction itself can still usually be removed. Relaxed loads can be
completely removed.
> Relaxed loads can be completely removed.
Is that true? The relaxed load might read from a relaxed store that comes
after a release fence, and there might be an acquire fence on our side later --
and that way, a relaxed operation *can* take part in synchronization.
Hmm… I thought it was true. From an implementation perspective it seems like
it should be true; it seems unlikely that removing an unused load (of non-MMIO
memory) could actually change behavior. But from a spec perspective, after
trying to come up with a proof that it’s true, I’ve instead convinced myself
that it’s not! There’s one edge case where the semantics of
atomic_thread_fence are slightly too weak to allow removing relaxed loads.
Suppose that thread A atomically stores a new value to a variable, using either
a release write or a weaker write preceded by a release fence. Thread B
performs a relaxed load from the variable, but ignores the result (which is why
we want to remove the load); it then performs an acquire fence. *If* the load
observes the new value, then thread A’s release operation synchronizes-with
thread B’s fence.
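As a sketch, the two-thread shape of that scenario:

```c++
#include <atomic>

std::atomic<int> x{0};

void thread_A() {
    x.store(1, std::memory_order_release);
}

void thread_B() {
    x.load(std::memory_order_relaxed);  // result ignored; removal candidate
    std::atomic_thread_fence(std::memory_order_acquire);
    // If the load observed 1, A's release store synchronizes-with the
    // fence, making everything sequenced-before the store visible here.
}
```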
The program cannot directly determine whether or not the load observes the new
value, because it ignores the loaded value. Therefore, in a given execution,
if the load was not required to observe the new value, we can just say it
didn’t. In that case, the fence doesn’t create a synchronization requirement,
so removing the load is definitely safe.
That leaves cases where the load *was* required to observe the new value.
There are two reasons why that might happen:
1. Because some other synchronization has caused thread A’s store to become a
visible side-effect in thread B. For example, there could be a store-release
in thread A after the store in question, paired with a load-acquire in thread B
before the load in question (see the sketch after these two cases).
In that case, that other synchronization already provides a superset of what we
want from thread B’s fence. (If we switch from release/acquire to seq_cst,
things get a little more complicated, but removing the load should still be
okay.)
2. Because of read-read coherence: there was a previous atomic load which
happened-before the one at issue and already observed the new value.
*If* that previous atomic load was also on thread B, then it can pair with
thread B’s fence, as a substitute for the load we want to remove, again
resulting in a superset of the synchronization we need.
But if it was on a different thread, it can’t. According to both the C++ and
LLVM specs, an atomic read can only pair with a fence if it is sequenced-before
the fence, which requires being on the same thread. It’s not enough to just
happen-before the fence.
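A sketch of case 1 (with a hypothetical `flag` providing the other synchronization; the cross-thread flavor of case 2 is the three-thread example below):

```c++
#include <atomic>

std::atomic<int> x{0}, flag{0};

void thread_A() {
    x.store(1, std::memory_order_relaxed);    // the store in question
    flag.store(1, std::memory_order_release); // store-release after it
}

void thread_B() {
    while (flag.load(std::memory_order_acquire) == 0) {} // load-acquire first
    // flag's release/acquire pairing already makes x's store a visible
    // side effect here, a superset of what the fence pairing would give,
    // so this unused load can be removed:
    x.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
}
```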
C++17 32.9.4:
> An atomic operation A that is a release operation on an atomic object M synchronizes with an acquire fence B if there exists some atomic operation X on M such that X is sequenced before B and reads the value written by A or a value written by any side effect in the release sequence headed by A.
LLVM:
> A fence A which has (at least) release ordering semantics synchronizes with a fence B with (at least) acquire ordering semantics if and only if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M (either directly or through some side effect of a sequence headed by X), Y is sequenced before B, and Y observes M.
Therefore, in the following program, if `r_sc == 1`, thread A has synchronized-with thread B, but that wouldn't be true if the unused load were removed:

```c++
// Thread A:
x.store(1, memory_order_release);

// Thread B:
int r_sc = side_channel.load(memory_order_acquire);
// Unused relaxed load:
x.load(memory_order_relaxed);
// …that might synchronize with this fence:
atomic_thread_fence(memory_order_acquire);

// Thread C:
int r_x = x.load(memory_order_relaxed);
side_channel.store(r_x, memory_order_release);
```
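For reference, a self-contained rendering of that litmus test (a sketch: `data` is a plain variable added to witness the synchronization; a model checker such as CDSChecker is needed to actually explore the weak behaviors):

```c++
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, side_channel{0};
int data = 0;  // non-atomic; visible to B only through synchronization

void thread_a() {
    data = 42;
    x.store(1, std::memory_order_release);
}

void thread_c() {
    int r_x = x.load(std::memory_order_relaxed);
    side_channel.store(r_x, std::memory_order_release);
}

void thread_b() {
    int r_sc = side_channel.load(std::memory_order_acquire);
    x.load(std::memory_order_relaxed);  // the unused load
    std::atomic_thread_fence(std::memory_order_acquire);
    if (r_sc == 1) {
        // C read x == 1, and C's read happens-before B's relaxed load
        // (via side_channel), so by read-read coherence that load also
        // reads 1 and A's store synchronizes-with the fence.
        assert(data == 42);  // guaranteed only while the unused load exists
    }
}

int main() {
    std::thread a(thread_a), c(thread_c), b(thread_b);
    a.join(); c.join(); b.join();
}
```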
I verified this using CDSChecker with the following test case:
https://gist.github.com/comex/4bdb6c40bd31cea52f860b3bb057f0d2
This is annoying. Relaxed atomics can be seen as a sort of ‘zero-overhead
abstraction’ that expose the atomicity guarantees afforded to all accesses by
the underlying architecture. Not being able to perform optimizations makes
them less suitable for that role. Maybe we need a new, weaker ordering…
On the other hand, good news: Clang actually does remove useless atomic local
variables, as in my first example in comment 4. When I was testing it in
Compiler Explorer I accidentally picked GCC. *facepalm*
Interesting, I didn't think coming up with a counterexample would be quite so hard. E.g. in this paper they just say:
> In contrast, eliminating relaxed or acquire reads is not generally sound because it may remove synchronization.